Open CatChenal opened 10 months ago
What happens with the two "parallel" outputs: the .npz file for the Chl sites E shifts and the .csv "matrix" file? About the latter: is having a cvs file a hard requirement? (A python dict can also be saved as a .npy file, then loaded and passed to a pandas.DataFrame directly.)
A from @granepura:
Sure, that is totally fine with me if you want the file as a npy file. I just made the vector script into a cvs file because it was easiest thing for me to do with my remedial coding knowledge. The output looked like this, which I found to be straightforward and easy to work with.
Working with pandas.DataFrame in the end would be easier for me to work with regarding the next steps of the project.
It's fine to use pandas, but I would like to avoid adding an external package requirement if it's not absolutely necessary. You could still use it in your final analysis. The following code block (delimited by bacticks: ```), shows how you would access the data:
import numpy as np
import pandas as pd
npy_file = "free_residue_matrix.npy" # stores the pickled dict
# PRE: dict was saved using:
# np.save(npy_file, free_residue_matrix=free_residue_matrix_dict, allow_pickle=True)
matrix_dict = np.load(npy_file, allow_pickle=True)["free_residue_matrix"]
df = pd.DataFrame.from_dict(matrix_dict)
[...]
@granepura
The modules VectorMS.py
and VectorMS_free.py
differ by one line.
I propose to use only one and create a proper cli with a flag acting as a switch between the two desired outcomes, e.g.: The command below would output the matrix for free residues as "free_residue_matrix.csv"
VectorMS.py --free_residues_only
While the command without the flag would not perform any filtering and create "residue_matrix.csv":
VectorMS.py
Yes, I believe this is much more appropriate.
TODO: Establish package dependencies in pymembrane.
Notes on local installation of (private) repo pymembrane:
git switch temp_mohamed
#switch to most up to date repopip install -e .
@granepura There are five main steps in the "mcce_cdc_pipeline":
Is it likely that you would want to execute them one by one? (The current option perform 1 through 5 with the arguments provided at the command line.)
@granepura When you run vectorMS.py where do you expect the .csv file to be created?
@granepura There are five main steps in the "mcce_cdc_pipeline":
- get ms sample
- get pdbs from sample
- convert pdbs
- output ms_matrix
- calculate energy shifts
Is it likely that you would want to execute them one by one? (The current option perform 1 through 5 with the arguments provided at the command line.)
Yes, exactly each of those should be separate executions that follow in that order.
@granepura When you run vectorMS.py where do you expect the .csv file to be created?
This looks like the the step for (4). out ms_matrix It should be put in the same folder in which the converted pdbs are located.
https://github.com/CatChenal/mcce_dev/issues/10#issuecomment-1832752718 OK, that what I thought!
Yes, exactly each of those should be separate executions that follow in that order.
Oops! I have to amend the splitting of the steps: 1 & 2 cannot be uncoupled as they use the conformers and the MSout class from /bin/ms_analysis.py, which is a processing bottleneck, thus we now have:
Here is an example of what the cli commands would look like:
e.g.: sub-command ms_to_pdbs :: 1.
> mcce_cdc_pipeline.py ms_to_pdbs [args for ms to pdbs]
e.g.: sub-command convert_pdbs :: 2.
> mcce_cdc_pipeline.py convert_pdbs [args for convert pdbs]
e.g.: sub-command matrix_and_shifts :: 3.+4.
> mcce_cdc_pipeline.py matrix_and_shifts [mutually exclusive arg for matrix and shifts calculation]
For the last group, the args choices could be --do_both
(default), --do_matrix
or --do_shifts
, but only a single one would be accepted.
Again, right now we have mcce_cdc_pipeline.py [necessary args]
:: do all steps.
That is what I suspected. At first I assumed step 1 would be sampling you do, so that if you run step2 you use the same sample set. If this is true it doesn't matter if they are not coupled.
However, if you were to run multiple step1s, and you get a different sampling set, then step (1) and step (2) must be uncoupled.
What is this case?
Let me clarify: In step1, the steps "get ms sample" and "pdbs from sample" are coupled because the "ms sample" is not saved to file, the pdbs are created straight away, i.e., we have this:
get ms sample -> iter( [get conformers for one ms -> write pdb from conformers]) for each ms in sample.
So, the final product of step1 is the populated ms_pdbs folder.
step1 will produce the same output as long as the same 'deterministic' sampling parameters are used (obviously, it's never the case if the sampling kind is 'random'). Anyway, I have not implementing logging to file, so to repeat an identical step1, you would have to consult you command history, or "the notes you made when you ran it the first time" ;-)
@granepura, @melrefaiy2018
I need confirmation/correction on the following in cdc_mcce_microstates_Oct19.py
:
dict_cofactors
is unusedextended_pdb_name
is unusedlist_ligands
is unused, but passed onto md_object.save_dict_2_txt
the site energy is calculated for only one ligand/cofactor: 'CLA', which is hard-coded
Also, I would like an example of the filename produced by this:
# parsed_dir = mcce_dir.joinpath("parsed_pdb_output_mc")
# md_scratch_dir = parsed_dir.joinpath(f"CDC_{pdb_name}")
md_object.save_dict_2_txt(
dict_data = occ_dict,
text_file_name = f"{md_scratch_dir}cdc_results/dict_site_energy_shift_most_occ_"f"{list_ligands=}",
)
Thanks.
https://github.com/CatChenal/mcce_dev/issues/10#issuecomment-1834587551 Answering my own question:
That's what the file name in text_file_name
looks like that:
"/path/to/scratch/cdc_results/dict_site_energy_shift_most_occ_list_cofactors=['CLA', 'CLB', 'BCR', 'SQD', 'HOH', 'MEM']"
Really.
Hey Cat! Hope you're doing well. I'm still waiting for Gehan to pass over the cdc_mcce_microstates_Oct19.py file. Once I get my hands on it, I'll be all set to help you out. Any chance you could shoot it my way? Also, I'm curious about your goals in this project. What are you aiming for? Just want to make sure I can assist you the best way possible. By the way, the script is still a work in progress and not quite ready to be shared, so this is why you are getting lots of errors at the moment—I can whip up a custom script for you to save some time, if I understood what you are looking for exactly.
Thanks.
I have combined all the code necessary to run the "MCCE to CDC pipeline" into one cli module, which is now under testing.
The problem, which I knew would happen when using an alpha-level, unpublished code base as a dependency, is that you have changed the temp_mohamed
branch we used to hack-install pymembrane
. Specifically, the run_cdc
module we were using has been deleted (pymembrane.util.CDC.run_cdc).
I believe both @granepura and myself have a copy of your branch predating the changes, so we can use that, but it's a blow to reproducibility.
I see. I agree with you that this is a problem. I think the best way is to try to depend in the main functions. In other words, most of the function I have build in the util directory are subject to changes, so you can have your own util version as this util directory is just meant to facilitate my work in a way that maybe won't be useful to your work. Second point, I believe the following lines are not going to change in the near future:
# load the extended pdb file into the protein atomic object:
protein_atomic = ProteinAtomic(f'{dir}{filename}', "Isia", center_atomic=False, set_atomic_charges=True)
protein_atomic.prepare_pigments('CLA', ChlorophyllAtomic,
q_0011_charges_dict='CLA_mcce',
qy_atoms=('N1B', 'N1D'),
qx_atoms=('N1A', 'N1C'))
# Calculate the shift in the site energy
dict_site_energy_shift_mcce, dict_total_contribution_mcce = calculate_cdc_site_shift(
protein_atomic, dielectric_eff=2.5, protein_contribution=True)
# calculate the total site energy :
dict_site_energy_mcc = calculate_total_site_energy(dict_site_energy_shift_mcce, 14950, 15674)
Now, I think this part will be no longer used since it was meant to use gromacs to generate the partial charges of the protein and now the mcce pdb file should contains the partial charges coming from amber topology:
md_object.prepare_minimization(pdb_path=pdb_path,
# protein_force_field='amber99',
# list_ligands=list_ligands,
# dict_update_residue={'D0003': 'ASP',
# 'D0179': 'ASP',
# 'C0179': 'ASP',
# 'C0003': 'ASP',
# 'B0179': 'ASP',
# 'B0004': 'GLU',
# 'A0179': 'ASP'},
# water_forceField='tip4p',
# ignh=False,
# minimize_ligand=False,
# run_minimization=False
# )
# md_object.build_extended_pdb(dict_cofactors)
Now, the DynamicStructure class, it's not clear for me how I am going to change it but currently I am using it as an interface to access some important method like copying file or saving data stuff like this, so you can depend on the generic method.
Finally, in the latest push on mohamed_temp branch, I tried to organize the files in the util directory in one place, delete unused file and rename files. For example, now we have pigemnts_paramteres, that will have all the data you need related to the pigments. so I believe you should use this version when it's possible.
I hope this answer most of your question, let me know if you still have any other questions.
Thanks for the details.
I had removed all the commented lines in the original code, so we never had the second code block.
To go back to the original question, and please correct me if I'm wrong, what you are saying is that the code in util/CDC/run_cdc.py
was put together for @granepura's benefit, but you have no use for it in pymembrane
, correct?
Aside:
Now, the DynamicStructure class, it's not clear for me how I am going to change it but currently I am using it as an interface to access some important method like copying file or saving data stuff like this, so you can depend on the generic method.
Even though this is not the place to discuss your repo organization or development roadmap, a simple way to get rid of the above problem is to dissociate IO operations from the class. For example:
# define io functions that you can reuse independently from the class:
# e.g. in pymembrane/io_funcs.py
def do_io1(dir: str):
pass
[...]
# in module where class is defined:
from pymembrane import io_funcs as iof
[...]
class A
def __init__():
[...]
def io1(self, dir:str):
iof.do_io1(dir)
[...]
This way, you could get rid of the class and still have access to the io functions in io_funcs.py
.
I think the best way is to try to depend in the main functions
The lack of module/class/function documentation and the repeated use of (multiple) global imports (from X import *
), along with the import of unused modules, make it very difficult to trace the objects origin or determine what constitutes "a main function".
EDIT:
Oops! That statement above was too hasty: my bad! There are docstrings, but few __init__
methods or functions have them.
I've sorted out which imports we need for microstates_sites_energies
function: we no longer need run_cdc.py
either.
@granepura Would it make sense to return both kinds of microstates matrices?
@granepura Would it make sense to return both kinds of microstates matrices?
Yes, I would return both matrices. They both don’t take very long to make, so might as well make them!
Super!
@granepura Should the gromacs pdbs still contain 'CLA'?
@granepura Should the gromacs pdbs still contain 'CLA'?
Yes it needs to still be called CLA.
Current workflow provided by @granepura: