CatChenal / mcce_dev

0 stars 0 forks source link

MCCE_CDC workflow #10

Open CatChenal opened 10 months ago

CatChenal commented 10 months ago

Current workflow provided by @granepura: mcce_cdc_workflow

CatChenal commented 10 months ago

What happens with the two "parallel" outputs: the .npz file for the Chl sites E shifts and the .csv "matrix" file? About the latter: is having a cvs file a hard requirement? (A python dict can also be saved as a .npy file, then loaded and passed to a pandas.DataFrame directly.)

CatChenal commented 10 months ago

A from @granepura:

Sure, that is totally fine with me if you want the file as a npy file. I just made the vector script into a cvs file because it was easiest thing for me to do with my remedial coding knowledge. The output looked like this, which I found to be straightforward and easy to work with.

gehan_matrix_cvs

Working with pandas.DataFrame in the end would be easier for me to work with regarding the next steps of the project.

CatChenal commented 10 months ago

It's fine to use pandas, but I would like to avoid adding an external package requirement if it's not absolutely necessary. You could still use it in your final analysis. The following code block (delimited by bacticks: ```), shows how you would access the data:

import numpy as np
import pandas as pd

npy_file = "free_residue_matrix.npy"  # stores the pickled dict

# PRE: dict was saved using:
# np.save(npy_file, free_residue_matrix=free_residue_matrix_dict, allow_pickle=True)

matrix_dict = np.load(npy_file, allow_pickle=True)["free_residue_matrix"]
df = pd.DataFrame.from_dict(matrix_dict)
[...]
CatChenal commented 10 months ago

@granepura The modules VectorMS.py and VectorMS_free.py differ by one line.

I propose to use only one and create a proper cli with a flag acting as a switch between the two desired outcomes, e.g.: The command below would output the matrix for free residues as "free_residue_matrix.csv"

VectorMS.py --free_residues_only

While the command without the flag would not perform any filtering and create "residue_matrix.csv":

VectorMS.py
granepura commented 10 months ago

Yes, I believe this is much more appropriate.

CatChenal commented 10 months ago

TODO: Establish package dependencies in pymembrane.

CatChenal commented 10 months ago

Notes on local installation of (private) repo pymembrane:

  1. clone repo
  2. git switch temp_mohamed #switch to most up to date repo
  3. Add init.py to util/CDC/ and util/mcce_util/
  4. pip install -e .
  5. Test: importing an object from the command line python interpreter will likely failed (for yet unknown reasons), but importing in a notebook cell will work.
CatChenal commented 10 months ago

@granepura There are five main steps in the "mcce_cdc_pipeline":

  1. get ms sample
  2. get pdbs from sample
  3. convert pdbs
  4. output ms_matrix
  5. calculate energy shifts

Is it likely that you would want to execute them one by one? (The current option perform 1 through 5 with the arguments provided at the command line.)

CatChenal commented 10 months ago

@granepura When you run vectorMS.py where do you expect the .csv file to be created?

granepura commented 10 months ago

@granepura There are five main steps in the "mcce_cdc_pipeline":

  1. get ms sample
  2. get pdbs from sample
  3. convert pdbs
  4. output ms_matrix
  5. calculate energy shifts

Is it likely that you would want to execute them one by one? (The current option perform 1 through 5 with the arguments provided at the command line.)

Yes, exactly each of those should be separate executions that follow in that order.

granepura commented 10 months ago

@granepura When you run vectorMS.py where do you expect the .csv file to be created?

This looks like the the step for (4). out ms_matrix It should be put in the same folder in which the converted pdbs are located.

CatChenal commented 10 months ago

https://github.com/CatChenal/mcce_dev/issues/10#issuecomment-1832752718 OK, that what I thought!

CatChenal commented 10 months ago

Yes, exactly each of those should be separate executions that follow in that order.

Oops! I have to amend the splitting of the steps: 1 & 2 cannot be uncoupled as they use the conformers and the MSout class from /bin/ms_analysis.py, which is a processing bottleneck, thus we now have:

  1. get ms sample and pdbs from sample -> sub-command
  2. convert pdbs -> sub-command
  3. output ms_matrix and 4. calculate energy shifts -> sub-command (with option to do both or just one of them).

Here is an example of what the cli commands would look like:

For the last group, the args choices could be --do_both (default), --do_matrix or --do_shifts, but only a single one would be accepted.

Again, right now we have mcce_cdc_pipeline.py [necessary args] :: do all steps.

granepura commented 10 months ago

That is what I suspected. At first I assumed step 1 would be sampling you do, so that if you run step2 you use the same sample set. If this is true it doesn't matter if they are not coupled.

However, if you were to run multiple step1s, and you get a different sampling set, then step (1) and step (2) must be uncoupled.

What is this case?

CatChenal commented 10 months ago

Let me clarify: In step1, the steps "get ms sample" and "pdbs from sample" are coupled because the "ms sample" is not saved to file, the pdbs are created straight away, i.e., we have this:

get ms sample -> iter( [get conformers for one ms -> write pdb from conformers]) for each ms in sample.

So, the final product of step1 is the populated ms_pdbs folder.

step1 will produce the same output as long as the same 'deterministic' sampling parameters are used (obviously, it's never the case if the sampling kind is 'random'). Anyway, I have not implementing logging to file, so to repeat an identical step1, you would have to consult you command history, or "the notes you made when you ran it the first time" ;-)

CatChenal commented 10 months ago

@granepura, @melrefaiy2018

I need confirmation/correction on the following in cdc_mcce_microstates_Oct19.py:

  1. dict_cofactors is unused
  2. extended_pdb_name is unused
  3. list_ligands is unused, but passed onto md_object.save_dict_2_txt
  4. the site energy is calculated for only one ligand/cofactor: 'CLA', which is hard-coded

    Also, I would like an example of the filename produced by this:

    # parsed_dir = mcce_dir.joinpath("parsed_pdb_output_mc")
    # md_scratch_dir = parsed_dir.joinpath(f"CDC_{pdb_name}")
    
    md_object.save_dict_2_txt(
        dict_data = occ_dict,
        text_file_name = f"{md_scratch_dir}cdc_results/dict_site_energy_shift_most_occ_"f"{list_ligands=}",
    )

    Thanks.

CatChenal commented 10 months ago

https://github.com/CatChenal/mcce_dev/issues/10#issuecomment-1834587551 Answering my own question:

That's what the file name in text_file_name looks like that:

"/path/to/scratch/cdc_results/dict_site_energy_shift_most_occ_list_cofactors=['CLA', 'CLB', 'BCR', 'SQD', 'HOH', 'MEM']"

Really.

melrefaiy2018 commented 10 months ago

Hey Cat! Hope you're doing well. I'm still waiting for Gehan to pass over the cdc_mcce_microstates_Oct19.py file. Once I get my hands on it, I'll be all set to help you out. Any chance you could shoot it my way? Also, I'm curious about your goals in this project. What are you aiming for? Just want to make sure I can assist you the best way possible. By the way, the script is still a work in progress and not quite ready to be shared, so this is why you are getting lots of errors at the moment—I can whip up a custom script for you to save some time, if I understood what you are looking for exactly.

CatChenal commented 10 months ago

Thanks. I have combined all the code necessary to run the "MCCE to CDC pipeline" into one cli module, which is now under testing. The problem, which I knew would happen when using an alpha-level, unpublished code base as a dependency, is that you have changed the temp_mohamed branch we used to hack-install pymembrane. Specifically, the run_cdc module we were using has been deleted (pymembrane.util.CDC.run_cdc).

I believe both @granepura and myself have a copy of your branch predating the changes, so we can use that, but it's a blow to reproducibility.

melrefaiy2018 commented 10 months ago

I see. I agree with you that this is a problem. I think the best way is to try to depend in the main functions. In other words, most of the function I have build in the util directory are subject to changes, so you can have your own util version as this util directory is just meant to facilitate my work in a way that maybe won't be useful to your work. Second point, I believe the following lines are not going to change in the near future:

# load the extended pdb file into the protein atomic object:
protein_atomic = ProteinAtomic(f'{dir}{filename}', "Isia", center_atomic=False, set_atomic_charges=True)
protein_atomic.prepare_pigments('CLA', ChlorophyllAtomic,
                                          q_0011_charges_dict='CLA_mcce',
                                          qy_atoms=('N1B', 'N1D'),
                                          qx_atoms=('N1A', 'N1C'))
​# Calculate the shift in the site energy
dict_site_energy_shift_mcce, dict_total_contribution_mcce = calculate_cdc_site_shift(
                protein_atomic, dielectric_eff=2.5, protein_contribution=True)
# calculate the total site energy :
dict_site_energy_mcc = calculate_total_site_energy(dict_site_energy_shift_mcce, 14950, 15674)

Now, I think this part will be no longer used since it was meant to use gromacs to generate the partial charges of the protein and now the mcce pdb file should contains the partial charges coming from amber topology:

md_object.prepare_minimization(pdb_path=pdb_path,
            #                                protein_force_field='amber99',
            #                                list_ligands=list_ligands,
            #                                dict_update_residue={'D0003': 'ASP',
            #                                                     'D0179': 'ASP',
            #                                                     'C0179': 'ASP',
            #                                                     'C0003': 'ASP',
            #                                                     'B0179': 'ASP',
            #                                                     'B0004': 'GLU',
            #                                                     'A0179': 'ASP'},
            #                                water_forceField='tip4p',
            #                                ignh=False,
            #                                minimize_ligand=False,
            #                                run_minimization=False
            #                                )
​
            # md_object.build_extended_pdb(dict_cofactors)

Now, the DynamicStructure class, it's not clear for me how I am going to change it but currently I am using it as an interface to access some important method like copying file or saving data stuff like this, so you can depend on the generic method.

Finally, in the latest push on mohamed_temp branch, I tried to organize the files in the util directory in one place, delete unused file and rename files. For example, now we have pigemnts_paramteres, that will have all the data you need related to the pigments. so I believe you should use this version when it's possible.

I hope this answer most of your question, let me know if you still have any other questions.

CatChenal commented 10 months ago

Thanks for the details. I had removed all the commented lines in the original code, so we never had the second code block. To go back to the original question, and please correct me if I'm wrong, what you are saying is that the code in util/CDC/run_cdc.py was put together for @granepura's benefit, but you have no use for it in pymembrane, correct?

Aside:

Now, the DynamicStructure class, it's not clear for me how I am going to change it but currently I am using it as an interface to access some important method like copying file or saving data stuff like this, so you can depend on the generic method.

Even though this is not the place to discuss your repo organization or development roadmap, a simple way to get rid of the above problem is to dissociate IO operations from the class. For example:

# define io functions that you can reuse independently from the class:
# e.g. in pymembrane/io_funcs.py
def do_io1(dir: str):
    pass

[...]

# in module where class is defined:
from pymembrane import io_funcs as iof
[...]
class A
    def __init__():
        [...]

    def io1(self, dir:str):
        iof.do_io1(dir)
[...]

This way, you could get rid of the class and still have access to the io functions in io_funcs.py.

I think the best way is to try to depend in the main functions

The lack of module/class/function documentation and the repeated use of (multiple) global imports (from X import *), along with the import of unused modules, make it very difficult to trace the objects origin or determine what constitutes "a main function". EDIT: Oops! That statement above was too hasty: my bad! There are docstrings, but few __init__ methods or functions have them.

CatChenal commented 10 months ago

I've sorted out which imports we need for microstates_sites_energies function: we no longer need run_cdc.py either.

CatChenal commented 10 months ago

@granepura Would it make sense to return both kinds of microstates matrices?

granepura commented 10 months ago

@granepura Would it make sense to return both kinds of microstates matrices?

Yes, I would return both matrices. They both don’t take very long to make, so might as well make them!

CatChenal commented 10 months ago

Super!

CatChenal commented 10 months ago

@granepura Should the gromacs pdbs still contain 'CLA'?

granepura commented 10 months ago

@granepura Should the gromacs pdbs still contain 'CLA'?

Yes it needs to still be called CLA.