Addition of data ran with other than Gromacs codes

ohsOllila commented 3 years ago

AddData.py already contains the dictionaries also for other than Gromacs simulations. However, the code probably does not fully work with those. This should be fixed at least for OpenMM and Amber data that we already have in Zenodo.

batukav commented 3 years ago

Hello,

I can help with implementing OpenMM and possibly for Amber data.

I think in the long run it would be very helpful for us to create a separate Python library that contains all the functions we are using, both for building and analyzing the data bank, and rewriting the AddData.py to use these functions. For instance, the analysis available in AddData.py around the lines 800 and afterwards won't be possible with OpenMM and Amber files as there is no tpr file from them.

ohsOllila commented 3 years ago

I was looking at the AddData.py code with this issue in mind and below are my thoughts:

Because user currently fills the dictionary directly, the requested files will somewhat depend on the program used for the simulation. The instructions have to be written consistently with the content of amber_dict, namd_dict, charmm_dict, and openmm_dict. This is also related to other issue opened by @batukav https://github.com/NMRLipids/Databank/issues/10.
Calculation of number of lipids begins on line 810. Lines 812-814 are used to make gro file which is given to MDanalysis in line 845. Only lines 812-814 are program dependent here. These should be relatively easy to make for other programs.
Extracting temperature close to line 940 is completely dependent on Gromacs. This must be done separately for different programs depending on their file types, but I think that this should not be difficult.
Extraction of simulation length starting from line 952 is independent on program.
Order parameter calculation after line 1057 is independent in program. However, molecules are made whole before that between lines 1048-1053 in Gromacs dependent manner.

In conclusion, making the AddData.py to work with other programs should not be very complicated.

@batukav do you have OpenMM trajectories in Zenodo which we could start to work with?

batukav commented 3 years ago

@ohsOllila I have one pure POPE trajectory uploaded and ready for further analysis at 10.5281/zenodo.4665773.

2- There's already an implementation in MDAnalysis https://github.com/MDAnalysis/MDAnalysisCookbook/blob/master/examples/membrane-leaflets.py

Maybe this could be an easier for our purposes.

3- I can make the necessary implementation for the OpenMM.

5- To make molecules whole, we can try using MDAnalysis again. Then we can get rid of gmx tools for this point.

ohsOllila commented 3 years ago

2- I do not think that we need modifications in the part which calculates the number of lipids in leaflets. Only thing that needs to be done is to have the structure file in line 845. Where this comes from (creation or input) will be different depending on the program, but this should not be difficult.

5- If you have a good idea how to do it, it would be great.

I think that the first step would be to try to add the data you mentioned.

Only parts which do not seem completely straightforward are to decide which files are compulsory for openMM (i.e. creating the dictionary), and how molecules will be made whole (we can also request people to have molecules in whole in other than Gromacs trajectories if needed).

Do you want to try to modify the code and add the data?

batukav commented 3 years ago

5- I'll work on it.

Yes, in the following days I'll be analyzing the POPE data for the NMRLipids VI project. In the meantime I can try to figure out the compulsory files from the OpenMM data

batukav commented 3 years ago

I created "check_bond.py" that takes the trajectory, checks the bond lengths, and if a bond length is larger than 0.2 nm performs a simple wrapping using the MDAnalysis.

Centering the bilayer without actually visualizing the trajectory is a bit tricky (at least for me). However, for the OP calculation, if I remember correctly, all we need is a wrapped trajectory. We can keep improving the centering options, but as far as I can see it now, the wrapping issue is solved.

Please let me know if there are problems with the script.

ohsOllila commented 2 years ago

Basic analysis codes (order parameters, area per lipid, form factor and thickness) are now implemented for Gromacs and openMM.

For gromacs, the analysis codes tries to use tpr, but if this does not work due to version dependecies, a gro file is created with Gromacs commands and the created gro is used with MDAnalysis.

For openMM, a pdb file is given instead of tpr and pdb is used with MDAnalysis. However, this is not yet updated in the instructions.

For example, see are per lipid calculation in calcALP.py.

I think that similar idea should work also for other programs. However, I am not aware of a code that could be used to check that molecules are whole and that there no inconvenient jumps over pbc in z-direction for other than Gromacs simulations. Therefore, the requirements for other than Gromacs trajectories is more strict currently (this is also not yet in the instructions).

Because this discussion is now overlapping with another issue "Preferred topology field in the input files?" issue, I will close the issue https://github.com/NMRLipids/Databank/issues/10 and we can continue discussion in here.

pbuslaev commented 2 years ago

Have you considered using MDAnalysis transformations with topology provided? At least for me it works fine:

Before:

After:

For unwrap to work the only import thing is that proper topology with all bonds defined is provided

ohsOllila commented 2 years ago

Thanks, this looks useful. However, I am not sure which file we should use as topology with openMM trajectories? Currently we used pdb, but bonds are probably not sufficiently defined there.

pbuslaev commented 2 years ago

From OpenMM documentation it seems that OpenMM should work with one of xml, psf, or tpr formats. All of them contain bond information and are supported by MDAnalysis. This should be tested of course, but now it seems that asking the user to provide topology in one of these formats can be a solution

batukav commented 1 year ago

I remember trying the MDAnalysis transformations for wrapping the trajectories but did not get consistent results. But I don't remember about the details of my experiments, unfortunately. On the other hand, it would be highly useful, at some point, to be able to provide these transformations in the analysis scripts and not rely on "good" user input anymore. Yet, I think this is still a bit low priority to implement.

By simply adding a topology key to the sim dictionary in AddData.py (and other relevant parts) and requiring the user to fill this topology key in the yaml file could help us in the long run. This topology key can be entered to be the same as the already existing tpr and pdb keys, for instance, yet we just use the topology key whenever running any analysis that requires the bond information. If no objections agains it, I would suggest adding the topology key already into the documentation and the manuscript, although we still don't update the "backend" of the databank to use this.

pbuslaev commented 1 year ago

Actually, I now think that we can even overcome providing the topology. If we know what is the lipid name and what the force field is we actually know which atoms are connected (this information can also be stored in the mapping file for instance). Then we can upload use add_TopologyAttr method (check this) to add bonds and then unwrap should work normally for lipids.

batukav commented 1 year ago

Wouldn't this approach have problems if there are custom made/non-standard residues and/or different types of lipids (like glycolipids etc.)? Asking for a topology from the user wouldn't be too much to request, I suppose.

batukav commented 1 year ago

Have you considered using MDAnalysis transformations with topology provided? At least for me it works fine:

Before: After:

For unwrap to work the only import thing is that proper topology with all bonds defined is provided

For a future public reference; what should be the command for centering the bilayere center of mass at the origin and wrapping the remaining atoms?

ohsOllila commented 1 year ago

Actually, I now think that we can even overcome providing the topology. If we know what is the lipid name and what the force field is we actually know which atoms are connected (this information can also be stored in the mapping file for instance). Then we can upload use add_TopologyAttr method (check this) to add bonds and then unwrap should work normally for lipids.

Wouldn't this approach have problems if there are custom made/non-standard residues and/or different types of lipids (like glycolipids etc.)? Asking for a topology from the user wouldn't be too much to request, I suppose.

Yes, I think that we should definitely rather ask the required topology files from the data contributor rather than try to reproduce the information. Question is that which files do we exactly need, for example, for OpenMM? For example, in the recent data uploaded by @batukav (https://doi.org/10.5281/zenodo.7415922, https://doi.org/10.5281/zenodo.7418619), there are some xml files but not psf files. Are these xml files enough? Are both of those needed?

If making molecules whole in a reasonable and robust way in these simulations is not feasible, we can also accept only OpenMM simulations with molecules already whole (which is the case in all current OpenMM simulations in the databank).

For other analyses I have done from the databank (standard + the ones in the manuscript), pdb for OpenMM has been sufficient assuming that the molecules are whole. This would probably apply also to other programs, such as Amber from which there is also data in Zenodo. For Amber we have the same questions: Do we ask some topology file, and if yes, which one? Or do we just ask pdb and require that molecules are whole?

batukav commented 1 year ago

Yes, I think that we should definitely rather ask the required topology files from the data contributor rather than try to reproduce the information. Question is that which files do we exactly need, for example, for OpenMM? For example, in the recent data uploaded by @batukav (https://doi.org/10.5281/zenodo.7415922, https://doi.org/10.5281/zenodo.7418619), there are some xml files but not psf files. Are these xml files enough? Are both of those needed?

With OpenMM one can generate the topology information within the Python scripts that are used to run the simulation. Therefore sometimes one does not need to create a separate topology file and read it into the simulation. This was the approach taken by the AMOEBA script I have used. The xml files are the force field files. It should be indeed possible for us to parse the force field file to get the bond information and use this to check from the pdb file + trajectory if the molecules are whole. I don't know how fast we can come up with an implementation. Also, this is a rather edge case that should not be a high priority, in my opinion. That being said, I will check if I can somehow save the topology information from OpenMM so that I can update the Zenodo files.

Current order parameter script calculates the bond lengths and raises a warning if this is beyond a threshold. We can urge the users to upload a topology file and this should not be a problem in a large majority of the cases. For the uploads without a topology information, we can raise a warning in the scripts saying that we cannot make the molecules whole therefore this and that functionality of the databank won't work.

For other analyses I have done from the databank (standard + the ones in the manuscript), pdb for OpenMM has been sufficient assuming that the molecules are whole. This would probably apply also to other programs, such as Amber from which there is also data in Zenodo. For Amber we have the same questions: Do we ask some topology file, and if yes, which one? Or do we just ask pdb and require that molecules are whole?

Simulations from the AMBER MD engine requires prmtop files to run. These files can be read into MDAnalysis and subsequently be used. So, we should ask for the prmtop files, otherwise raise a warning.

batukav commented 1 year ago

If making molecules whole in a reasonable and robust way in these simulations is not feasible, we can also accept only

OpenMM simulations with molecules already whole (which is the case in all current OpenMM simulations in the databank). Except the POPC with 1000 mM CaCl2 (10.5281/zenodo.5507316), all OpenMM trajectories from the Drude simulations have the molecules whole. I'll fix the issues with this trajectory and re-upload it.

ohsOllila commented 1 year ago

I have now enabled addition of simulations ran with NAMD program in exactly the same way as openMM simulations. This seems to work. This might be the way to go also for data with other programs. We can try this when the first contributions with other programs will come.

ohsOllila commented 6 months ago

I will close this. If there will be data from new programs, we can open new issues specifically for each program.

NMRLipids / Databank

Addition of data ran with other than Gromacs codes #5