ohsOllila commented 3 years ago

Methods to check how equilibrated are trajectories in the databank were discussed in the NMRlipids online meeting on 26th of February 2021. Following the variation of area per molecule between different trajectory segments and usage of PCA analysis were proposed. However, practical details were not discussed. We can continue the discussion here.

ivan-gushchin-mipt commented 3 years ago

a) Regarding the total area or area per lipid, I think the following: area per lipid is more informative, and easy to calculate for small single component membranes, but more difficult to calculate for larger undulating systems or when many different lipid types are present - it will require Voronoi tesselation or something like this. APL distribution can also help in spotting the phase separation. APLs are also almost always tested and discussed in the publications with simulations of lipid membranes, so it would be nice to have them pre-calculated in the Databank.

On the other hand, a plot showing dependence of total system area on time could also help for qualitative assessment (by eye) of how large and slow or fast are the fluctuations. To get a quantitative estimate, the easiest is to require that the observed total area in the first half of the trajectory is distributed similarly to the total area in the second half. Perhaps someone who knows statistics better than me (@pbuslaev?) can propose a better quantity? Such as "ratio of decorrelation time to trajectory length" (the smaller the better).

b) Regarding the usage of PCA analysis, it can provide a much more quantitative description of dynamics and highlight the slowest processes (motions). The analysis is not fast, but if we only analyze the autocorrelation as a function of lag time, this can be done in under 1 hour per usual trajectory on a single work station - or perhaps even faster after optimization. Python scripts for automated lipid PCA are already available in some form, @pbuslaev can comment on this.

batukav commented 3 years ago

Hello,

b) Regarding the usage of PCA analysis, it can provide a much more quantitative description of dynamics and highlight the slowest processes (motions). The analysis is not fast, but if we only analyze the autocorrelation as a function of lag time, this can be done in under 1 hour per usual trajectory on a single work station - or perhaps even faster after optimization. Python scripts for automated lipid PCA are already available in some form, @pbuslaev can comment on this.

@ivan-gushchin-mipt could you please tell which files are necessary for the PCA analysis? If we decide to carry out the PCA analysis, I'll need to make sure that it also works for OpenMM and Amber files.

ivan-gushchin-mipt commented 3 years ago

Hi @batukav, I'm sorry for the long wait, generally we need only the trajectory files for PCA (.xtc and .gro in case of GROMACS). @pbuslaev can say whether the current script can accept OpenMM and Amber files directly, or whether it can be modified to accept such files - likely yes. It is based on MDTraj, which seems to work with both OpenMM and Amber https://mdtraj.org/1.9.4/load_functions.html.

pbuslaev commented 3 years ago

Hi @batukav, @ivan-gushchin-mipt.

I have tested PCAlipids code with OpenMM and namd simulations and it worked fine. So I guess we can use it to compute equilibration times.

I need to make an enveloping script that will allow to use PCAlipids directly from databank, but that should not be difficult. I will work on that.

ohsOllila commented 3 years ago

I added now codes that loop over all simulations in here: https://github.com/NMRLipids/Databank/tree/main/Scripts/AnalyzeDatabank

The dihedral code is slightly updated version of the code Pavel contributed in here: https://github.com/NMRLipids/NMRlipidsIVPEandPG/issues/9 I think that it should work for all programs. I did not commit the dihedral results yet, because it is still running on my computer.

Density calculation code works only for Gromacs now, but should be generalized (we need density profiles to look water permeation through membranes which is one of the example applications).

Anyway, making PCA code based on similar idea as in these should be indeed not difficult.

ohsOllila commented 1 year ago

Thanks to the work by @pbuslaev, @BananaOverLord and others, we have now a code that estimates how equilibrated are lipid conformational ensembles using the PCA analysis: https://github.com/NMRLipids/Databank/blob/main/Scripts/BuildDatabank/NMRPCA_timerelax.py. I think that we should include this analysis in the first publication about the databank.

The analysis is not yet ran over all the data, but this should be relatively straightforward now. However, there are still some issues raised in the recent pull request https://github.com/NMRLipids/Databank/pull/123. I have moved the discussion into this issue.

Open questions: Where do we store the relaxation times?

We can put the values in json files in the folders for each simulation, similarly as thickness values are stored in thickness.json files in folders in here: https://github.com/NMRLipids/Databank/tree/main/Data/Simulations

What do we want to do with non-GROMACS trajectories. In the main branch they were skipped, I am doing the same now, but there is no reason for that.

The analysis should work currently at least for openMM with using PDB files. However, there is currently no way to check whether molecules are whole in these trajectories. Anyway, we should run this also for at least openMM trajectories unless there are further complications. For more discussion about other than Gromacs trajectories is in issue https://github.com/NMRLipids/Databank/issues/5.

TODO: Amber treatment with 3 residues per lipid still has to be added.

This would be good. Is it lot of work?

ohsOllila commented 1 year ago

One additional remark about the PCA code: The list of lipids is defined in line https://github.com/NMRLipids/Databank/blob/56cbcab50b1f73bf8cdf6558643e0a4f3926ffc9/Scripts/BuildDatabank/NMRPCA_timerelax.py#L105. However, the list of lipids is defined in lipids_dict in https://github.com/NMRLipids/Databank/blob/main/Scripts/BuildDatabank/databankLibrary.py. It would be better just to import this because then the new lipids have to be added only in the databankLibrary.py, not separately to the PCA code.

pbuslaev commented 1 year ago

We can put the values in json files in the folders for each simulation, similarly as thickness values are stored in thickness.json files in folders in here: https://github.com/NMRLipids/Databank/tree/main/Data/Simulations

This I can easily do

TODO: Amber treatment with 3 residues per lipid still has to be added.

This would be good. Is it lot of work?

As I currently see it - no. About 50 lines. Could you provide me a path to the trajectory, I can use as a test?

However, the list of lipids is defined in lipids_dict in https://github.com/NMRLipids/Databank/blob/main/Scripts/BuildDatabank/databankLibrary.py. It would be better just to import this because then the new lipids have to be added only in the databankLibrary.py, not separately to the PCA code.

I will check it. There were some problems with naming before. Most probably solved now

ohsOllila commented 1 year ago

As I currently see it - no. About 50 lines. Could you provide me a path to the trajectory, I can use as a test?

This could be one example: https://github.com/NMRLipids/Databank/tree/main/Data/Simulations/1c4/77d/1c477da411113327d1c0e43eea10b0f096031d45/35a315cecf0156381237417893bf6755b08ab3e8 Note that the residues are defined in the mapping files if this helps, for example, for POPC: https://github.com/NMRLipids/Databank/blob/main/Scripts/BuildDatabank/mapping_files/mappingPOPClipid17.yaml

I will check it. There were some problems with naming before. Most probably solved now

Let me know if there are problems, it should work.

pbuslaev commented 1 year ago

However, the list of lipids is defined in lipids_dict in https://github.com/NMRLipids/Databank/blob/main/Scripts/BuildDatabank/databankLibrary.py

Do these names correspond to lipid names in the topology/structure? Since we need to select trajectories of the particular lipids from the whole trajectory, we need to be sure that the names in this dictionary actually do correspond to the names of the trajectory. I am not 100% sure that it is the case, especially for those lipids: 'PAzePCprot' : {"REQUIRED": False, "TYPE" : "string", }, 'PAzePCdeprot' : {"REQUIRED": False, "TYPE" : "string", }

ohsOllila commented 1 year ago

Do these names correspond to lipid names in the topology/structure?

No, these names are universal molecule names used in the NMRlipids databank. These are listed together with the molecule names in second table in here: https://github.com/NMRLipids/Databank/blob/main/Scripts/BuildDatabank/info_files/README.md

The lipid names in topology/structure are accessible through README.yaml files like this: readme["COMPOSITION"][lipid]["NAME"] where lipid is the universal name listed in lipids_dict. This should then give the topology/structure specific name.

However, in the Amber style of cases where different parts of lipids have different residue names, only one of them is given by this command. However, in those cases the residue names for each atom are stored in the mapping, see for example: https://github.com/NMRLipids/Databank/blob/main/Scripts/BuildDatabank/mapping_files/mappingPOPClipid17.yaml In codes I have done it usually such that the code is trying to read the residue name from the mapping file and if this fails, then the exception reads it from readme["COMPOSITION"][lipid]["NAME"]. This should work because if the residue name cannot be found from the mapping file, then the one from latter option should work.

pbuslaev commented 1 year ago

Ah, ok. Now I see. Then I should do some small modifications to use standard names

ohsOllila commented 1 year ago

I modified the loop such that it goes over all simulations if the result is not yet found and simulation is not ran with openMM (I pushed this version to GitHub already). It successfully analyzed several simulations (which probably have the same lipid names as in the current list), but then failed to this one: https://github.com/NMRLipids/Databank/tree/main/Data/Simulations/07d/cf3/07dcf38cf72985a9535571aaa39d7ae76921838b/9633603ac288bf90cb41aeac480bd1b222658439

I am not sure whether it is because of the lipid naming issue, because it is united atom simulation, because this may in hexagonal box, or several of these reasons.

If this or some other type of system is difficult to implement for this analysis, we can add a WARNING key in the README.yaml file and skip these simulation in this analysis. We have done this for some systems, for example when membrane normal is not in z-direction.

pbuslaev commented 1 year ago

I will solve the naming issue now and let's then try again to run the analysis.

pbuslaev commented 1 year ago

I modified the checks a bit, so they are not in the main loop now, but in the Parser class. Now non-GROMACS trajectories are skipped and if eq_time.json is already in the folder then the calculation is not initiated. Also, I modified the lipid list, so now lipid names come from databankLibrary:lipids_dict and actual residue name in the structure comes from readme["COMPOSITION"][lipid]["NAME"]

pbuslaev commented 1 year ago

I modified the loop such that it goes over all simulations if the result is not yet found and simulation is not ran with openMM (I pushed this version to GitHub already). It successfully analyzed several simulations (which probably have the same lipid names as in the current list), but then failed to this one: https://github.com/NMRLipids/Databank/tree/main/Data/Simulations/07d/cf3/07dcf38cf72985a9535571aaa39d7ae76921838b/9633603ac288bf90cb41aeac480bd1b222658439

The current code successfully worked for this trajectory

ohsOllila commented 1 year ago

I modified the checks a bit, so they are not in the main loop now, but in the Parser class.

Somehow these did not work for me, but I added the same checks in the main loop and got forward.

Next issue that I ran into was that there are tpr files with Gromacs versions that are not supported by MDanalysis. Nevertheless this can circumvented by using gro files as we have done in other codes. I added this in the Parser class and it worked for now.

Then I got this error:

Parser: Processing trajectory 0a3/3e7/0a33e7ca5e8c7a399e40936777ffa7b236154ed8/8e731b9c41f4092328aa92f73fd89f57887c3ebd Parser: Iterating over all trajectories. Current trajectory is 0a3/3e7/0a33e7ca5e8c7a399e40936777ffa7b236154ed8/8e731b9c41f4092328aa92f73fd89f57887c3ebd ../../Data/Simulations/../../Data/Simulations/0a3/3e7/0a33e7ca5e8c7a399e40936777ffa7b236154ed8/8e731b9c41f4092328aa92f73fd89f57887c3ebd//eq_times.json Parser: Founder centered trajectory: centered.xtc Traceback (most recent call last): File "/media/osollila/Data/NMRlipids/Databank/Scripts/BuildDatabank/NMRPCA_timerelax.py", line 845, in parser.concatenateTraj() File "/media/osollila/Data/NMRlipids/Databank/Scripts/BuildDatabank/NMRPCA_timerelax.py", line 236, in concatenateTraj concatenator = Concatenator(topology, File "/media/osollila/Data/NMRlipids/Databank/Scripts/BuildDatabank/NMRPCA_timerelax.py", line 442, in init self.topology.runMerger() File "/media/osollila/Data/NMRlipids/Databank/Scripts/BuildDatabank/NMRPCA_timerelax.py", line 398, in runMerger f"not name H* and resname {resnameDict[TAILSN1]} and " KeyError: 'sn-1'

This is probably because there are ambiguous atom names in this simulation. This is denoted in the 'WARNIGNS' key in here: https://github.com/NMRLipids/Databank/tree/main/Data/Simulations/0a3/3e7/0a33e7ca5e8c7a399e40936777ffa7b236154ed8/8e731b9c41f4092328aa92f73fd89f57887c3ebd These simulations are skipped for example in order parameter analyses, and we probably have to skip these also in here? There is only few such simulations so this is not a big issue. I added this also in the main loop.

Now the code is running again, I will keep you updated how it proceeds. The new versio is pushed also here.

ohsOllila commented 1 year ago

Now I got this error which I do not understand:

Parser: Processing trajectory 0a4/101/0a41017641414540973e921ed22528d1f3dc414b/3c0936e61fa40cce74fd1828a2697742709e91fb Parser: Iterating over all trajectories. Current trajectory is 0a4/101/0a41017641414540973e921ed22528d1f3dc414b/3c0936e61fa40cce74fd1828a2697742709e91fb ../../Data/Simulations/../../Data/Simulations/0a4/101/0a41017641414540973e921ed22528d1f3dc414b/3c0936e61fa40cce74fd1828a2697742709e91fb//eq_times.json Parser: Founder centered trajectory: centered.xtc Concatenator: Concatenating lipid DPPC /home/osollila/.conda/envs/py39/lib/python3.9/site-packages/MDAnalysis/analysis/align.py:1417: SelectionWarning: Atoms could not be matched since they don't contain masses. warnings.warn(msg, category=SelectionWarning) Main: parsing lipid DPPC Main: PCA: done Main: Projections: done Main: Autocorrelations: done Traceback (most recent call last): File "/media/osollila/Data/NMRlipids/Databank/Scripts/BuildDatabank/NMRPCA_timerelax.py", line 862, in te2 = TimeEstimator(pca_runner.autocorrelation).calculate_time() File "/media/osollila/Data/NMRlipids/Databank/Scripts/BuildDatabank/NMRPCA_timerelax.py", line 782, in calculate_time te1 = self.timerelax() File "/media/osollila/Data/NMRlipids/Databank/Scripts/BuildDatabank/NMRPCA_timerelax.py", line 765, in timerelax A, B = self.get_nearest_value(R_log, power) File "/media/osollila/Data/NMRlipids/Databank/Scripts/BuildDatabank/NMRPCA_timerelax.py", line 730, in get_nearest_value A = np.where(iterable < value)[0][0] IndexError: index 0 is out of bounds for axis 0 with size 0

The system is this: https://github.com/NMRLipids/Databank/tree/main/Data/Simulations/0a4/101/0a41017641414540973e921ed22528d1f3dc414b/3c0936e61fa40cce74fd1828a2697742709e91fb

pbuslaev commented 1 year ago

I will need to run some tests. I will report on the progress soon

pbuslaev commented 1 year ago

126 Should solve the error, but it actually indicates that the system is not equilibrated

ohsOllila commented 1 year ago

https://github.com/NMRLipids/Databank/pull/126 Should solve the error, but it actually indicates that the system is not equilibrated

Yes, this system is probably in gel phase so this is expected.

Now I am experiencing other error for some systems. This is one example: https://github.com/NMRLipids/Databank/tree/main/Data/Simulations/813/2bc/8132bc41c60e0156caeb7a99f11efb3206fa1619/69de81ab0e12664c93685826a4b1734bcde763fe

For this, I get this error: ../../Data/Simulations/../../Data/Simulations/813/2bc/8132bc41c60e0156caeb7a99f11efb3206fa1619/69de81ab0e12664c93685826a4b1734bcde763fe//eq_times.json Parser: Founder centered trajectory: centered.xtc Concatenator: Concatenating lipid PYPC Main: parsing lipid PYPC Main: PCA: done Main: Projections: done Main: Autocorrelations: done Main: EQ time: done [0.04453294 0.04571682 0.04664062 0.04718968 0.04585822 0.04594673 0.04587408 0.04519818 0.04353252] Traceback (most recent call last): File "/media/osollila/Data/NMRlipids/Databank/Scripts/BuildDatabank/NMRPCA_timerelax.py", line 877, in parser.dumpData(equilibration_times) File "/media/osollila/Data/NMRlipids/Databank/Scripts/BuildDatabank/NMRPCA_timerelax.py", line 246, in dumpData json.dump(data, f) File "/home/osollila/.conda/envs/py39/lib/python3.9/json/init.py", line 179, in dump for chunk in iterable: File "/home/osollila/.conda/envs/py39/lib/python3.9/json/encoder.py", line 431, in _iterencode yield from _iterencode_dict(o, _current_indent_level) File "/home/osollila/.conda/envs/py39/lib/python3.9/json/encoder.py", line 405, in _iterencode_dict yield from chunks File "/home/osollila/.conda/envs/py39/lib/python3.9/json/encoder.py", line 438, in _iterencode o = _default(o) File "/home/osollila/.conda/envs/py39/lib/python3.9/json/encoder.py", line 179, in default raise TypeError(f'Object of type {o.class.name} ' TypeError: Object of type ndarray is not JSON serializable

It somehow looks that it gets multiple results for this. The same happens for some other systems as well, but for most systems the code seems to work fine.

ohsOllila commented 1 year ago

Another error that occurs for some systems (only a few) is this:

../../Data/Simulations/../../Data/Simulations/b86/e1d/b86e1d05121438d07f56d447bf4a0ec1add4af0c/b7cad43d5423af6dee7da746eabe48b79e79c271//eq_times.json Parser: Founder centered trajectory: centered.xtc Concatenator: Concatenating lipid POPC /home/osollila/.conda/envs/py39/lib/python3.9/site-packages/MDAnalysis/analysis/align.py:1417: SelectionWarning: Atoms could not be matched since they don't contain masses. warnings.warn(msg, category=SelectionWarning) Concatenator: Concatenating lipid POPS Traceback (most recent call last): File "/media/osollila/Data/NMRlipids/Databank/Scripts/BuildDatabank/NMRPCA_timerelax.py", line 854, in parser.concatenateTraj() File "/media/osollila/Data/NMRlipids/Databank/Scripts/BuildDatabank/NMRPCA_timerelax.py", line 238, in concatenateTraj self.concatenated_trajs.append(concatenator.concatenate()) File "/media/osollila/Data/NMRlipids/Databank/Scripts/BuildDatabank/NMRPCA_timerelax.py", line 579, in concatenate concatenated_traj, n_lipid, n_frames = self.concatenateTraj() File "/media/osollila/Data/NMRlipids/Databank/Scripts/BuildDatabank/NMRPCA_timerelax.py", line 477, in concatenateTraj coords = coords.reshape((n_frames, n_lipid, n_atoms_lipid, 3)) ValueError: cannot reshape array of size 144222624 into shape (77042,72,52,3)

ohsOllila commented 1 year ago

I observed also this error:

Parser: Processing trajectory 387/6d8/3876d878edf3efd6e4b40a7e78661581b1dfd387/b048d322ae8d7ee95e888b26f87dda3ec6fd9350 Parser: Iterating over all trajectories. Current trajectory is 387/6d8/3876d878edf3efd6e4b40a7e78661581b1dfd387/b048d322ae8d7ee95e888b26f87dda3ec6fd9350 ../../Data/Simulations/../../Data/Simulations/387/6d8/3876d878edf3efd6e4b40a7e78661581b1dfd387/b048d322ae8d7ee95e888b26f87dda3ec6fd9350//eq_times.json Parser: Founder centered trajectory: centered.xtc Concatenator: Concatenating lipid POPE Traceback (most recent call last): File "/media/osollila/Data/NMRlipids/Databank/Scripts/BuildDatabank/NMRPCA_timerelax.py", line 854, in parser.concatenateTraj() File "/media/osollila/Data/NMRlipids/Databank/Scripts/BuildDatabank/NMRPCA_timerelax.py", line 234, in concatenateTraj concatenator = Concatenator(topology, File "/media/osollila/Data/NMRlipids/Databank/Scripts/BuildDatabank/NMRPCA_timerelax.py", line 440, in init self.topology.runMerger() File "/media/osollila/Data/NMRlipids/Databank/Scripts/BuildDatabank/NMRPCA_timerelax.py", line 396, in runMerger f"not name H* and resname {resnameDict[TAILSN1]} and " KeyError: 'sn-1'

pbuslaev commented 1 year ago

127 Should solve all these bugs

pbuslaev commented 1 year ago

@ohsOllila I have a couple of general comments and questions actually:

While working on the code, I noted some inconsistencies in data organization. One example is different entries in the mapping files for various force fields (Lipid17 FF has a field RESIDUE, while other force fields do not). I understand the reason for that, but it doesn't seem general and such decisions might lead to limitations in the future. Therefore, a question is what is the long-term plan for the DataBank? How do you expect others to contribute to the code base in the future? Will things largely depend on you as it is now (at least, as far as I understand the current mechanism, you are the key person to contact for potential contributors)? Probably making some clear instruction on how one can contribute to the code base is worth?
The second question that I have is relating to the test coverage of the code used for analysis. Have you thought about this? In my opinion, this might be quite important in the future, especially if at some point you decide to make some changes in the data structure, it can be quite easy to forget to adjust all the analysis scripts accordingly and unit tests can help to track potential issues.

I guess, it is worth opening new independent issues to discuss those questions, so if you think those discussions are important, let me know and I will open separate branches then.

batukav commented 1 year ago

@pbuslaev in case we get issues with running the analysis over openmm trajectories, how can I use your script in a stand-alone way?

pbuslaev commented 1 year ago

@batukav Currently there is a check in Parser class that in principle should prevent it. If you want, you can modify it. Also, the script is now designed to work with NMR Databank data structure and uses data from README and mapping files. So if you have a standalone trajectory, then the script probably won't work. However, in that case, you should be able to use PCAlipids code, which I can help you with.

batukav commented 1 year ago

@pbuslaev I'd like to run the PCAlipids code for the trajectories at https://zenodo.org/record/7415922 and https://zenodo.org/record/7418619 . The se are dcd trajectories with only pdb available as the topology. Would these files be enough to run the analysis?

ohsOllila commented 1 year ago

I have now ran the code through the simulations in databank with the exception of simulations with OpenMM and with trajectories larger than 5Gb in size. I have pushed the results in files (eq_times.json) in databank folders in here: https://github.com/NMRLipids/Databank/tree/main/Data/Simulations

We should now decide how we will present these in the first databank manuscript. In minimum, we should mention that such values are available for each simulation and they can be used to estimate how equilibrated molecules are from the point of view of principal components. In addition, it might be nice to show some more analyzes in the SI. Any ideas on this? We also need a description of this analysis in the SI.

Regarding the OpenMM simulations, after a quick look at the code it should not be very difficult to update the parser part such that it would work for OpenMM trajectories as well. I think that molecules are whole in all OpenMM trajectories that are currently in the databank so we do not have to worry on that right now. Would someone have time to take a look at this? For example, here is one OpenMM dataset: e69/c46/e69c46a7f69132e734a5dc908ef0bbfa03a91abd/e69c46a7f69132e734a5dc908ef0bbfa03a91abd/

I had to discard large simulation trajectories due to memory consumption. Would it be possible to do something on that so that we could also analyze large simulations?

pbuslaev commented 1 year ago

@pbuslaev I'd like to run the PCAlipids code for the trajectories at https://zenodo.org/record/7415922 and https://zenodo.org/record/7418619 . The se are dcd trajectories with only pdb available as the topology. Would these files be enough to run the analysis?

@batukav If lipids are whole, then it should be enough. I actually successfully used it for other AMOEBA simulations you shared with me. If you have any issues with using it, let me know. I can then either run the analysis myself, or help with the issues.

pbuslaev commented 1 year ago

We should now decide how we will present these in the first databank manuscript. In minimum, we should mention that such values are available for each simulation and they can be used to estimate how equilibrated molecules are from the point of view of principal components. In addition, it might be nice to show some more analyzes in the SI. Any ideas on this? We also need a description of this analysis in the SI.

I can add a methodological part to the manuscript this week for sure. Also, I think adding a panel (histogram similar to panel b) to figure 1 with the distributions of equilibration metric can be considered. In the SI a comparison of liquid and gel phases can be given for instance. But probably @ivan-gushchin-mipt has better ideas.

Regarding the OpenMM simulations, after a quick look at the code it should not be very difficult to update the parser part such that it would work for OpenMM trajectories as well.

I can do this. Could you please share a trajectory code to test on?

I had to discard large simulation trajectories due to memory consumption. Would it be possible to do something on that so that we could also analyze large simulations?

One of the solutions can be that we take only every 5th or 10th frame for trajectories larger than some value. I can also add this to the code

ohsOllila commented 1 year ago

Also, I think adding a panel (histogram similar to panel b) to figure 1 with the distributions of equilibration metric can be considered. In the SI a comparison of liquid and gel phases can be given for instance. But probably @ivan-gushchin-mipt has better ideas.

I believe that people would be interested, for example, to have an idea how long trajectories are required on average to equilibrate this property. I know that there are interesting dependencies on the system, force field and temperature in here, but it might be difficult to present many of these concisely. Some examples might be nice thought. One idea might be to show a scatter plot with simulation length on one axis and relative equilibration time on other, including distributions in both axes. Something like this: https://matplotlib.org/stable/gallery/lines_bars_and_markers/scatter_hist.html. This should be straightforward to create from the databank with similar codes that have been used for other figures. Also equilibration time vs. temperature might be interesting.

I can do this. Could you please share a trajectory code to test on?

For example, this one mentioned also above could be used: e69/c46/e69c46a7f69132e734a5dc908ef0bbfa03a91abd/e69c46a7f69132e734a5dc908ef0bbfa03a91abd/

One of the solutions can be that we take only every 5th or 10th frame for trajectories larger than some value. I can also add this to the code

This sounds good if this is ok for the PCA analysis. I had problems with trajectories larger than approximately 5 Gb. Therefore, I added this if readme['TRAJECTORY_SIZE'] > 5000000000: continue in line https://github.com/NMRLipids/Databank/blob/d3b9ad71e2612c4cd6db8c273b4c9fd2aefd2ad2/Scripts/BuildDatabank/NMRPCA_timerelax.py#L859

pbuslaev commented 1 year ago

This sounds good if this is ok for the PCA analysis.

Yes, PCA is quite stable to different time steps between frames. I will pull an update this week.

batukav commented 1 year ago

@pbuslaev I'd like to run the PCAlipids code for the trajectories at https://zenodo.org/record/7415922 and https://zenodo.org/record/7418619 . The se are dcd trajectories with only pdb available as the topology. Would these files be enough to run the analysis?

@batukav If lipids are whole, then it should be enough. I actually successfully used it for other AMOEBA simulations you shared with me. If you have any issues with using it, let me know. I can then either run the analysis myself, or help with the issues.

The lipids in those trajectories should be whole. It could be more error free if you could run the code on them, if you have time.

ohsOllila commented 1 year ago

@pbuslaev I'd like to run the PCAlipids code for the trajectories at https://zenodo.org/record/7415922 and https://zenodo.org/record/7418619 . The se are dcd trajectories with only pdb available as the topology. Would these files be enough to run the analysis?

@batukav If lipids are whole, then it should be enough. I actually successfully used it for other AMOEBA simulations you shared with me. If you have any issues with using it, let me know. I can then either run the analysis myself, or help with the issues.

The lipids in those trajectories should be whole. It could be more error free if you could run the code on them, if you have time.

After these simulations are in the databank and the code works also for OpenMM simulations, the relative PCA equilibration time will be automatically analyzed from these. So if you want to see only this, we should get it automatically soon.

pbuslaev commented 1 year ago

@ohsOllila @batukav one strange thing about the trajectory e69/c46/e69c46a7f69132e734a5dc908ef0bbfa03a91abd/e69c46a7f69132e734a5dc908ef0bbfa03a91abd/:

According to Zenodo description:

500 ns MD simulation of pure POPC membrane using Charmm-Drude polarizable force field. The system contains 128 POPC lipids, 41 CaCl2, and 6400 SWM4 water molecules. wrapped.dcd has a frame saving frequency of 100 ps. Before running the Drude simulation, the system has been equilibriated using Charmm36 force field for 200 ns. The last frame of that simulation was used to generate Drude polarizable model. The first 100 ns of the Drude simulation has been discarded from this dataset.

The trajectory should have either 5000 or 4000 frames. However, after downloading it and manually checking I got only 3409 frames. Also the reported trajectory length in the README file is somewhat strange TRJLENGTH: 5136.000168715241 Should not it be in ps for any force field?

ohsOllila commented 1 year ago

Thanks for pointing this out. It seems that most of the OpenMM data in the databank have something wrong with the time information. By looking at one of my own datasets (35d/28e/35d28ea69cf190c65e222763733daf01fe2fd11e/35d28ea69cf190c65e222763733daf01fe2fd11e/README.yaml), I was able to track this down to combining two pieces of trajectories. I do not remember how I did it exactly in here, but it has not worked properly. Do you know what would be the best and correct way to combine two dcd trajetories simulated with OpenMM?

Anyway, there are some simulations from OpenMM in the databank with the correct time information. You can use, for example, this data 3a2/6a9/3a26a9b5cce79fce362a17e30b4a74e2b2ee8d71/3a26a9b5cce79fce362a17e30b4a74e2b2ee8d71/README.yaml to test the PCA code while we will figure out how to properly combine the trajectories.

pbuslaev commented 1 year ago

Do you know what would be the best and correct way to combine two dcd trajetories simulated with OpenMM?

One one to do that of course is through MDAnalysis, but you will have to manually define the correct time for each frame in the merged trajectory.

You can use, for example, this data

To check that PCA works it probably does not matter which data I use, but of course if I use the wrong trajectory lengths, the equilibration times are also wrong.

ohsOllila commented 1 year ago

One one to do that of course is through MDAnalysis, but you will have to manually define the correct time for each frame in the merged trajectory.

Would happen know if I could find an example of this from somewhere? I am not very convenient with MDAnalysis and could not find this anywhere.

pbuslaev commented 1 year ago

Some dirty solution can be like that:

import MDAnalysis as mda

top = "input.pdb"
traj1 = "traj1.dcd"
traj2 = "traj2.dcd"
timestep = 10 #ps

u = mda.Universe(top, traj1, traj2)
for ts in u.trajectory:
    ts.dt = timestep

with mda.Writer("traj_out.dcd",  u.select_atoms("all").n_atoms) as W:
    for ts in u.trajectory:
        W.write(u.select_atoms("all"))

ohsOllila commented 1 year ago

Some dirty solution can be like that:

Thanks. However, I cannot still get the timestep correct. It is 1ps in the resulting trajectory even thought it was 10 in the original.

pbuslaev commented 1 year ago

Ok, I now think that the issue might be with some strange format of dcd, e.g. here they mention that times are saved in some strange units. But actually for me saving as xtc worked pretty fine. Something along those lines:

with mda.Writer("traj_out.xtc",  u.select_atoms("all").n_atoms) as W:
    for ts in u.trajectory:
        W.write(u.select_atoms("all"))

batukav commented 1 year ago

@pbuslaev

@ohsOllila @batukav one strange thing about the trajectory e69/c46/e69c46a7f69132e734a5dc908ef0bbfa03a91abd/e69c46a7f69132e734a5dc908ef0bbfa03a91abd/:

According to Zenodo description:

500 ns MD simulation of pure POPC membrane using Charmm-Drude polarizable force field. The system contains 128 POPC lipids, 41 CaCl2, and 6400 SWM4 water molecules. wrapped.dcd has a frame saving frequency of 100 ps. Before running the Drude simulation, the system has been equilibriated using Charmm36 force field for 200 ns. The last frame of that simulation was used to generate Drude polarizable model. The first 100 ns of the Drude simulation has been discarded from this dataset.

The trajectory should have either 5000 or 4000 frames. However, after downloading it and manually checking I got only 3409 frames. Also the reported trajectory length in the README file is somewhat strange TRJLENGTH: 5136.000168715241 Should not it be in ps for any force field?

When I downloaded the trajectory, I have 5136 frames. Maybe something went wrong with your download?

One reason for the weird timestamp is that I have used generally VMD for merging the trajectories. Then, some of them had to be loaded to MDAnalysis for wrapping. I'm not sure if this difference in the timestamp will affect any of the workflow we have.

pbuslaev commented 1 year ago

One reason for the weird timestamp is that I have used generally VMD for merging the trajectories. Then, some of them had to be loaded to MDAnalysis for wrapping. I'm not sure if this difference in the timestamp will affect any of the workflow we have.

I would say that at least figure 1b is biased by the problem with timesteps. The trajectory length in the README is clearly wrong. This might also lead to confusion for the users.
The other problem, at least as I see it, that these kind of manipulations (merging with VMD) are not reported. So in my opinion there should be either recommended procedure for such manipulations, or it should be somehow reported.

batukav commented 1 year ago

The other problem, at least as I see it, that these kind of manipulations (merging with VMD) are not reported. So in my opinion there should be either recommended procedure for such manipulations, or it should be somehow reported.

I think this is a good catch. I wasn't aware of the timestamp will not be correctly handled with VMD and would cause problems with MDAnalysis. I can update the metadata accordingly to include the information about the concatenation of the trajectories (as I really didn't think it would be worth mentioning) as a remedy. But, we could consider raising a warning in the trajectory analysis if the timestamp is not continuous or simply weird. Would this be useful?

And yes, I agree that we should create recommendations for such procedures.

I would say that at least figure 1b is biased by the problem with timesteps. The trajectory length in the README is clearly wrong. This might also lead to confusion for the users.

Figure 1b of what? I will take a closer look into the problems with this trajectory and the readme file in 1-2 days and will update here.

pbuslaev commented 1 year ago

Figure 1b of what? I will take a closer look into the problems with this trajectory and the readme file in 1-2 days and will update here.

Of the current Databank manuscript.

ohsOllila commented 1 year ago

We should definitely have the correct timestamp in all trajectories that are in the databank. This is not only due to reasons mentioned above, but also to enable analyses related to dynamics in the future.

I tried many different options to combine two dcd trajectories such that the timestamps would be correct, but I did not manage to do that. Nevertheless, I managed to create a xtc file with correct timestamps from two dcd trajectories. First I transformed both dcd trajectories to xtc like this:

from MDAnalysis import *
u = Universe('start.pdb', "trajCORRECT2.dcd")
with Writer("traj2.xtc",  u.select_atoms("all").n_atoms) as W:
    for ts in u.trajectory:
        W.write(u.select_atoms("all"))

Then I combined the two resulting xtc files with the normal Gromacs procedure using gmx trjcat -settime. All the analysis codes worked also with this new xtc file.

In addition to having correct timestamps, the xtc file has substantially smaller size (7.1GB) than the dcd file (23.1GB). My conclusion from this exercise is that the xtc file format is much better for our purposes and we should use it whenever possible. Because OpenMM supports both formats, it should possible to directly save simulations in xtc format in the future, right?

I have now update the data from https://doi.org/10.5281/zenodo.5946835 with correct timestamps in the databank, see the commit above. We should do the same also for all the other OpenMM simulations with incorrect timestamps. Note that the databank entry code will change because the trajectory file will be different. Therefore, the entries with the trajectory having incorrect timestamps have to be also removed.

batukav commented 1 year ago

@pbuslaev @ohsOllila

Sorry for the late reply.

The timestamp of each trajectory can be passed to the MDAnalysis universe using the dt option.

u = mda.Universe(topology, trajectory, dt = 10)

will load the trajectory and set the timestamp to 10.

Just to avoid future confusion: timestep is the time difference between two consecutive frames and timestamp is the time corresponding to a certain frame obtained via frame_number * timestep .

With this, I think, all I need to do is to fix the info files for the Drude simulations in the databank. Then, we can always make the MDAnalysis to use the timestep from the info file.

Does this sound reasonable? If so, I'll go ahead and just correct the info files without editing the trajectories.

ohsOllila commented 1 year ago

If I understand correctly, you suggest to store timestamp information to README.yaml files and explicitly read this information in the analysis codes. I think that we should not do this because then this should be taken into account also in all future codes analysing the NMRlipids databank. I think that it is better to require that the trajectories have the correct timestamp information so this does not have to be taken into account by the analysis codes. I still think that the best option is to transform trajectories to xtc format as described above.

pbuslaev commented 1 year ago

@ohsOllila If I understand @batukav correctly, the only thing that might be needed is to move TRJLENGTH from automatically extracted data to compulsory in the AddData.py script. Then, all you need to change in AddData.py would be (I added comments to the old code, and added a new line)

try:
    # old:
    # u = Universe(top, traj)
    # new:
    u = Universe(top, traj, dt = traj_length/n_frames)
    u.atoms.write(gro, frames=u.trajectory[[0]]) #write first frame into gro file
except:
    #conf = str(dir_tmp) + '/conf.gro'
    print("Generating frame0.gro with Gromacs because MDAnalysis cannot read tpr version")
    if 'WARNINGS' in sim and sim['WARNINGS']['GROMACS_VERSION'] == 'gromacs3':
        os.system('echo System | trjconv -s '+ top + ' -f '+ traj + ' -dump 22000 -o ' + gro)
        #os.system('echo System | trjconv -s '+ top + ' -f '+ traj + ' -o ' + NewTraj)
        #u = Universe(gro, NewTraj)
    else:
        os.system('echo System | gmx trjconv -s '+ top + ' -f '+ traj + ' -dump 0 -o ' + gro)
    # old:
    # u = Universe(gro, traj)
    # new:
    u = Universe(gro, traj, dt = traj_length/n_frames)

    u.atoms.write(gro, frames=u.trajectory[[0]]) #write first frame into gro file

traj_length can be extracted from the .yaml file and n_frames can also be calculated in advance (with both GROMACS and MDAnalysis)

NMRLipids / Databank

How equilibrated are the trajectories? #2

126 Should solve the error, but it actually indicates that the system is not equilibrated

127 Should solve all these bugs