Becksteinlab / MDPOW

Calculation of water/solvent partition coefficients with Gromacs.
https://mdpow.readthedocs.io
GNU General Public License v3.0
25 stars 10 forks source link

Adding Analysis Module #168

Closed ALescoulie closed 3 years ago

ALescoulie commented 3 years ago

MDPOW-Analysis Module

Features

I would like to add an analysis module to MDPOW that takes the files in a FEP directory and runs different analysis functions in a modular framework, and saves results to csv files to an ANALYSIS directory in the molecule directory. The eventual goal being to use the data in a decision tree to predict convergence.

The goal is setting this up in such a way that an analysis can be easily run with a script or with in interactive just by passing in some kwargs like the example below.

run_analysis(directory='Benzene_mdpow', molecule='benzene', 
             solvents=['water', 'octanol'], H_bond=False, dihedral_selections=[])

I plan to run dihedral analysis, hydrogen bonding analysis, and solvation shell analysis. The analyses run are defiantly expandable.

Implementation

To do this I plan to implement an analysis class that takes either a simulation path or GSolv object and loads the systems at each lambda into MDAnalysis universes. The hope is to assemble universes in an ensemble similar to this pull request in MDAnalysis. I don't think my implementation has to be as detailed since mostly it will be used for storing universes in an organized way and getting atom group selections of a set of similar universes in a more concise way. With the ensemble the systems for each lambda and each solvent can be more easily analyzed as a group.

As for the analysis most it probably is simplest to set it up based on the AnalysisBase class. Please leave any suggestions you have for how better to implement this. I'm not yet committed to any one idea for implementation.

To Do

I'll update this issue as I have progress on implementing these features.

VOD555 commented 3 years ago

We also can separate the analysis input keywords from the fep section of the .yml file, and add a new analysis section.

orbeckst commented 3 years ago

I would not have run_analysis perform multiple analysis tasks. Keep it simple along the lines of

class EnsembleAnalysis:
      """base class to perform an analysis tasks on all simulations of a FEP calculation"""

Then you can derive your different analyses from EnsembleAnalysis.

This will then easily allow you to run multiple EnsembleAnalysis over different molecules or analysis tasks. Think of coding building blocks instead of complete solutions.

orbeckst commented 3 years ago

To add: think of the hierarchy of simulations. I would consider the "ensemble" the smallest set of simulations that is necessary to calculate a solvation free energy, i.e., for a given solvent, all the Coulomb and VDW windows. That's the base unit to work with for an ensemble. So you need to first write analysis code to analyze a single simulation (ideally, code based on MDAnalysis.analysis.base.AnalysisBase for new code or using existing analysis classes (eg for dihedrals you can derive your own class from analysis.dihedrals.Dihedral.) Then the EnsembleAnalysis runs the analysis class over the individual trajectories. Initially that can just be in serial. Once you have serial code, we can think about parallelizing.

orbeckst commented 3 years ago

Finally: If you start with small units then it's a lot easier to write tests for them. In particular, I'd start with one (or two) AnalysisBase-based analysis class(es) because it/they will be immediately useful and can be easily tested.

Don't slow yourself down too much with thinking about tests – get to a prototype quickly.

orbeckst commented 3 years ago

It would be nice if the following would eventually work:

from mdpow.analysis.solvation import NearestSolvent

# run analysis to get number of water molecules in first solvation shell
nearest_solvent = NearestSolvent("benzene/water.pickle", selection="name OH", r1=3.0).run()

# plot the number of OH within 3 Å of the solute as violinplots split by interaction and for each lambda
# from the tidy dataframe (contains: interaction, lambda, time, n) for all simulations
seaborn.catplot(data=nearest_solvent.results.timeseries, 
                           x="lambda", y="n", col="interaction",
                           kind="violin", ...) 

The EnsembleAnalysis would assemble all the individual time series (stored as CVS files with each simulation in the analysis directory) as a tidy (but huge) pandas.DataFrame . It is then easy to analyze the data in many different ways and also concatenate these dataframes for higher order analysis (e.g., add a solvent column , a molecule_id column, and a forcefield column and then one can aggregate over all the data in SAMPL7).

ALescoulie commented 3 years ago

Analysis Modules

Now that the Ensemble objects in PR #179 are just about finished, I thought I'd come back to this issue to layout next steps. To start with I'd like to implement three analyses.

For the analysis modules some things need to be worked out.

I'd appreciate some input on how to build the DataFrames in a way that the data is ready for use immediately, and on how the user should interface with this module.

orbeckst commented 3 years ago

For data structures I recommend tidy data. One observation - one row. If you have N columns then N-1 columns are "tags" such as solvent, interaction, lambda, timestep, molecule, dihedral, .... and one column is the actual observation. This makes it easy to aggregate data using pandas groupby and similar tricks or directly plot with seaborn's categorical plots.

You might think that this wastes space by repeating information in the dataframe but experience shows that this is generally a no-issue compared to the ease of processing the data.

(There's more to read about tidy data, including Hadley Wickham's article https://www.jstatsoft.org/article/view/v059i10/ .)

orbeckst commented 3 years ago

For hydrogen bonds I would just try to wrap the MDAnalysis analysis class. Look at the datastructure produced from the HydrogenBondAnalysis class and see if you want to keep it like it is or make it part of a tidy dataframe.

As a guiding principle you want to do as little as data gymnastics as possible for analysis, i.e., avoid having custom data wrangling code like "average_by_lambdas" which then pulls data out of custom dictionaries. Instead leverage what's present in pandas. Learn about groupby and friends.

orbeckst commented 3 years ago

Initially, don't worry about specifying analysis parameters. Focus on the building blocks, which take parameters as args or kwargs and nothing else.

Once we work on the user interface (in the form of the CLI commands) then I would go with @VOD555 's suggestion https://github.com/Becksteinlab/MDPOW/issues/168#issuecomment-888585347 to add a analysis section to the runinput.yml.

orbeckst commented 3 years ago

@ALescoulie I suggest you open a new issue for new analysis modules by transferring your https://github.com/Becksteinlab/MDPOW/issues/168#issuecomment-916264923 . I would then open individual issues for the analysis tools that you want to build. Keep things modular and separate.

Although I didn't intent to close this issue with PR #179, it make sense. I re-read the issue and PR #179 addresses pretty much all points. I will add the issue number to CHANGES.