MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org

Adding WHAM to MDAnalysis #842

Closed · fiona-naughton closed this issue 7 years ago

fiona-naughton commented 8 years ago

It would be useful to have an implementation of WHAM with MDAnalysis to be able to run alongside other analysis of a set of trajectories from Umbrella Sampling (US), either a wrapper for an existing version or starting from scratch.

As I see it, it'd accept an array of pull force/reaction coordinate data from US simulations (and the relevant temperature/restraining potential/etc. information), or the list of files that contain that data, and return the PMF profile, optionally with estimated errors.

There are several existing WHAM implementations floating around that do more or less this, particularly:

  1. g_wham, distributed with GROMACS but doesn't require simulations to have been run using GROMACS (license: LGPL)
  2. Alan Grossfield’s standalone C implementation (license: GPL/BSD).
  3. PyWham, a python implementation (license: GPL)

There’s also a python implementation that uses the related UWHAM algorithm here (license: GPL). This supposedly converges faster than the original WHAM equations that (as far as I can tell) the above three implementations use, though g_wham at least has several measures in place to aid acceleration of convergence.
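
For reference, here is a minimal sketch of the standard 1D WHAM self-consistent iteration that these tools solve, just to make the inputs and outputs concrete (illustrative NumPy only, not a proposed implementation; all names are made up):

```python
import numpy as np

def wham_1d(hist, bin_centers, n_samples, restraint_centers, spring_k, kT,
            tol=1e-6, max_iter=100000):
    """Self-consistent 1D WHAM iteration (illustrative only).

    hist[i, b]                        -- counts of window i in bin b
    n_samples[i]                      -- total samples in window i
    restraint_centers[i], spring_k[i] -- harmonic bias of window i
    kT                                -- thermal energy, same units as the bias
    """
    hist = np.asarray(hist, float)
    bin_centers = np.asarray(bin_centers, float)
    n_samples = np.asarray(n_samples, float)
    restraint_centers = np.asarray(restraint_centers, float)
    spring_k = np.asarray(spring_k, float)

    # bias energy of every window evaluated at every bin centre
    bias = 0.5 * spring_k[:, None] * (bin_centers[None, :] - restraint_centers[:, None]) ** 2
    f = np.zeros(len(n_samples))        # per-window free-energy offsets ("F values")
    total_counts = hist.sum(axis=0)     # combined biased histogram

    for _ in range(max_iter):
        # P(b) = sum_i n_i(b) / sum_i N_i exp[(f_i - w_i(b)) / kT]
        denom = (n_samples[:, None] * np.exp((f[:, None] - bias) / kT)).sum(axis=0)
        prob = total_counts / denom
        # f_i = -kT ln sum_b exp[-w_i(b) / kT] P(b)
        f_new = -kT * np.log((np.exp(-bias / kT) * prob[None, :]).sum(axis=1))
        f_new -= f_new[0]               # fix the arbitrary zero
        if np.max(np.abs(f_new - f)) < tol:
            f = f_new
            break
        f = f_new

    pmf = -kT * np.log(prob)
    return pmf - pmf.min(), f
```

Whatever we wrap or write, the inputs boil down to per-window reaction-coordinate data plus the restraint parameters, and the output is the PMF (plus the per-window free-energy offsets).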

Some other points that might be worth considering:

jbarnoud commented 8 years ago

I had a brief look at the implementations you listed.

I also had a very quick look at UWHAM. It seems to be Python 2.7 only, while we try to be compatible with Python 3. If UWHAM is really worth the shot, we could work with them to make their code Python 3 compatible, and push them to get UWHAM submitted to PyPI.

Based on that, I would go for Grossfield's implementation. I'll have a deeper look at UWHAM later on.

kain88-de commented 8 years ago

Grossfield's license is super restrictive because he is using code from Numerical Recipes. We would need to ask the Numerical Recipes authors for confirmation. This information is a little bit hidden in his README. The other stuff is indeed BSD licensed; I'm not sure how much work it is to find out which functions are from Numerical Recipes.

jbarnoud commented 8 years ago

Good catch.

There are two functions from Numerical Recipes. They are in a separate directory, so finding them is not an issue; replacing them should be doable but would require careful testing, and will probably lead to a loss in performance.


kain88-de commented 8 years ago

probably lead to a loss in performance.

Not necessarily. The Numerical Recipes code is full of errors and bugs. Besides, modern CPUs have different requirements for fast code.

kain88-de commented 8 years ago

Interesting paper about WHAM: "Convergence and error estimation in free energy calculations using the weighted histogram analysis method". I'm not aware that the tests suggested in this paper are implemented in any of the standard WHAM tools.

orbeckst commented 8 years ago

Btw, if we want to do something with Grossfield's wham code then I can talk to him about it; I know Alan pretty well.

orbeckst commented 8 years ago

@fiona-naughton, can you outline how you would like to use wham with MDAnalysis, i.e., a mock up of what commands you would run if some form of wham was implemented?

Would this rely on using auxiliary information in trajectories #785?

I want to get a feel for what a user would gain and how it would be simpler as opposed to, say, extracting sampled CVs from trajectories or output files, writing them to data files, and then feeding those to 3rd-party tools (been there, done it, bought the T-shirt, published the paper).

fiona-naughton commented 8 years ago

I'd like something like profile = wham(window_data, restraint_info, start_time, end_time, [other options…]). I am mainly picturing it working with auxiliaries added per #785 / having all the window trajectories grouped per #843, so that profile = us_system.wham() works; it might also be nice to allow specifying external files.

I currently do this as you described, and while it does work, if I want to run WHAM as part of a larger analysis script I have to run a shell command from Python to run WHAM (I'm using g_wham) and then parse the resulting file(s). This seems a little nasty, and trying to find out what went wrong when the bash-command-from-Python fails doesn't seem straightforward (this might be particular to g_wham, though?). I also think it would be nice if the user had to worry less about the right file formatting.

In any case, it'd also be nice to have something like check_convergence(time_interval, …), which would run WHAM multiple times over different time intervals to show how/if the profile (specifically the well depth?) is changing, and suggest a 'convergence time' based on when this change falls below a specified cutoff. You can also calculate averages of other properties along the reaction coordinate based on some of the quantities calculated during WHAM, so it'd be nice to be able to do this given some function that calculates a particular property in MDAnalysis (us_system.property_average(function, args)?).
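
Something like the following is what I have in mind for the convergence check (purely illustrative — it assumes a us_system.wham() as above, a us_system.total_time attribute, and that the returned profile is an (n_bins, 2) array of (coordinate, free energy); none of this exists yet):

```python
import numpy as np

def check_convergence(us_system, time_interval=1000, cutoff=0.1, **wham_kwargs):
    """Hypothetical helper: re-run WHAM on growing time windows and report
    when the well depth changes by less than `cutoff` between intervals."""
    end_times = np.arange(time_interval, us_system.total_time + time_interval,
                          time_interval)
    depths = []
    for end in end_times:
        # profile assumed to be an (n_bins, 2) array of (coordinate, free energy)
        profile = us_system.wham(start_time=0, end_time=end, **wham_kwargs)
        depths.append(profile[:, 1].max() - profile[:, 1].min())
    depths = np.asarray(depths)
    settled = np.abs(np.diff(depths)) < cutoff
    convergence_time = end_times[1:][settled][0] if settled.any() else None
    return end_times, depths, convergence_time
```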

orbeckst commented 8 years ago

From my experience so far, any analysis done on groups of trajectories can be done well within the MDSynthesis framework. I think I would build a convenient WHAM analysis workflow on top of MDSynthesis because @dotsdl already solved many of the problems. This will still require the aux reader, of course. It would be convenient to have a native WHAM implementation (but if there's not enough time then calling an external command is still a good alternative).

I like the idea of using MDAnalysis to easily reweight any quantity calculated from windows. It would be nice if this was as easy as

reweighted_timeseries = reweight(simulations, analysis_function, pmf_function, ...)

where analysis_function(ts) returns the calculated quantity (e.g. a distance d_simulated) and the PMF reaction coordinate value xi, so that pmf(xi) gives the free energy and you can calculate d(xi) = d_simulated(xi) * exp(-pmf(xi)/kT) / integrate(exp(-pmf(xi)/kT), min(xi), max(xi)) (or something along those lines).
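
A very literal sketch of that average (array names are illustrative: xi_grid holds reaction-coordinate values, d_simulated and pmf the corresponding curves):

```python
import numpy as np

def reweight(xi_grid, d_simulated, pmf, kT=2.479):  # kT in kJ/mol at ~300 K
    """Boltzmann-weighted average of d over the reaction coordinate:
    <d> = int d(xi) exp(-pmf(xi)/kT) dxi / int exp(-pmf(xi)/kT) dxi"""
    weights = np.exp(-np.asarray(pmf) / kT)
    return np.trapz(np.asarray(d_simulated) * weights, xi_grid) / np.trapz(weights, xi_grid)
```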

I also think it would be nice if the user had to worry less about the right file formatting.

Yes, MDAnalysis attempts to be format-agnostic, i.e., once you have your (aux) trajectories you should be able to run the same kind of analysis no matter how you got the trajectories.

lukas-stelzl commented 8 years ago

Hi Fiona,

The PyEMMA project also contains WHAM. http://emma-project.org/latest/api/index_thermo.html

I played around with the test cases yesterday and it looks promising.

fiona-naughton commented 8 years ago

Starting up the discussion here again so we can start getting something worked out while I finish up the auxiliary/bundle stuff.

Not sure if there is a consensus about wrapping an existing implementation? In terms of functionality they seem similar enough, but I'm not sure if any would work best with MDAnalysis (thanks @lukas-stelzl for pointing out PyEMMA – it looks a little less straightforward than the others, but I'll have another proper look). I'd maybe lean towards Grossfield?

A WHAM analysis function could subclass a more generic bundle-analysis function, which accepts a bundle (from #900) and checks that the 'expected' auxiliaries/metadata specified in the analysis function are present for each trajectory before running; it would also allow passing in the actual names of the expected data in the bundle if these differ from the defaults.

So using WHAM might look like:

profile = WHAM(bundle, temp='T', restraint_constant='k', restrained_value='x', 
               pull_force='pullf', **kwargs) 

(With additional kwargs to pass to the WHAM implementation – start time, end time, etc.) It could also include an option for calculating errors, depending on implementation (it could return a tuple of profile, error), and we can add other functions to e.g. reweight along the reaction coordinate as @orbeckst suggested.
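
As a rough sketch of the shape this could take (everything here is hypothetical — the Bundle API from #900 isn't settled, and window.aux_list / window.metadata are placeholders for whatever it ends up exposing):

```python
class BundleAnalysis(object):
    """Hypothetical base class for analyses run over a bundle of US windows.

    Subclasses list the auxiliary and metadata names they expect; the base
    class checks these are present in every window before running."""
    required_aux = []        # default auxiliary names
    required_metadata = []   # default per-window metadata names

    def __init__(self, bundle, **name_overrides):
        self.bundle = bundle
        # allow e.g. temp='T' to map an expected quantity onto its actual
        # name in the bundle
        self.names = {key: name_overrides.get(key, key)
                      for key in self.required_aux + self.required_metadata}
        self._check_bundle()

    def _check_bundle(self):
        for window in self.bundle:
            for key in self.required_aux:
                if self.names[key] not in window.aux_list:       # placeholder API
                    raise ValueError("window is missing auxiliary '{0}'"
                                     .format(self.names[key]))
            for key in self.required_metadata:
                if self.names[key] not in window.metadata:       # placeholder API
                    raise ValueError("window is missing metadata '{0}'"
                                     .format(self.names[key]))

    def run(self, **kwargs):
        raise NotImplementedError


class WHAM(BundleAnalysis):
    required_aux = ['pull_force']
    required_metadata = ['temp', 'restraint_constant', 'restrained_value']

    def run(self, **kwargs):
        # gather the data from self.bundle, hand it to the WHAM backend,
        # and return the profile (and optionally errors)
        raise NotImplementedError
```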

Doesn't really apply to WHAM, but in general for other Bundle-analysis where we're running over each window in turn we can have a Bauhaus approach like in #719.

mnmelo commented 8 years ago

@jbarnoud and I have been looking into those numerical recipes pieces of code and they should be straightforward to replace. (And maybe even advisably so, at least in the case of the ran2 random number generator)

orbeckst commented 8 years ago

Ok sounds good.

Alternative: cythonize WHAM and then you can use numpy arrays instead of dealing with memory management at the C level (which is what a lot of the NR code is about, IIRC).


lukas-stelzl commented 8 years ago

An alternative to PyEMMA is pymbar, and I think quite a lot of development went into it: https://github.com/choderalab/pymbar/tree/master/examples . It also has an umbrella sampling test case. MBAR is closely related to WHAM. http://www.alchemistry.org/wiki/Multistate_Bennett_Acceptance_Ratio

mnmelo commented 8 years ago

Does pymbar do WHAM for umbrella sampling? Or does it just extend the MBAR concept?

orbeckst commented 8 years ago

Ask @dotsdl about his experience with the code base…


dotsdl commented 8 years ago

@lukas-stelzl iirc WHAM is related to MBAR in the sense that it will give the same answer with infinite sampling, but numerically it is a completely different method.

pymbar only gives a Python implementation of MBAR. I don't think it will help at all for doing anything with WHAM.

jbarnoud commented 8 years ago

My take on it would be to start with a wrapper around Grossfield's WHAM binary. It would be an interface that builds the input files for Grossfield's WHAM, runs the program, reads the output, and presents it to the user. This is, I think, the best way to have something that works as soon as possible. We can then discuss the user interface on something that we can actually use.

Later on, we could interface directly with the C source so we do not have to create the intermediate files. Also, once we have a working user interface, we can adapt it to other backends/methods if needed.
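
A minimal sketch of the core of such a wrapper, assuming the metadata file has already been written (the positional-argument order follows my reading of Grossfield's wham usage and should be double-checked against its documentation; names are illustrative):

```python
import subprocess
import numpy as np

def run_wham(metadata_file, hist_min, hist_max, n_bins=200, tol=1e-6,
             temperature=300.0, numpad=0, out_file='wham_free.dat',
             executable='wham'):
    """Run Grossfield's wham on an existing metadata file and return the
    free-energy profile it writes to `out_file`."""
    cmd = [executable, str(hist_min), str(hist_max), str(n_bins), str(tol),
           str(temperature), str(numpad), metadata_file, out_file]
    subprocess.check_call(cmd)               # raises if wham exits non-zero
    # the free-energy file is whitespace-separated with a commented header
    return np.loadtxt(out_file, comments='#')
```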

orbeckst commented 8 years ago


My take on it would be to start with a wrapper around Grossfield's WHAM binary. It would be an interface that builds the input files for Grossfield's WHAM, runs the program, reads the output, and presents it to the user. This is, I think, the best way to have something that works as soon as possible. We can then discuss the user interface on something that we can actually use.

Sounds like a good plan to get things rolling. It also won’t require us to immediately look at licensing issues (although for testing we will probably need to install wham).

(Many years ago I wrote a wrapper for input file generation for wham and although you're likely better off starting fresh, I nevertheless uploaded the code to GitHub so that anyone can have a look if they like: https://github.com/orbeckst/AdK_analysis/blob/master/python/PMF/wham.py – the code is perhaps a good example of why many scientists are reluctant to share code used in papers, because it's ugly, not general, not tested… but perhaps it can be useful nevertheless.)

Later on, we could interface directly with the C source so we do not have to create the intermediate files. Also, once we have a working user interface, we can adapt it to other backends/methods if needed.

Good plan. You can then also look at higher-dimensional wham (not just 1D like wham or 2D like wham2d).


fiona-naughton commented 8 years ago

Thanks all; apologies it's taken me a while to get back around to this.

So it looks like I'll start working with the Grossfield wham. Looking at running it from the command line, there are quite a few things required, but most of them we could make optional with set defaults (the number of bins and the tolerance to use; the values for optional bootstrap error analysis) or calculate values for (the max/min reaction coordinate values).

If the simulations all have the same temperature, what we'd need at minimum would be the spring constant, the value of the reaction coordinate restrained to, and the value of the reaction coordinate at each time point, all for each window. In an MDAnalysis wham, our input would have...

(If the simulations have different temperatures, there also needs to be a potential energy value at each time point for each window, which would be another auxiliary, plus a specified temperature to calculate at and a metadata name for the temperature of each window.)

The main parts of the output are the profile, the error (if bootstrap analysis was used), and 'F values' for each window (used to calculate weighted averages of other properties). I'm not sure what the best return format would be, since the error is optional, and I don't know how often the 'F values' would be used.
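
For the same-temperature case, generating the input files could look roughly like this (the metadata line format — time-series path, restraint centre, spring constant — is my reading of Grossfield's wham docs, and the per-window dict is just a placeholder for however we end up pulling this out of the bundle):

```python
import os
import numpy as np

def write_wham_input(windows, out_dir='wham_input'):
    """Write one (time, reaction coordinate) file per window plus the
    metadata file listing path, restraint centre and spring constant."""
    os.makedirs(out_dir)                     # assumes the directory is new
    metadata_path = os.path.join(out_dir, 'metadata.dat')
    with open(metadata_path, 'w') as meta:
        for i, window in enumerate(windows):
            ts_path = os.path.join(out_dir, 'window_{0:03d}.dat'.format(i))
            # window['time'] and window['xi'] stand in for the auxiliary data;
            # window['x0'] and window['k'] for the restraint centre and
            # spring constant metadata
            np.savetxt(ts_path, np.column_stack([window['time'], window['xi']]))
            meta.write('{0} {1} {2}\n'.format(ts_path, window['x0'], window['k']))
    return metadata_path
```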

jbarnoud commented 8 years ago

Sounds reasonable. What would the API look like? Could you show what a call would look like, and what wham command it would actually run?

fiona-naughton commented 8 years ago

I think extending from what I had above:

wham(bundle)
wham(bundle, start_time=1000)
wham(bundle, temp='T', restraint_constant='k', restrained_value='x', 
               pull_force='pullf')
wham(bundle, bootstrap=True)
wham(bundle, n_bins=500, min=1, max=5, periodic='radians')
wham(bundle, run_temp=300)

which would run the equivalent of the following from the command line, respectively (I'm taking the same default values g_wham uses for e.g. the number of bins; default min/max would have to be calculated):

wham [min] [max] 200 1e-06 [temp] 0 [input file] [output file]
  # where data for input file is taken from metadata/auxiliary assuming default names
  # temp should be the same for each simulation
wham [min] [max] 200 1e-06 [temp] 0 [input file] [output file]
  # input data of reaction coord values only uses time points from 1000 ps onwards
wham [min] [max] 200 1e-06 [T] 0 [input file] [output file]
  # where data for input file is taken from metadata/auxiliary with the provided names
wham [min] [max] 200 1e-06 [temp] 0 [input file] [output file] 200 [seed]
wham Ppi 1 5 500 1e-06 [temp] 0 [input file] [output file]
wham [min] [max] 200 1e-06 300 0 [input file] [output file]
  # temp can be different for each simulation

The return could be a 2 x n_bins array without bootstrap error analysis, and 3 x n_bins with (X, Y, dY); though I'm still not sure how/if to fit in the 'F values'.

(There's also the 2D wham, but it's probably better to stick with the 1D version for now.)

jbarnoud commented 8 years ago

min and max could have more explicit names; that would avoid shadowing the names of builtins. I would advocate for bootstrap being True by default.

I would also add a keep_files argument that causes the files produced for and by WHAM not to be deleted.

Finally, Grossfield's WHAM can be compiled to use kJ/mol or kcal/mol: we probably need an option to deal with this.

Anyway, it looks good to me. I am impatient to be able to play with this.

orbeckst commented 7 years ago

See comments on #900 and #923 : this is better implemented as a separate small package.

Many thanks to @fiona-naughton who gave us the infrastructure (aux) to make this (and other cool things...) possible!