alchemistry / alchemlyb

the simple alchemistry library
https://alchemlyb.readthedocs.io
BSD 3-Clause "New" or "Revised" License
199 stars 51 forks source link

A problem with slicing in subsampling #95

Open wehs7661 opened 4 years ago

wehs7661 commented 4 years ago

Hi alchemlyb developers, Thank you all for the development of this package. I was using alchemlyb to analyze some output files from the simulation performed by GROMACS. When using the function equilibrium_detection with the input parameters df and series being the whole u_nk matrix extracted from a *dhdl.xvg file (by the function extract_u_nk) and the first column of u_nk (which is u_nk[(0.0, 0.0, 0.0, 0.0)]), I got the following error:

KeyError: Index(['time'], dtype='object')

Tracing back the error, I found that this error indicated that there was no time index in the series u_nk[(0.0, 0.0, 0.0, 0.0)] . However, series.index in this case did return time as one of the names of the indices. I am not sure if this is an issue of data types. I tried turning the data type of u_nk[(0.0, 0.0, 0.0, 0.0)] from Series into DataFrame, but same error occurred. Here I reproduced the error in a jupyter notebook, in which I also added some comments and explanations of the problem(alchemlyb_issue.zip).

To my understanding, the input parameter df of the subsampling methods should be the matrix/DataFrame extracted from the xvg file, while series could be any time series/column of df. I'm not sure if this understanding is correct, but it might be worthy to add more detailed explanations of these two parameters in the documentation.

Thanks a lot!

dotsdl commented 4 years ago

Thank you @wehs7661! I think this may be due to a change in how multi-indexes are handled in pandas when resetting indices on a Series (see my comment); I have yet to investigate when this change may have happened in pandas' release history, but it must be fixed regardless.

This part of the library is in need of some updates to fix this issue, as well as to make the subsamplers easier to use for u_nk and dHdl dataframes with multiple timeseries concatenated together and having multiple -lambda indexes. When the subsamplers were originally written, we had a much smaller set of datasets in alchemtest, and so the assumptions of their implementations are currently too constricted. They currently cannot handle expanded ensemble cases without a lot of pandas-fu beforehand, even if the error above is fixed.

To my understanding, the input parameter df of the subsampling methods should be the matrix/DataFrame extracted from the xvg file, while series could be any time series/column of df. I'm not sure if this understanding is correct, but it might be worthy to add more detailed explanations of these two parameters in the documentation.

Your understanding of the usage of these functions is correct! i generally choose the column of df to feed in for series to be one that is a "real" lambda state, since energy values for states in which, i.e. a substrate has disappeared, are insensitive to the state of the substrate. I don't think this is a hard rule, however, since achieving equilibrium is a whole-system-level statement. @mrshirts may have some thoughts on this.

wehs7661 commented 4 years ago

Hi @dotsdl, it would be great if the issue will be fixed! Also thank you for the detailed explanation!

orbeckst commented 4 years ago

Regarding equilibriumDetection: @VOD555 recently tried to use it for simple solvation free energy calculations and it really did not produce any useable results. Instead he came up with a different way to assess equilibration, as explained in [1], using forward and reverse convergence plots as a starting point, as also suggested in @davidlmobley et al's best practices paper [2]. Once you have determined how much initial data to throw away as equilibration you then use statisticalInefficiency on the remaining data. This is crucial to get decorrelated data so that the simple variances that alchemlyb estimators produce are meaningful. (Eventually bootstrap estimates over decorrelated data #80 should help...) As suggested by [2], we've been using the dH/dlambda (TI) timeseries as input for the statisticalInefficiency and then calculate both TI and MBAR for the subsampled data.

[1] S. Fan, B. I. Iorga, and O. Beckstein. Prediction of octanol-water partition coefficients for the SAMPL6- log P molecules using molecular dynamics simulations with OPLS-AA, AMBER and CHARMM force fields. Journal of Computer-Aided Molecular Design, e-print, 2020. https://doi.org/10.1007/s10822-019-00267-z

[2] P. V. Klimovich, M. R. Shirts, and D. L. Mobley. Guidelines for the analysis of free energy calculations. J Comput Aided Mol Des, 29(5):397–411, May 2015. https://doi.org/10.1007/s10822-015-9840-9

dotsdl commented 4 years ago

Hi @wehs7661, apologies for the radio silence. Are you still running into this issue with latest pandas on the current release of alchemlyb?

wehs7661 commented 4 years ago

Hi @dotsdl, no worries at all, and thanks for following up! So I just tried first upgrading my version of pandas (to version 1.11.0) and using alchemlyb on the master branch instead, but I still got the same error. (I was originally using the one on the branch refactor-subsampling and it worked for me.)

dotsdl commented 4 years ago

Cool, thank you for this feedback @wehs7661! This helps me to prioritize effort appropriately. I'll proceed with the refactor then.