Open wehs7661 opened 4 years ago
Thank you @wehs7661! I think this may be due to a change in how multi-indexes are handled in pandas
when resetting indices on a Series
(see my comment); I have yet to investigate when this change may have happened in pandas
' release history, but it must be fixed regardless.
This part of the library is in need of some updates to fix this issue, as well as to make the subsamplers easier to use for u_nk
and dHdl
dataframes with multiple timeseries concatenated together and having multiple -lambda
indexes. When the subsamplers were originally written, we had a much smaller set of datasets in alchemtest
, and so the assumptions of their implementations are currently too constricted. They currently cannot handle expanded ensemble cases without a lot of pandas-fu beforehand, even if the error above is fixed.
To my understanding, the input parameter df of the subsampling methods should be the matrix/DataFrame extracted from the xvg file, while series could be any time series/column of df. I'm not sure if this understanding is correct, but it might be worthy to add more detailed explanations of these two parameters in the documentation.
Your understanding of the usage of these functions is correct! i generally choose the column of df
to feed in for series
to be one that is a "real" lambda state, since energy values for states in which, i.e. a substrate has disappeared, are insensitive to the state of the substrate. I don't think this is a hard rule, however, since achieving equilibrium is a whole-system-level statement. @mrshirts may have some thoughts on this.
Hi @dotsdl, it would be great if the issue will be fixed! Also thank you for the detailed explanation!
Regarding equilibriumDetection
: @VOD555 recently tried to use it for simple solvation free energy calculations and it really did not produce any useable results. Instead he came up with a different way to assess equilibration, as explained in [1], using forward and reverse convergence plots as a starting point, as also suggested in @davidlmobley et al's best practices paper [2]. Once you have determined how much initial data to throw away as equilibration you then use statisticalInefficiency
on the remaining data. This is crucial to get decorrelated data so that the simple variances that alchemlyb estimators produce are meaningful. (Eventually bootstrap estimates over decorrelated data #80 should help...) As suggested by [2], we've been using the dH/dlambda (TI) timeseries as input for the statisticalInefficiency
and then calculate both TI and MBAR for the subsampled data.
[1] S. Fan, B. I. Iorga, and O. Beckstein. Prediction of octanol-water partition coefficients for the SAMPL6- log P molecules using molecular dynamics simulations with OPLS-AA, AMBER and CHARMM force fields. Journal of Computer-Aided Molecular Design, e-print, 2020. https://doi.org/10.1007/s10822-019-00267-z
[2] P. V. Klimovich, M. R. Shirts, and D. L. Mobley. Guidelines for the analysis of free energy calculations. J Comput Aided Mol Des, 29(5):397–411, May 2015. https://doi.org/10.1007/s10822-015-9840-9
Hi @wehs7661, apologies for the radio silence. Are you still running into this issue with latest pandas
on the current release of alchemlyb
?
Hi @dotsdl, no worries at all, and thanks for following up! So I just tried first upgrading my version of pandas
(to version 1.11.0) and using alchemlyb
on the master branch instead, but I still got the same error. (I was originally using the one on the branch refactor-subsampling
and it worked for me.)
Cool, thank you for this feedback @wehs7661! This helps me to prioritize effort appropriately. I'll proceed with the refactor then.
Hi
alchemlyb
developers, Thank you all for the development of this package. I was usingalchemlyb
to analyze some output files from the simulation performed by GROMACS. When using the functionequilibrium_detection
with the input parametersdf
andseries
being the wholeu_nk
matrix extracted from a*dhdl.xvg
file (by the functionextract_u_nk
) and the first column ofu_nk
(which isu_nk[(0.0, 0.0, 0.0, 0.0)]
), I got the following error:Tracing back the error, I found that this error indicated that there was no
time
index in the seriesu_nk[(0.0, 0.0, 0.0, 0.0)]
. However,series.index
in this case did returntime
as one of the names of the indices. I am not sure if this is an issue of data types. I tried turning the data type ofu_nk[(0.0, 0.0, 0.0, 0.0)]
from Series into DataFrame, but same error occurred. Here I reproduced the error in a jupyter notebook, in which I also added some comments and explanations of the problem(alchemlyb_issue.zip).To my understanding, the input parameter
df
of the subsampling methods should be the matrix/DataFrame extracted from thexvg
file, whileseries
could be any time series/column ofdf
. I'm not sure if this understanding is correct, but it might be worthy to add more detailed explanations of these two parameters in the documentation.Thanks a lot!