alchemistry / alchemlyb

the simple alchemistry library
https://alchemlyb.readthedocs.io
BSD 3-Clause "New" or "Revised" License

NAMD - decorrelation and equilibration detection fails in v0.7 and 1.0 #274

Closed EzryStIago closed 1 year ago

EzryStIago commented 1 year ago

Recent updates (to preprocessing.py, I believe) have broken our scripts' ability to use decorrelation and equilibration detection for NAMD output. I have attached an MWE that includes some sample data from alchemtest. The most recent version that works for these test cases is alchemlyb 0.6.0. I have included sample logs for runs using 0.6, 0.7, and 1.0.

Finally, it looks like decorrelation should work on the entire dataframe based on the documentation, but we have had to separate the dataframe by fep-lambda. Otherwise, decorrelate_u_nk returns just fep-lambda=0.

Any clarification or advice would be helpful!

MWE: alchemlyb_crash_MWE.zip

xiki-tempula commented 1 year ago

@EzryStIago Hi Ezry, many thanks for reporting.

This is a tricky issue. Part of the decorrelate_u_nk function is the sanitisation of the data, which drops all rows that contain NaN. Before 0.7, the subsampling code base was inconsistent: some parts would drop rows with NaN while others wouldn't. Version 0.7 standardised the subsampling module so that all functions now behave the same way, which means they all implicitly drop rows that contain NaN.

In this case, the data frame is

                    0.0       0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
time    fep-lambda                                                            
5000.0  0.0         0.0 -3.696651  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
5010.0  0.0         0.0 -3.643142  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
5020.0  0.0         0.0 -3.214734  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
5030.0  0.0         0.0 -4.236102  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
5040.0  0.0         0.0 -4.873010  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
...                 ...       ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
49960.0 0.0         0.0 -2.994324  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
49970.0 0.0         0.0 -3.477750  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
49980.0 0.0         0.0 -3.645658  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
49990.0 0.0         0.0 -3.753682  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
50000.0 0.0         0.0 -3.558601  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
[4501 rows x 11 columns]

So dropping the NaN rows empties the data frame, which causes the error.
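The mechanism can be sketched in a few lines of pandas (a minimal stand-in frame, not alchemlyb's actual internals): because every row in a NAMD-style u_nk frame has NaN in the non-adjacent lambda columns, a blanket `dropna()` removes every row.

```python
import numpy as np
import pandas as pd

# A tiny u_nk-style frame mimicking NAMD output: only the adjacent
# lambda column (0.1) is populated; the rest are NaN.
df = pd.DataFrame(
    {0.0: [0.0, 0.0], 0.1: [-3.7, -3.6], 0.2: [np.nan, np.nan]},
    index=pd.MultiIndex.from_tuples(
        [(5000.0, 0.0), (5010.0, 0.0)], names=["time", "fep-lambda"]
    ),
)

print(len(df))           # 2 rows before sanitisation
print(len(df.dropna()))  # 0 rows: every row contains at least one NaN
```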

Sorry, I'm not too familiar with NAMD. Would you mind giving some context on why there are so many NaNs?

@dotsdl I noticed that you added the line df = df.dropna(), which drops rows containing NaN, into slicing in https://github.com/alchemistry/alchemlyb/commit/a655f3a98250720bc9074ed800e85d543599d4a0#diff-82d2833cef709c2bd46823ce8a352cb3baac5d0a77386953ab3f538d908dd7a0R29 I wonder why you did this? Thank you.

EzryStIago commented 1 year ago

Hi @xiki-tempula, thank you for your response.

NAMD doesn't calculate dE for all lambda pairs, just the adjacent lambdas; those NaNs correspond to the remaining pairs. Of course, that restricts us to BAR estimation.

xiki-tempula commented 1 year ago

@EzryStIago Many thanks for the explanation. I have made a fix so that the pre-processing no longer drops the NaN rows. Would you mind testing whether it fits your purpose? https://github.com/alchemistry/alchemlyb/pull/275

Finally, it looks like decorrelation should work on the entire dataframe based on the documentation, but we have had to separate the dataframe by fep-lambda. Otherwise, decorrelate_u_nk returns just fep-lambda=0.

Sorry, I don't quite understand this part. Do you mean that you have to make sure each dataframe has only one fixed lambda value, instead of multiple different lambdas in one dataframe?

EzryStIago commented 1 year ago

Thank you @xiki-tempula!

275 does indeed behave as expected. I'm getting slightly different results for my free energy estimates (up to a tenth of a kcal/mol) and slightly larger equilibrated/decorrelated dataframes. They're still within error of the original estimates, though.

Do you mean that you have to make sure each dataframe only has one fixed lambda value? Instead of multiple different lambdas in one dataframe.

Exactly. Perhaps this is a peculiarity of NAMD, but all data are output to the same file by default. To use decorrelation or equilibrium detection, I first split the dataframe by lambda, process each piece, and reassemble. Is that the intended usage? If so, this is a non-issue.
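The split-process-reassemble workaround described above can be sketched as follows. To keep the example self-contained, `decorrelate_stub` stands in for alchemlyb's actual decorrelation (here it simply keeps every other row); in a real workflow it would be alchemlyb.preprocessing.subsampling.decorrelate_u_nk.

```python
import pandas as pd

def decorrelate_stub(df):
    # Placeholder for per-window decorrelation: keep every other row.
    return df.iloc[::2]

# A toy multi-lambda u_nk frame with two fep-lambda windows.
u_nk = pd.DataFrame(
    {0.0: [0.0] * 4, 0.1: [-3.7, -3.6, 1.2, 1.3]},
    index=pd.MultiIndex.from_tuples(
        [(0.0, 0.0), (10.0, 0.0), (0.0, 0.1), (10.0, 0.1)],
        names=["time", "fep-lambda"],
    ),
)

# Split on the fep-lambda index level, process each window separately,
# then reassemble into one frame.
pieces = [decorrelate_stub(g) for _, g in u_nk.groupby(level="fep-lambda")]
decorrelated = pd.concat(pieces)
print(len(decorrelated))  # 2: one surviving row per lambda window
```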

xiki-tempula commented 1 year ago

@EzryStIago

Exactly. Perhaps this is a peculiarity of NAMD, but all data are output to the same file by default. To use decorrelation or equilibrium detection, I first split the dataframe by lambda, process each piece, and reassemble. Is that the intended usage? If so, this is a non-issue.

I see. This is a tricky point. subsampling.decorrelate_u_nk currently only supports a single lambda per dataframe. I guess I need to state this in the documentation.

For NAMD, if all lambda windows are dumped to the same file, then they need to be separated and decorrelated individually. I think the best solution would be for alchemlyb.parsing.namd.extract_u_nk to support an optional keyword that returns a list of dataframes instead of a single dataframe.

I don't really know much about NAMD. I noticed that @jhenin and @ttjoseph have previously contributed to the NAMD parser. I wonder if there is any advice on the best way forward? Thanks.

jhenin commented 1 year ago

Thanks for your suggestion @xiki-tempula! That seems doable, but it involves a change to the extract_u_nk API. A lighter-weight approach would be for the code calling extract_u_nk to split the dataframe into single lambdas, pass each to decorrelate_u_nk, and then concat them back. We could add a wrapper function alchemlyb.parsing.namd.decorrelate that does this. Better still, this step could be integrated into decorrelate_u_nk itself, making the whole process transparent and robust for users with custom workflows that pass multi-lambda dataframes.
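A hypothetical sketch of the proposed wrapper (the name `decorrelate` and its signature are assumptions for illustration, not existing alchemlyb API): split the multi-lambda frame on the fep-lambda index level, apply a per-window decorrelation callable, and concat the results back.

```python
import pandas as pd

def decorrelate(u_nk, decorrelate_one):
    """Hypothetical wrapper: apply a single-lambda decorrelation function
    to each fep-lambda window of a multi-lambda u_nk frame, then
    reassemble the surviving rows into one frame."""
    pieces = [
        decorrelate_one(group)
        for _, group in u_nk.groupby(level="fep-lambda", sort=False)
    ]
    return pd.concat(pieces)
```

In practice, `decorrelate_one` would be something like alchemlyb.preprocessing.subsampling.decorrelate_u_nk, which this thread notes only supports single-lambda input.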

EzryStIago commented 1 year ago

Thank you @xiki-tempula!