alchemistry / alchemlyb

the simple alchemistry library
https://alchemlyb.readthedocs.io
BSD 3-Clause "New" or "Revised" License
189 stars 49 forks source link

Behaviour of amber parser on file with duplicated time #315

Closed xiki-tempula closed 1 year ago

xiki-tempula commented 1 year ago

In many occasions, one would need to run simulation in chunks. For 10 ns simulation, one run the first ns, the seconds ns and so on. For Gromacs, this is easy and the output would be a continuous file. However, given that AMBER doesn't have a checkpoint continue mechanism. A convention to do continuation would be that one generate a number of amber output files and concatenate the text files together. This would result in

 NSTEP =      200   TIME(PS) =      40.400  TEMP(K) =   300.75  PRESS =     0.0
 Etot   =    -17617.2822  EKtot   =      3869.1100  EPtot      =    -21486.3921
 BOND   =         7.5660  ANGLE   =       100.4182  DIHED      =         6.3933
 1-4 NB =         6.9516  1-4 EEL =       -34.1688  VDWAALS    =      3077.9556
 EELEC  =    -24653.8963  EHBOND  =         0.0000  RESTRAINT  =         2.3882
 EAMBER (non-restraint)  =    -21488.7803
 EKCMT  =         0.0000  VIRIAL  =         0.0000  VOLUME     =     65829.0102
                                                    Density    =         0.9878
 TEMP0  =       298.0000  REPNUM  =              1  EXCHANGE#  =              1
 ------------------------------------------------------------------------------

...

MBAR Energy analysis:
Energy at 0.0000 =    -21510.840230
Energy at 0.0667 =    -21507.963581
Energy at 0.1333 =    -21490.171333
Energy at 0.2000 =    -21448.494884
Energy at 0.2667 =    -21379.504007
Energy at 0.3333 =    -21284.274183
Energy at 0.4000 =    -21167.309140
Energy at 0.4667 =    -21035.636518
Energy at 0.5333 =    -20897.816290
Energy at 0.6000 =    -20762.961241
Energy at 0.6667 =    -20640.032587
Energy at 0.7333 =    -20537.016108
Energy at 0.8000 =    -20460.060524
Energy at 0.8667 =    -20412.262820
Energy at 0.9333 =    -20391.474269
Energy at 1.0000 =    -20388.088606
 ------------------------------------------------------------------------------
...
NSTEP =      200   TIME(PS) =      40.400  TEMP(K) =   300.75  PRESS =     0.0
 Etot   =    -17617.2822  EKtot   =      3869.1100  EPtot      =    -21486.3921
 BOND   =         7.5660  ANGLE   =       100.4182  DIHED      =         6.3933
 1-4 NB =         6.9516  1-4 EEL =       -34.1688  VDWAALS    =      3077.9556
 EELEC  =    -24653.8963  EHBOND  =         0.0000  RESTRAINT  =         2.3882
 EAMBER (non-restraint)  =    -21488.7803
 EKCMT  =         0.0000  VIRIAL  =         0.0000  VOLUME     =     65829.0102
                                                    Density    =         0.9878
 TEMP0  =       298.0000  REPNUM  =              1  EXCHANGE#  =              1
 ------------------------------------------------------------------------------

...

MBAR Energy analysis:
Energy at 0.0000 =    -21510.840230
Energy at 0.0667 =    -21507.963581
Energy at 0.1333 =    -21490.171333
Energy at 0.2000 =    -21448.494884
Energy at 0.2667 =    -21379.504007
Energy at 0.3333 =    -21284.274183
Energy at 0.4000 =    -21167.309140
Energy at 0.4667 =    -21035.636518
Energy at 0.5333 =    -20897.816290
Energy at 0.6000 =    -20762.961241
Energy at 0.6667 =    -20640.032587
Energy at 0.7333 =    -20537.016108
Energy at 0.8000 =    -20460.060524
Energy at 0.8667 =    -20412.262820
Energy at 0.9333 =    -20391.474269
Energy at 1.0000 =    -20388.088606
 ------------------------------------------------------------------------------

So you have duplicated time points in the same input file. The current behaviour is that when we reach the end of the first file 5. TIMINGS, we close the file parser, so the rest of the information is gone, which is not ideal.

There are several ways of solving this problem. The easiest way would be to just to remove this break https://github.com/alchemistry/alchemlyb/blob/master/src/alchemlyb/parsing/amber.py#L361 This would allow the parser to continue. On the other hand, instead of logging the data for 0~1 ns, 0~1 ns, 0~1 ns. This would log 0~1 ns, 1~2 ns, 2~3 ns. In some sense this would mean that we are not reading the data as it is. On the other hand, this would probably be the expected behaviour in this case so this might be fine.

The more involved way of solving this problem might be to log the data as it is, For example, 0~1 ns, 0~1 ns, 0~1 ns. Then the user might need to change the time point by themselves.

I'm in favour of the easy solution but is not sure of this would create some additional problem @DrDomenicoMarson Here is an example input file , note that only 50 data points are loaded. amber.out.zip

DrDomenicoMarson commented 1 year ago

Hi @xiki-tempula , In dealing with the exact situation you are describing I usually just read each output file from the same simulation and concatenate them via concat from alchemlyb, without any issue!

The time for each file is not directly read from the output file, but is computed from the starting time, the dt, and the steps between output.

The only problem may arise if one tries to concatenate files from different simulations, but this is the case also for GROMACS I think!

xiki-tempula commented 1 year ago

@DrDomenicoMarson Thanks. Do you think it would be fine if I just remove https://github.com/alchemistry/alchemlyb/blob/master/src/alchemlyb/parsing/amber.py#L361 Then issue a warning if there are more data after the finish?

DrDomenicoMarson commented 1 year ago

I don't have the time to test this right now, but I think it shouldn't be an issue!

I think there should be an Error and not a Warning, if data is found after the finish because I don't know if the logic that is used now would deal with a single file that is a concatenation of several output AMBER files though.

xiki-tempula commented 1 year ago

Close this via the concatenation of the pandas dataframe #316

xiki-tempula commented 1 year ago

An error should be thrown if the input is a concatenated amber output file