alchemistry / alchemlyb

the simple alchemistry library
https://alchemlyb.readthedocs.io
BSD 3-Clause "New" or "Revised" License
189 stars 49 forks source link

AMBER parser had a "too strict" regex to extract values #273

Closed DrDomenicoMarson closed 1 year ago

DrDomenicoMarson commented 1 year ago

This PR addresses issue #272. This issue exposed a nasty bug inside the regex pattern we used to extract sections in the amber output file.

we are trying to match a sequence of type

filed = int/float

extracting the int/float corresponding to the field. Up till now, the pattern was

fr' {field}\s+=\s+(\*+|{_FP_RE}|\d+)'

were field was the string we want to extract the value for, and _FP_RE is defined to extract the float/int after the equal.

The problem arises when the field and/or the value are not separated from the = sign by a space (as was the case in the issue). It's strange this didn't appear before!

I just changed the + in \s+=\s+ to *.

Thinking about it, this should not break anything else, but I can't be 100% sure and the tests should catch any "macro" problems eventually introduced with this PR.

codecov[bot] commented 1 year ago

Codecov Report

Merging #273 (362b2a3) into master (6e19d4e) will increase coverage by 0.00%. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #273   +/-   ##
=======================================
  Coverage   98.69%   98.69%           
=======================================
  Files          26       26           
  Lines        1759     1762    +3     
  Branches      378      379    +1     
=======================================
+ Hits         1736     1739    +3     
  Misses          3        3           
  Partials       20       20           
Impacted Files Coverage Δ
src/alchemlyb/parsing/amber.py 99.15% <100.00%> (+0.01%) :arrow_up:

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

xiki-tempula commented 1 year ago

Many thanks for the quick fix. Do you mind add a new file to alchemtest to make sure that we don't break it in the future? Thanks. A sample file as short as

          -------------------------------------------------------
          Amber 20 PMEMD                              2020
          -------------------------------------------------------

--------------------------------------------------------------------------------
   2.  CONTROL  DATA  FOR  THE  RUN
--------------------------------------------------------------------------------

BioSimSpace Syst                                                                

General flags:
     imin    =       0, nmropt  =       0

Replica exchange
     numexchg=     100, rem=       3

Nature and format of input:
     ntx     =       1, irest   =       0, ntrx    =       1

Nature and format of output:
     ntxo    =       2, ntpr    =     200, ntrx    =       1, ntwr    =   25000
     iwrap   =       1, ntwx    =   25000, ntwv    =       0, ntwe    =       0
     ioutfm  =       1, ntwprt  =       0, idecomp =       0, rbornstat=      0

Potential function:
     ntf     =       1, ntb     =       2, igb     =       0, nsnb    =      25
     ipol    =       0, gbsa    =       0, iesp    =       0
     dielc   =   1.00000, cut     =   8.00000, intdiel =   1.00000

Frozen or restrained atoms:
     ibelly  =       0, ntr     =       0

Molecular dynamics:
     nstlim  =      1250, nscm    =      1000, nrespa  =         1
     t       =   0.00000, dt      =   0.00400, vlimit  =  -1.00000

Langevin dynamics temperature regulation:
     ig      =  771064
     temp0   = 298.00000, tempi   = 298.00000, gamma_ln=   2.00000

Pressure regulation:
     ntp     =       1
     pres0   =   1.01325, comp    =  44.60000, taup    =   1.00000
     Monte-Carlo Barostat:
     mcbarint  =     100

SHAKE:
     ntc     =       2, jfastw  =       0
     tol     =   0.00001

Free energy options:
     icfe    =       1, ifsc    =       1, klambda =       1
     clambda =  0.0000, scalpha =  0.5000, scbeta  =  1.0000
     sceeorder =       2
     dynlmb =  0.0000 logdvdl =       1

FEP MBAR options:
     ifmbar  =       1,  bar_intervall =      100
     mbar_states =       4

| Intermolecular bonds treatment:
|     no_intermolecular_bonds =       1

| Energy averages sample interval:
|     ene_avg_sampling =     200

Ewald parameters:
     verbose =       0, ew_type =       0, nbflag  =       1, use_pme =       1
     vdwmeth =       1, eedmeth =       1, netfrc  =       1
     Box X =   42.856   Box Y =   42.856   Box Z =   42.856
     Alpha =   90.000   Beta  =   90.000   Gamma =   90.000
     NFFT1 =   48       NFFT2 =   48       NFFT3 =   48
     Cutoff=    8.000   Tol   =0.100E-04
     Ewald Coefficient =  0.34864
     Interpolation order =    4

| PMEMD ewald parallel performance parameters:
|     block_fft =    0
|     fft_blk_y_divisor =    2
|     excl_recip =    0
|     excl_master =    0
|     atm_redist_freq =  320
     TI Mask 1 @1-33; matches     33 atoms
     TI Mask 2 @34-70; matches     37 atoms
     TI region 1:    7780 atoms
     TI region 2:    7784 atoms
     SC Mask 1 @29-33; matches      5 atoms
     SC Mask 2 @62-70; matches      9 atoms

    MBAR - lambda values considered:
       4 total:  0.0000 0.3333 0.6667 1.0000
    Extra energies will be computed     12 times.

|--------------------------------------------------------------------------------------------
| Extra TI control variables
|     gti_add_sc     =   5, gti_ele_gauss  =   0, gti_auto_alpha =   0, gti_scale_beta =   1
|     gti_ele_exp    =   2, gti_vdw_exp    =   2, gti_ele_sc     =   1, gti_vdw_sc     =   1
|     gti_cut        =   1, gti_cut_sc     =   2
|     gti_cut_sc_on    =  6.0000, gti_cut_sc_off    =  8.0000
|--------------------------------------------------------------------------------------------

| MONTE CARLO BAROSTAT IMPORTANT NOTE:
|   The Monte-Carlo barostat does not require the virial to adjust the system volume.
|   Since it is an expensive calculation, it is skipped for efficiency. A side-effect
|   is that the reported pressure is always 0 because it is not calculated.

--------------------------------------------------------------------------------
   3.  ATOMIC COORDINATES AND VELOCITIES
--------------------------------------------------------------------------------

BioSimSpace Syst                                                                
 begin time read from input coords =     0.000 ps

| REMD: Pressure/volume correction to exchange calc active for TREMD/HREMD.
     Molecule     1 is partially softcore
     Molecule     2 is partially softcore
 Number of triangulated 3-point waters found:     2581
 Number of shake restraints removed in TI region  1 :        0
 Number of shake restraints removed in TI region  2 :        0

     Sum of charges for TI region  1 =  -0.00000004
     Skip neutralizing charges...

     Sum of charges for TI region  2 =  -0.00000000
     Skip neutralizing charges...

| Dynamic Memory, Types Used:
| Reals              664467
| Integers           534749

| Nonbonded Pairs Initial Allocation:     1782276

| GPU memory information (estimate):
| KB of GPU memory in use:     52270
| KB of CPU memory in use:     17998

| Running AMBER/MPI version on    1 MPI task

--------------------------------------------------------------------------------
   4.  RESULTS
--------------------------------------------------------------------------------

 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

MBAR Energy analysis:
Energy at 0.0000 =    -24991.812173
Energy at 0.3333 =    -24988.328441
Energy at 0.6667 =    -23067.164349
Energy at 1.0000 =     31035.752870
 ------------------------------------------------------------------------------

 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

MBAR Energy analysis:
Energy at 0.0000 =    -25070.041628
Energy at 0.3333 =    -25073.157930
Energy at 0.6667 =    -25042.604724
Energy at 1.0000 =    -24808.803339
 ------------------------------------------------------------------------------

Would be fine.

DrDomenicoMarson commented 1 year ago

I don't think we need this file in particular. As it is, it's just like some of the "right" files in the current alchemtest.

The thing is, before this fix to extract the value from a field = value couple, a \s before and after the equal sign were mandatory. With the suggested new regex, the \s are no more mandatory, and so the case filed=value is caught by the regex.

I don't know which tests could be created to specifically check this as if this fails, almost all the tests should fail directly.

xiki-tempula commented 1 year ago

@DrDomenicoMarson So one purpose of the test is to ensure that the fix works. The other purpose of the test is to ensure that it will work in the future. Say ten years into the future and some guy will come in and say the current amber parser is not good enough and need a refactor. We need to make sure that this new refactored parser would handle all the difficult cases that we had trouble before. Having a test would ensure that this case would be handled correctly in the future.

In this case, people would look at the current fr' {field}\s+=\s+(\*+|{_FP_RE}|\d+)' and say it looks good, we now know that this is not good enough and need a test case to show the future people why this won't work.

DrDomenicoMarson commented 1 year ago

Yeah, now I understand better, but I think that the file that needs to be added should have somewhere in the format "field =value" and/or "field=value" and/or "field= value", as these are the cases that are not covered by the current tests. I agree this should be added, I didn't get it from the proposed test file :-)

xiki-tempula commented 1 year ago

but I think that the file that needs to be added should have somewhere in the format "field =value" and/or "field=value" and/or "field= value", as these are the cases that are not covered by the current tests.

Exactly, sorry for the confusion, I just find a random amber output file in my hand to show that in this case, one don't need the whole amber output file and a few lines showing the important bit is fine.

DrDomenicoMarson commented 1 year ago

Great, I added PR 86 in alchemtest, adding the two files needed to cover the two cases: a "field=value" without spaces around the "=", and I think it would be good to raise an error when no "starting time" can be read, instead of just failing down the line, so I added the relevant test and files.

xiki-tempula commented 1 year ago

Many thanks for the fix, so for the PR which involves the edits in alchemtest, the recommended practice is to change the line

https://github.com/alchemistry/alchemlyb/blob/master/.github/workflows/ci.yaml#L58

to your PR in alchemtest https://github.com/DrDomenicoMarson/alchemtestDM/archive/refs/heads/master.zip

Then the test should all pass and you will ask for a review of the PR in both alchemtest and alchemlyb. This should make the review process much easier. For this PR you would need a change log as well.

DrDomenicoMarson commented 1 year ago

I tried to follow the suggestion, I hope now it will work fine! I added the relevant CHANGELOG, but I don't know if the format I used is correct.

xiki-tempula commented 1 year ago

Sorry didn't realise that the output is a dictionary, you could check if both dhdl and u_nk are dataframe