OpenBioSim / biosimspace

An interoperable Python framework for biomolecular simulation.
https://biosimspace.openbiosim.org
GNU General Public License v3.0

[BUG] AMBER time series records are non-deterministic #26

Closed xiki-tempula closed 1 year ago

xiki-tempula commented 1 year ago

Describe the bug: When I run a simulation with BSS.Protocol.Production(report_interval=1), I expect to get at least runtime/timestep/report_interval records from process.getRecords().

To Reproduce

import BioSimSpace.Sandpit.Exscientia as BSS
mol = BSS.Parameters.openff_unconstrained_2_0_0('C').getMolecule()
protocol = BSS.Protocol.Production(report_interval=1)
process = BSS.Process.Amber(mol.toSystem(), protocol, exe=amber_exe)
process.start()
process.wait()
process.getRecords()

This gives {}.

Expected behavior: I should get a dictionary in the format of

{'NSTEP': ['0', '278800'],
 'TIME(PS)': ['0.000', '557.600'],
 'TEMP(K)': ['452.66', '293.49'],
 'PRESS': ['0.0', '0.0'],
 'ETOT': ['-14835.3346', '-14835.3346'],
 'EKTOT': ['5845.1152', '5845.1152'],
 'EPTOT': ['-20680.4498', '-20680.4498'],
 'BOND': ['3.1693', '3.1693'],
 'ANGLE': ['18.6614', '18.6614'],
 'DIHED': ['3.0760', '3.0760'],
 '1-4 NB': ['4.2537', '4.2537'],
 '1-4 EEL': ['-0.1536', '-0.1536'],
 'VDWAALS': ['3151.0853', '3151.0853'],
 'EELEC': ['-23860.5419', '-23860.5419'],
 'EHBOND': ['0.0000', '0.0000'],
 'RESTRAINT': ['0.0000', '0.0000'],
 'EKCMT': ['0.0000', '0.0000'],
 'VIRIAL': ['0.0000', '0.0000'],
 'VOLUME': ['64000.0000', '64000.0000'],
 'DENSITY': ['1.0122', '1.0122']}

where each entry contains runtime/timestep/report_interval records.
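For reference, the expected record count follows from simple arithmetic over the protocol settings. The sketch below assumes a 500 000-step run (e.g. 1 ns at a 2 fs timestep), which is inferred from the numbers quoted later in this thread, not confirmed from the BioSimSpace source:

```python
# Assumed total number of integration steps (1 ns run, 2 fs timestep);
# this value is inferred from the record counts reported in this thread.
n_steps = 500_000

def expected_records(report_interval, n_steps=n_steps):
    """Records expected when energies are written every
    report_interval steps, plus one for the initial step."""
    return n_steps // report_interval + 1

# With energies hard-coded to be written every 200 steps (the bug
# discussed below), we expect 2501 records regardless of the protocol:
print(expected_records(200))    # 2501
# With report_interval honoured at 2000:
print(expected_records(2000))   # 251
```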


lohedges commented 1 year ago

It looks like it's simply not updating the record dictionary before returning. Should be an easy fix. Will sort this tomorrow.

lohedges commented 1 year ago

This is actually working for me. Can you check whether your process returns an error? I assume that you are running this vacuum simulation with PMEMD, in which case the default config options in your sandpit might not be valid for vacuum. (I've mentioned this previously on Slack.)

xiki-tempula commented 1 year ago

I have tried using the native BSS and sander as well:

import BioSimSpace as BSS
mol = BSS.Parameters.openff_unconstrained_2_0_0('C').getMolecule()
protocol = BSS.Protocol.Production(report_interval=1)
process = BSS.Process.Amber(mol.toSystem(), protocol)
process.start()
process.wait()
records = process.getRecords()
len(records['RESTRAINT'])

This gives me 2501 results. If I change report_interval to 10, I would expect 10 times fewer samples:

import BioSimSpace as BSS
mol = BSS.Parameters.openff_unconstrained_2_0_0('C').getMolecule()
protocol = BSS.Protocol.Production(report_interval=10)
process = BSS.Process.Amber(mol.toSystem(), protocol)
process.start()
process.wait()
records = process.getRecords()
len(records['RESTRAINT'])

But I got 2499 records in this case.

xiki-tempula commented 1 year ago

I tried again and

import BioSimSpace as BSS
mol = BSS.Parameters.openff_unconstrained_2_0_0('C').getMolecule()
protocol = BSS.Protocol.Production(report_interval=10)
process = BSS.Process.Amber(mol.toSystem(), protocol)
process.start()
process.wait()
records = process.getRecords()
len(records['RESTRAINT'])

Gives me 2501 results this time.

lohedges commented 1 year ago

The watchdog used to handle the automatic file reading can be a little unreliable when the frequency of writes is very high. (It monitors changes to a file and updates the dictionary automatically.) However, changing the restart interval should change the number of records that you get. I'll investigate this.

xiki-tempula commented 1 year ago

Should this be report_interval in this case? For GROMACS, I pin the frequency of energy output to report_interval.

lohedges commented 1 year ago

This is because I've incorrectly updated the frequency of writing energies when adapting the configuration generator from your sandpit. Previously this was pinned to the report_interval for anything other than a Minimisation protocol. Now it is set to 200 regardless.

Will fix this now.

lohedges commented 1 year ago

This is the case in your sandpit too.

xiki-tempula commented 1 year ago

But this still doesn't explain why the number of records I get is non-deterministic. If I run the same command several times, I get a different result each time.

This is the case in your sandpit too.

If you look at my example, the problem still persists if I change to import BioSimSpace as BSS.

lohedges commented 1 year ago

The watchdog used to handle the automatic file reading can be a little unreliable when the frequency of writes is very large.

xiki-tempula commented 1 year ago

I have changed the line to protocol = BSS.Protocol.Production(report_interval=2000). In this case, I should get 250 samples:

protocol.getRunTime()/protocol.getTimeStep()/protocol.getReportInterval()
250.0

but I still got 2500 samples. I'm using import BioSimSpace as BSS.

lohedges commented 1 year ago

This is because I haven't fixed it yet. I'm doing that now :man_shrugging:

This is because I've incorrectly updated the frequency of writing energies when adapting the configuration generator from your sandpit. Previously this was pinned to the report_interval for anything other than a Minimisation protocol. Now it is set to 200 regardless. Will fix this now.

With my fix I see:

protocol.getRunTime()/protocol.getTimeStep()/protocol.getReportInterval()
250.0
records = process.getRecords()
len(records["ETOT"])
251

xiki-tempula commented 1 year ago

Ok, I thought the problem with report_interval was only in the sandpit.

lohedges commented 1 year ago

This is now merged. I've not updated the sandpit, since there might be a reason why you update at a fixed interval. Feel free to fix this in your own fork if needed.

xiki-tempula commented 1 year ago

Great thanks.

xiki-tempula commented 1 year ago

Is using watchdog to handle this functionality too risky? Would it be better to just parse the AMBER output file? I don't know how reproducible this is, but I can get a different number of values for different keys.

import BioSimSpace.Sandpit.Exscientia as BSS
mol = BSS.Parameters.gaff('C').getMolecule()
protocol = BSS.Protocol.Production()
process = BSS.Process.Amber(mol.toSystem(), protocol)
process.start()
process.wait()
data = process.getRecords()
[(key, len(value)) for key, value in data.items()]
[('NSTEP', 533),
 ('TIME(PS)', 533),
 ('TEMP(K)', 533),
 ('PRESS', 533),
 ('ETOT', 531),
 ('EKTOT', 531),
 ('EPTOT', 531),
 ('BOND', 531),
 ('ANGLE', 531),
 ('DIHED', 531),
 ('1-4 NB', 531),
 ('1-4 EEL', 531),
 ('VDWAALS', 531),
 ('EELEC', 531),
 ('EHBOND', 531),
 ('RESTRAINT', 531)]

Note NSTEP doesn't have the same length as the others.

lohedges commented 1 year ago

Yes, I think it should be possible to get the same information from the log file. (I think the same records are always present.) The reliability of the background watchdog has varied over time and, since most people have used this functionality interactively, I've not been too concerned. When originally writing the code (a long time ago) I experimented with different approaches for parsing the output files to see what was possible. The nice thing with this approach is that it's fully automated, although it relies on correct file event signals from the OS, which can be problematic when the file is being updated rapidly.

Internally, things that use this output, e.g. plotting tools, just truncate to the shortest record, i.e. if you plot a time series of some data and there is a mismatch between x and y, it will just use the shortest, like zip does. Another option could be to just truncate the records in a similar way within getRecords, but it would probably be best to ensure that they are all present. (Things will still be missing / mismatched if a user calls getRecords interactively, though, since not all records may have been written to the file for the current time step.)
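The truncate-to-shortest behaviour described above can be sketched with a hypothetical helper (this is an illustration of the idea, not the actual BioSimSpace implementation):

```python
def truncate_records(records):
    """Truncate every time series in a records dict to the shortest
    length, analogous to how zip stops at the shortest iterable."""
    if not records:
        return records
    n = min(len(values) for values in records.values())
    return {key: values[:n] for key, values in records.items()}

# NSTEP has one more entry than ETOT, so both are cut to length 2.
records = {"NSTEP": ["0", "200", "400"], "ETOT": ["-14835.3", "-14832.1"]}
print(truncate_records(records))
# {'NSTEP': ['0', '200'], 'ETOT': ['-14835.3', '-14832.1']}
```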

I could just try using the pygtail approach used in some of the other process classes, i.e. one that only reads new records written to the log file.
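The pygtail approach mentioned here keeps a persistent offset so that only lines appended since the last read are parsed. A minimal stand-in for the idea (pygtail itself persists the offset to a sidecar file; this simplified class just holds it in memory):

```python
import tempfile

class TailReader:
    """Return only lines appended since the previous call -- a
    simplified sketch of the offset tracking that pygtail does."""
    def __init__(self, path):
        self.path = path
        self.offset = 0

    def new_lines(self):
        with open(self.path) as fh:
            fh.seek(self.offset)
            lines = fh.readlines()
            self.offset = fh.tell()
        return lines

# Demo: only the freshly appended line is returned on the second read.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as fh:
    fh.write("NSTEP = 0\n")
    log = fh.name
reader = TailReader(log)
first = reader.new_lines()   # ['NSTEP = 0\n']
with open(log, "a") as fh:
    fh.write("NSTEP = 200\n")
second = reader.new_lines()  # ['NSTEP = 200\n']
```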

xiki-tempula commented 1 year ago

Thanks.

I could just try using the pygtail approach used in some of the other process classes, i.e. that only reads new records written to the log file.

I think this would be the best.

I still don't understand why there could be a mismatch. In this case, I call it after wait() to ensure that everything has finished.

lohedges commented 1 year ago

I still don't understand why there could be a mismatch. In this case, I call it after wait() to ensure that everything has finished.

Because the dictionary is being updated automatically in the background by watchdog while the simulation is running. This uses system file event triggers to know when to parse the mdinfo file and update the dictionary. The issue is that these triggers can be inconsistent on different operating systems (Windows in particular), so sometimes it might read twice, or not at all. At present it is set to trigger on any modification, not just a close-write. This was needed, since close-write isn't supported on Windows. I'll try making this conditional on Windows to see if it's more reliable. Here's the code.

For reference, I've run your example 10 times in a row and get the exact same number of outputs for all records each time, so it's clearly doing the correct job on Linux for me. I'll tell you what to try modifying when I've tested.

lohedges commented 1 year ago

It looks like watchdog now supports IN_CLOSE_WRITE. (It didn't use to, since it wasn't available on Windows.) As another workaround, I can just see if NSTEP has changed since the last read and skip if it hasn't.
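The NSTEP workaround described here could look something like the following (hypothetical names, not the actual BioSimSpace code): on each file event, skip the update unless NSTEP has advanced since the last parse.

```python
class RecordUpdater:
    """Guard against duplicate file-event triggers by only appending
    a new record when NSTEP has advanced since the last parse."""
    def __init__(self):
        self.records = {}
        self._last_nstep = None

    def update(self, parsed):
        # parsed: dict mapping record name to its latest value,
        # e.g. {"NSTEP": "200", "ETOT": "-14832.1", ...}
        nstep = parsed.get("NSTEP")
        if nstep == self._last_nstep:
            return  # spurious trigger: the file hasn't advanced
        self._last_nstep = nstep
        for key, value in parsed.items():
            self.records.setdefault(key, []).append(value)

updater = RecordUpdater()
updater.update({"NSTEP": "0", "ETOT": "-14835.3"})
updater.update({"NSTEP": "0", "ETOT": "-14835.3"})    # duplicate: ignored
updater.update({"NSTEP": "200", "ETOT": "-14832.1"})
print(updater.records["NSTEP"])  # ['0', '200']
```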

xiki-tempula commented 1 year ago

So I'm testing it on a macOS system. It's a bit annoying that you cannot reproduce it.

lohedges commented 1 year ago

I can reproduce on an old MacBook. Will try to figure out why some results from the header line are duplicated. We already check for repeated NSTEP entries. Also, IN_CLOSE_WRITE is still not supported. (The PR thread was misleading.) This won't help anyway, since AMBER opens and closes the file only once, flushing in between for new entries.

lohedges commented 1 year ago

I also get a different number of entries on macOS, irrespective of the difference in the total numbers for NSTEP, TIME(PS), and TEMP(K). For the others there are 1051 records (different to yours, so presumably you are using your internal sandpit), which is fewer than the 2501 on Linux.

I've checked the values for NSTEP and they are unique, but on macOS the spacing between successive records is inconsistent, so it is missing data. I'm not yet sure why this is. Will see if it's possible to fix, otherwise it probably makes sense to switch to parsing the log file instead. (The watchdog repo does have lots of cross-platform checks, so I'm surprised that a simple event trigger is failing like this.)
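The spacing check described above amounts to diffing successive NSTEP values; a constant gap means no records were dropped. A small illustrative helper (not the actual check used):

```python
def nstep_gaps(nsteps):
    """Return the spacing between successive NSTEP entries;
    any gap larger than the report interval indicates missing data."""
    values = [int(n) for n in nsteps]
    return [b - a for a, b in zip(values, values[1:])]

# The jump from 400 to 800 reveals a missed record at NSTEP 600.
print(nstep_gaps(["0", "200", "400", "800"]))  # [200, 200, 400]
```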

xiki-tempula commented 1 year ago

@lohedges Thanks. It will greatly help me to be able to read from the AMBER log. Then I could run the simulation on one AWS instance and analyse the data on another AWS instance.

lohedges commented 1 year ago

This is now implemented here. It should be merged later/tomorrow, or feel free to check out the branch locally and test yourself.

Cheers.

xiki-tempula commented 1 year ago

This is really fast! Thank you. Will test it once it is merged.

lohedges commented 1 year ago

I've just found that the record dictionary is wrong when position restraints are active, since sander (not sure about pmemd) injects an EAMBER (non-restraint) record, but not for each step that is logged. This is just the potential energy minus the restraint energy, which can easily be reconstructed from the other information.

For now I'll just ignore this record. It is fixed on my local feature branch.
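As noted above, the EAMBER (non-restraint) record is just the potential energy minus the restraint energy, so it can be rebuilt from the EPTOT and RESTRAINT entries. A sketch (illustrative helper name, not BioSimSpace API):

```python
def eamber(eptot, restraint):
    """Reconstruct the EAMBER (non-restraint) record:
    potential energy minus restraint energy."""
    return eptot - restraint

# With no restraints active, EAMBER equals EPTOT.
print(eamber(-20680.4498, 0.0))  # -20680.4498
```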