OpenFreeEnergy / IndustryBenchmarks2024

Repository for the 2024 OpenFE industry benchmark efforts
https://industrybenchmarks2024.readthedocs.io/en/latest/
MIT License
8 stars 18 forks source link

Draft cleanup results script #83

Closed mikemhenry closed 1 month ago

mikemhenry commented 3 months ago

Input submission checklist

For each submitted directory:

Summary of changes

Licensing agreement

pep8speaks commented 3 months ago

Hello @mikemhenry! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 50:80: E501 line too long (85 > 79 characters) Line 65:80: E501 line too long (92 > 79 characters) Line 66:80: E501 line too long (89 > 79 characters) Line 67:80: E501 line too long (86 > 79 characters) Line 72:80: E501 line too long (92 > 79 characters) Line 75:80: E501 line too long (96 > 79 characters) Line 76:80: E501 line too long (92 > 79 characters) Line 81:80: E501 line too long (83 > 79 characters) Line 92:80: E501 line too long (93 > 79 characters) Line 123:80: E501 line too long (84 > 79 characters) Line 150:80: E501 line too long (83 > 79 characters) Line 165:80: E501 line too long (88 > 79 characters) Line 194:80: E501 line too long (89 > 79 characters) Line 243:80: E501 line too long (124 > 79 characters) Line 268:80: E501 line too long (83 > 79 characters) Line 296:80: E501 line too long (85 > 79 characters) Line 321:80: E501 line too long (81 > 79 characters) Line 328:80: E501 line too long (85 > 79 characters) Line 331:80: E501 line too long (83 > 79 characters) Line 340:80: E501 line too long (85 > 79 characters) Line 358:80: E501 line too long (81 > 79 characters) Line 366:80: E501 line too long (84 > 79 characters) Line 385:80: E501 line too long (85 > 79 characters) Line 386:80: E501 line too long (85 > 79 characters) Line 414:80: E501 line too long (81 > 79 characters) Line 422:80: E501 line too long (88 > 79 characters)

Comment last updated at 2024-08-12 16:53:47 UTC
jameseastwood commented 3 months ago

See issue #48; check if this PR will close that issue

mikemhenry commented 2 months ago

Don't forget we need to extract out system information - number of atoms as a minimum

What else would you like to see and where do you want it saved? like a info.yaml in the results directory, like next to the structural_analysis_data.npz file?

I'm not sure if we can get the full PDB back out of the checkpoint NC files

I can check to see how data we are saving with

del results["protocol_result"]["data"][result_key][0]["inputs"]["stateA"]
del results["protocol_result"]["data"][result_key][0]["inputs"]["stateB"]
del results["protocol_result"]["data"][result_key][0]["inputs"]["ligandmapping"]

since it might not be worth it then (but do we need to reconstruct the inputs? with the benchmarking project, don't we have the inputs pretty well documented?

mikemhenry commented 2 months ago

I noticed we save structural analysis as json: https://github.com/OpenFreeEnergy/openfe/blob/ecc8a50d07a35c42cd9ccb5e6724a85345a9873a/openfe/protocols/openmm_rfe/equil_rfe_methods.py#L1097

I am thinking we should remove the file, in my example it is 53M, but saved as a compressed numpy array it is 9.3M

IAlibay commented 2 months ago

Sorry for the delayed response @mikemhenry

What else would you like to see and where do you want it saved? like a info.yaml in the results directory, like next to the structural_analysis_data.npz file?

Yeah that sounds good - and from the looks of it, number of atoms is the only thing we can get out of the checkpoint. If you can read the ns/day or the walltime and dump it in the info file that would be great too.

By the way here's the code necessary to get the number of atoms out:

from openmmtools import multistate.
reporter = multistate.MultiStateReporter(storage='lig_12_lig_13_vacuum.nc', checkpoint_storage='lig_12_lig_13_vacuum_checkpoint.nc', open_mode='r')

print(reporter._storage_checkpoint.dimensions['atom'].size)

Output: 27

since it might not be worth it then (but do we need to reconstruct the inputs? with the benchmarking project, don't we have the inputs pretty well documented?

No those are fine to be deleted, that's not what I meant re: PDB file. I meant the fully solvated PDB, which we don't store anywhere as far as I'm aware (we store the states but not the coordinates).

Let's forget about this one.

mikemhenry commented 2 months ago

No worries! And that all sounds good.

Yeah that sounds good - and from the looks of it, number of atoms is the only thing we can get out of the checkpoint. If you can read the ns/day or the walltime and dump it in the info file that would be great too.

that information lives in simulation_real_time_analysis.yaml

- iteration: 20000
  percent_complete: 100.0
  mbar_analysis:
    free_energy_in_kT: 46.7725064187464
    standard_error_in_kT: 0.05886804015168147
    number_of_uncorrelated_samples: 3640.321044921875
    n_equilibrium_iterations: 5001
    statistical_inefficiency: 4.120790481567383
  timing_data:
    iteration_seconds: 5.147452116012573
    average_seconds_per_iteration: 8.255591836331623
    estimated_time_remaining: '0:00:08.255592'
    estimated_localtime_finish_date: 2024-Mar-24-20:40:30
    estimated_total_time: 1 day, 21:51:51.836727
    ns_per_day: 230.24394103822635

Do you want me to yank out the last ns_per_day entry and put it info.yaml?

also the script now shrinks the result.json file (in my example) from 107M to 81K, and the results directory goes from 40GB to 97M

IAlibay commented 2 months ago

@mikemhenry

Do you want me to yank out the last ns_per_day entry and put it info.yaml?

Yes please!

also the script now shrinks the result.json file (in my example) from 107M to 81K, and the results directory goes from 40GB to 97M

Amazing - that's wonderful news!

mikemhenry commented 2 months ago

example of info.yaml

$ cat lig_44_solvent_lig_45_solvent_solvent/shared_RelativeHybridTopologyProtocolUnit-586439187ca84b9aad6e9fb5a3cd04dc_attempt_0/info.yaml 
n_atoms: 1255
ns_per_day: 1923.789875190101

(a round() would make this look better (and be correct) but I figure that whoever wants to use this data rather have the raw (even if nonsensical) data)

mikemhenry commented 2 months ago

@IAlibay ready for review!

mikemhenry commented 2 months ago

fixes #48

mikemhenry commented 2 months ago

I do need to test to make sure gather still works!

jameseastwood commented 2 months ago

Add documentation or warning about running multiple instances at once

jameseastwood commented 2 months ago

Add a note about running it as part of your submission script

jameseastwood commented 2 months ago

Test what happens if you point it at an input json. Find a unique input key to test for.

mikemhenry commented 1 month ago

That is a good point, I should summarize what gets removed and what gets trimmed down.

mikemhenry commented 1 month ago

@hannahbaumann @IAlibay this is ready for re-review!

IAlibay commented 1 month ago

Thanks @mikemhenry, will try to review tomorrow ahead of our meeting.