Closed mikemhenry closed 1 month ago
Hello @mikemhenry! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
industry_benchmarks/utils/results_cleanup.py
:Line 50:80: E501 line too long (85 > 79 characters) Line 65:80: E501 line too long (92 > 79 characters) Line 66:80: E501 line too long (89 > 79 characters) Line 67:80: E501 line too long (86 > 79 characters) Line 72:80: E501 line too long (92 > 79 characters) Line 75:80: E501 line too long (96 > 79 characters) Line 76:80: E501 line too long (92 > 79 characters) Line 81:80: E501 line too long (83 > 79 characters) Line 92:80: E501 line too long (93 > 79 characters) Line 123:80: E501 line too long (84 > 79 characters) Line 150:80: E501 line too long (83 > 79 characters) Line 165:80: E501 line too long (88 > 79 characters) Line 194:80: E501 line too long (89 > 79 characters) Line 243:80: E501 line too long (124 > 79 characters) Line 268:80: E501 line too long (83 > 79 characters) Line 296:80: E501 line too long (85 > 79 characters) Line 321:80: E501 line too long (81 > 79 characters) Line 328:80: E501 line too long (85 > 79 characters) Line 331:80: E501 line too long (83 > 79 characters) Line 340:80: E501 line too long (85 > 79 characters) Line 358:80: E501 line too long (81 > 79 characters) Line 366:80: E501 line too long (84 > 79 characters) Line 385:80: E501 line too long (85 > 79 characters) Line 386:80: E501 line too long (85 > 79 characters) Line 414:80: E501 line too long (81 > 79 characters) Line 422:80: E501 line too long (88 > 79 characters)
See issue #48; check if this PR will close that issue
Don't forget we need to extract out system information - number of atoms as a minimum
What else would you like to see and where do you want it saved? like a info.yaml
in the results directory, like next to the structural_analysis_data.npz
file?
I'm not sure if we can get the full PDB back out of the checkpoint NC files
I can check to see how data we are saving with
del results["protocol_result"]["data"][result_key][0]["inputs"]["stateA"]
del results["protocol_result"]["data"][result_key][0]["inputs"]["stateB"]
del results["protocol_result"]["data"][result_key][0]["inputs"]["ligandmapping"]
since it might not be worth it then (but do we need to reconstruct the inputs? with the benchmarking project, don't we have the inputs pretty well documented?
I noticed we save structural analysis as json: https://github.com/OpenFreeEnergy/openfe/blob/ecc8a50d07a35c42cd9ccb5e6724a85345a9873a/openfe/protocols/openmm_rfe/equil_rfe_methods.py#L1097
I am thinking we should remove the file, in my example it is 53M, but saved as a compressed numpy array it is 9.3M
Sorry for the delayed response @mikemhenry
What else would you like to see and where do you want it saved? like a info.yaml in the results directory, like next to the structural_analysis_data.npz file?
Yeah that sounds good - and from the looks of it, number of atoms is the only thing we can get out of the checkpoint. If you can read the ns/day or the walltime and dump it in the info file that would be great too.
By the way here's the code necessary to get the number of atoms out:
from openmmtools import multistate.
reporter = multistate.MultiStateReporter(storage='lig_12_lig_13_vacuum.nc', checkpoint_storage='lig_12_lig_13_vacuum_checkpoint.nc', open_mode='r')
print(reporter._storage_checkpoint.dimensions['atom'].size)
Output: 27
since it might not be worth it then (but do we need to reconstruct the inputs? with the benchmarking project, don't we have the inputs pretty well documented?
No those are fine to be deleted, that's not what I meant re: PDB file. I meant the fully solvated PDB, which we don't store anywhere as far as I'm aware (we store the states but not the coordinates).
Let's forget about this one.
No worries! And that all sounds good.
Yeah that sounds good - and from the looks of it, number of atoms is the only thing we can get out of the checkpoint. If you can read the ns/day or the walltime and dump it in the info file that would be great too.
that information lives in simulation_real_time_analysis.yaml
- iteration: 20000
percent_complete: 100.0
mbar_analysis:
free_energy_in_kT: 46.7725064187464
standard_error_in_kT: 0.05886804015168147
number_of_uncorrelated_samples: 3640.321044921875
n_equilibrium_iterations: 5001
statistical_inefficiency: 4.120790481567383
timing_data:
iteration_seconds: 5.147452116012573
average_seconds_per_iteration: 8.255591836331623
estimated_time_remaining: '0:00:08.255592'
estimated_localtime_finish_date: 2024-Mar-24-20:40:30
estimated_total_time: 1 day, 21:51:51.836727
ns_per_day: 230.24394103822635
Do you want me to yank out the last ns_per_day
entry and put it info.yaml
?
also the script now shrinks the result.json file (in my example) from 107M to 81K, and the results directory goes from 40GB to 97M
@mikemhenry
Do you want me to yank out the last ns_per_day entry and put it info.yaml?
Yes please!
also the script now shrinks the result.json file (in my example) from 107M to 81K, and the results directory goes from 40GB to 97M
Amazing - that's wonderful news!
example of info.yaml
$ cat lig_44_solvent_lig_45_solvent_solvent/shared_RelativeHybridTopologyProtocolUnit-586439187ca84b9aad6e9fb5a3cd04dc_attempt_0/info.yaml
n_atoms: 1255
ns_per_day: 1923.789875190101
(a round()
would make this look better (and be correct) but I figure that whoever wants to use this data rather have the raw (even if nonsensical) data)
@IAlibay ready for review!
fixes #48
I do need to test to make sure gather still works!
Add documentation or warning about running multiple instances at once
Add a note about running it as part of your submission script
Test what happens if you point it at an input json. Find a unique input key to test for.
That is a good point, I should summarize what gets removed and what gets trimmed down.
@hannahbaumann @IAlibay this is ready for re-review!
Thanks @mikemhenry, will try to review tomorrow ahead of our meeting.
Input submission checklist
input_structures/prepared_structures/<set_name>/<system_name>
For each submitted directory:
PREPARATION_DETAILS.md
fileprotein.pdb
cofactors.sdf
ligands.sdf
and removed any duplicate binding modes or protonation states, keeping the more favorable stateSummary of changes
Licensing agreement