Draft cleanup results script

mikemhenry commented 3 months ago

Input submission checklist

[ ] created a directory for each system based on the template directory
[ ] named and placed the directory with the following pattern: input_structures/prepared_structures/<set_name>/<system_name>

For each submitted directory:

[ ] filled in system preparation details in the PREPARATION_DETAILS.md file
[ ] added a PDB file with the protein named protein.pdb
[ ] removed any cofactors from the PDB file and placed them in cofactors.sdf
[ ] kept cystallographic waters and metals in the PDB file
[ ] tested the PDB and, if available, cofactors.sdf file using the input validation script
[ ] copied over the ligands SDF file to a file named ligands.sdf and removed any duplicate binding modes or protonation states, keeping the more favorable state

Summary of changes

Licensing agreement

[x] I agree to release these inputs under the combined MIT and CCBY SA 4.0 licenses of this repository

pep8speaks commented 3 months ago

Hello @mikemhenry! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file industry_benchmarks/utils/results_cleanup.py:

Line 50:80: E501 line too long (85 > 79 characters) Line 65:80: E501 line too long (92 > 79 characters) Line 66:80: E501 line too long (89 > 79 characters) Line 67:80: E501 line too long (86 > 79 characters) Line 72:80: E501 line too long (92 > 79 characters) Line 75:80: E501 line too long (96 > 79 characters) Line 76:80: E501 line too long (92 > 79 characters) Line 81:80: E501 line too long (83 > 79 characters) Line 92:80: E501 line too long (93 > 79 characters) Line 123:80: E501 line too long (84 > 79 characters) Line 150:80: E501 line too long (83 > 79 characters) Line 165:80: E501 line too long (88 > 79 characters) Line 194:80: E501 line too long (89 > 79 characters) Line 243:80: E501 line too long (124 > 79 characters) Line 268:80: E501 line too long (83 > 79 characters) Line 296:80: E501 line too long (85 > 79 characters) Line 321:80: E501 line too long (81 > 79 characters) Line 328:80: E501 line too long (85 > 79 characters) Line 331:80: E501 line too long (83 > 79 characters) Line 340:80: E501 line too long (85 > 79 characters) Line 358:80: E501 line too long (81 > 79 characters) Line 366:80: E501 line too long (84 > 79 characters) Line 385:80: E501 line too long (85 > 79 characters) Line 386:80: E501 line too long (85 > 79 characters) Line 414:80: E501 line too long (81 > 79 characters) Line 422:80: E501 line too long (88 > 79 characters)

Comment last updated at 2024-08-12 16:53:47 UTC

jameseastwood commented 3 months ago

See issue #48; check if this PR will close that issue

mikemhenry commented 2 months ago

Don't forget we need to extract out system information - number of atoms as a minimum

What else would you like to see and where do you want it saved? like a info.yaml in the results directory, like next to the structural_analysis_data.npz file?

I'm not sure if we can get the full PDB back out of the checkpoint NC files

I can check to see how data we are saving with

del results["protocol_result"]["data"][result_key][0]["inputs"]["stateA"]
del results["protocol_result"]["data"][result_key][0]["inputs"]["stateB"]
del results["protocol_result"]["data"][result_key][0]["inputs"]["ligandmapping"]

since it might not be worth it then (but do we need to reconstruct the inputs? with the benchmarking project, don't we have the inputs pretty well documented?

mikemhenry commented 2 months ago

I noticed we save structural analysis as json: https://github.com/OpenFreeEnergy/openfe/blob/ecc8a50d07a35c42cd9ccb5e6724a85345a9873a/openfe/protocols/openmm_rfe/equil_rfe_methods.py#L1097

I am thinking we should remove the file, in my example it is 53M, but saved as a compressed numpy array it is 9.3M

IAlibay commented 2 months ago

Sorry for the delayed response @mikemhenry

What else would you like to see and where do you want it saved? like a info.yaml in the results directory, like next to the structural_analysis_data.npz file?

Yeah that sounds good - and from the looks of it, number of atoms is the only thing we can get out of the checkpoint. If you can read the ns/day or the walltime and dump it in the info file that would be great too.

By the way here's the code necessary to get the number of atoms out:

from openmmtools import multistate.
reporter = multistate.MultiStateReporter(storage='lig_12_lig_13_vacuum.nc', checkpoint_storage='lig_12_lig_13_vacuum_checkpoint.nc', open_mode='r')

print(reporter._storage_checkpoint.dimensions['atom'].size)

Output: 27

since it might not be worth it then (but do we need to reconstruct the inputs? with the benchmarking project, don't we have the inputs pretty well documented?

No those are fine to be deleted, that's not what I meant re: PDB file. I meant the fully solvated PDB, which we don't store anywhere as far as I'm aware (we store the states but not the coordinates).

Let's forget about this one.

mikemhenry commented 2 months ago

No worries! And that all sounds good.

Yeah that sounds good - and from the looks of it, number of atoms is the only thing we can get out of the checkpoint. If you can read the ns/day or the walltime and dump it in the info file that would be great too.

that information lives in simulation_real_time_analysis.yaml

- iteration: 20000
  percent_complete: 100.0
  mbar_analysis:
    free_energy_in_kT: 46.7725064187464
    standard_error_in_kT: 0.05886804015168147
    number_of_uncorrelated_samples: 3640.321044921875
    n_equilibrium_iterations: 5001
    statistical_inefficiency: 4.120790481567383
  timing_data:
    iteration_seconds: 5.147452116012573
    average_seconds_per_iteration: 8.255591836331623
    estimated_time_remaining: '0:00:08.255592'
    estimated_localtime_finish_date: 2024-Mar-24-20:40:30
    estimated_total_time: 1 day, 21:51:51.836727
    ns_per_day: 230.24394103822635

Do you want me to yank out the last ns_per_day entry and put it info.yaml?

also the script now shrinks the result.json file (in my example) from 107M to 81K, and the results directory goes from 40GB to 97M

IAlibay commented 2 months ago

@mikemhenry

Do you want me to yank out the last ns_per_day entry and put it info.yaml?

Yes please!

also the script now shrinks the result.json file (in my example) from 107M to 81K, and the results directory goes from 40GB to 97M

Amazing - that's wonderful news!

mikemhenry commented 2 months ago

example of info.yaml

$ cat lig_44_solvent_lig_45_solvent_solvent/shared_RelativeHybridTopologyProtocolUnit-586439187ca84b9aad6e9fb5a3cd04dc_attempt_0/info.yaml 
n_atoms: 1255
ns_per_day: 1923.789875190101

(a round() would make this look better (and be correct) but I figure that whoever wants to use this data rather have the raw (even if nonsensical) data)

mikemhenry commented 2 months ago

@IAlibay ready for review!

mikemhenry commented 2 months ago

fixes #48

mikemhenry commented 2 months ago

I do need to test to make sure gather still works!

jameseastwood commented 2 months ago

Add documentation or warning about running multiple instances at once

jameseastwood commented 2 months ago

Add a note about running it as part of your submission script

jameseastwood commented 2 months ago

Test what happens if you point it at an input json. Find a unique input key to test for.

mikemhenry commented 1 month ago

That is a good point, I should summarize what gets removed and what gets trimmed down.

mikemhenry commented 1 month ago

@hannahbaumann @IAlibay this is ready for re-review!

IAlibay commented 1 month ago

Thanks @mikemhenry, will try to review tomorrow ahead of our meeting.

OpenFreeEnergy / IndustryBenchmarks2024