choderalab / perses

Experiments with expanded ensembles to explore chemical space
http://perses.readthedocs.io
MIT License

need a data model that contains essential information of a calculation #1171

Open JenkeScheen opened 1 year ago

JenkeScheen commented 1 year ago

When running perses on a protein-ligand system we've noticed that the serialized hybrid factory for each edge is huge:

166M    P0240-His41(0)-Cys145(0)-His163(0)-3-5/out-hybrid_factory.npy.npz

Instead of just serializing out this object we need to come up with a data model that contains the essential information associated with a calculation.

Settings for the above calculation:

n_cycles: 5000
n_steps_per_move_application: 500
n_states: 18
timestep: 2 fs

This is with perses 0.10.1 (build pyha21a80b_1).
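(For scale: these settings work out to about 5 ns of sampling per replica, assuming one move application of n_steps_per_move_application MD steps per cycle. A rough sketch of that arithmetic:)

```python
# Rough sanity check of the sampling length implied by the settings above
# (assumes one MD move application of n_steps_per_move_application steps per cycle).
n_cycles = 5000
n_steps_per_move_application = 500
timestep_fs = 2

ns_per_replica = n_cycles * n_steps_per_move_application * timestep_fs * 1e-6  # fs -> ns
print(f"{ns_per_replica:.1f} ns per replica")  # 5.0 ns
```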

ijpulidos commented 1 year ago

@JenkeScheen The biggest chunk is probably the nc files; we would benefit from having the header information for these (the output of ncdump -h). Can you provide an example for a large nc file?
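(If ncdump isn't at hand, a rough Python equivalent of ncdump -h can be put together with the netCDF4 package; the sketch below is just an illustration, and the file path is a placeholder.)

```python
# Print a header-style summary (dimensions, variables, global attributes, groups)
# of a multistate .nc file, roughly what `ncdump -h` would show.
import netCDF4

with netCDF4.Dataset("out-complex.nc", "r") as ds:  # placeholder path
    print("dimensions:")
    for name, dim in ds.dimensions.items():
        size = "UNLIMITED" if dim.isunlimited() else len(dim)
        print(f"    {name} = {size}")
    print("variables:")
    for name, var in ds.variables.items():
        print(f"    {var.dtype} {name}{var.dimensions}")
    print("global attributes:")
    for attr in ds.ncattrs():
        print(f"    {attr} = {ds.getncattr(attr)!r}")
    print("groups:", ", ".join(ds.groups))
```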

JenkeScheen commented 1 year ago

Here's the header for an out-complex.nc of 7.7G at /lila/data/chodera/asap-datasets/prospective/2023/03_mers_retro/3_afe_calcs/fauxalysis/perses/P0240-His41(0)-Cys145(0)-His163(0)-1-3:

netcdf out-complex {
dimensions:
    scalar = 1 ;
    iteration = UNLIMITED ; // (4807 currently)
    spatial = 3 ;
    analysis_particles = 4784 ;
    fixedL2590814 = 2590814 ;
    fixedL1052 = 1052 ;
    fixedL1047 = 1047 ;
    fixedL1040 = 1040 ;
    fixedL1045 = 1045 ;
    fixedL1042 = 1042 ;
    fixedL935 = 935 ;
    fixedL1311858 = 1311858 ;
    fixedL1311862 = 1311862 ;
    fixedL3 = 3 ;
    atom = 4784 ;
    replica = 18 ;
    state = 18 ;
    unsampled = 2 ;
variables:
    int64 last_iteration(scalar) ;
    int64 analysis_particle_indices(analysis_particles) ;
        analysis_particle_indices:long_name = "analysis_particle_indices[analysis_particles] is the indices of the particles with extra information stored about them in theanalysis file." ;
    string options(scalar) ;
    char metadata(fixedL3) ;
    float positions(iteration, replica, atom, spatial) ;
        positions:units = "nm" ;
        positions:long_name = "positions[iteration][replica][atom][spatial] is position of coordinate \'spatial\' of atom \'atom\' from replica \'replica\' for iteration \'iteration\'." ;
    float velocities(iteration, replica, atom, spatial) ;
        velocities:units = "nm / ps" ;
        velocities:long_name = "velocities[iteration][replica][atom][spatial] is velocity of coordinate \'spatial\' of atom \'atom\' from replica \'replica\' for iteration \'iteration\'." ;
    float box_vectors(iteration, replica, spatial, spatial) ;
        box_vectors:units = "nm" ;
        box_vectors:long_name = "box_vectors[iteration][replica][i][j] is dimension j of box vector i for replica \'replica\' from iteration \'iteration-1\'." ;
    double volumes(iteration, replica) ;
        volumes:units = "nm**3" ;
        volumes:long_name = "volume[iteration][replica] is the box volume for replica \'replica\' from iteration \'iteration-1\'." ;
    int states(iteration, replica) ;
        states:units = "none" ;
        states:long_name = "states[iteration][replica] is the thermodynamic state index (0..n_states-1) of replica \'replica\' of iteration \'iteration\'." ;
    double energies(iteration, replica, state) ;
        energies:units = "kT" ;
        energies:long_name = "energies[iteration][replica][state] is the reduced (unitless) energy of replica \'replica\' from iteration \'iteration\' evaluated at the thermodynamic state \'state\'." ;
    byte neighborhoods(iteration, replica, state) ;
        neighborhoods:_FillValue = 1b ;
        neighborhoods:long_name = "neighborhoods[iteration][replica][state] is 1 if this energy was computed during this iteration." ;
    double unsampled_energies(iteration, replica, unsampled) ;
        unsampled_energies:units = "kT" ;
        unsampled_energies:long_name = "unsampled_energies[iteration][replica][state] is the reduced (unitless) energy of replica \'replica\' from iteration \'iteration\' evaluated at unsampled thermodynamic state \'state\'." ;
    int accepted(iteration, state, state) ;
        accepted:units = "none" ;
        accepted:long_name = "accepted[iteration][i][j] is the number of proposed transitions between states i and j from iteration \'iteration-1\'." ;
    int proposed(iteration, state, state) ;
        proposed:units = "none" ;
        proposed:long_name = "proposed[iteration][i][j] is the number of proposed transitions between states i and j from iteration \'iteration-1\'." ;
    string timestamp(iteration) ;

// global attributes:
        :UUID = "148f4bdc-0dce-40ae-b09c-a491c679d4fc" ;
        :application = "YANK" ;
        :program = "yank.py" ;
        :programVersion = "0.21.5" ;
        :Conventions = "ReplicaExchange" ;
        :ConventionVersion = "0.2" ;
        :DataUsedFor = "analysis" ;
        :CheckpointInterval = 250LL ;
        :title = "Replica-exchange sampler simulation created using ReplicaExchangeSampler class of openmmtools.multistate on Sat Mar 18 01:38:28 2023" ;

group: thermodynamic_states {
  variables:
    char state0(fixedL2590814) ;
    char state1(fixedL1052) ;
    char state2(fixedL1047) ;
    char state3(fixedL1040) ;
    char state4(fixedL1047) ;
    char state5(fixedL1045) ;
    char state6(fixedL1040) ;
    char state7(fixedL1040) ;
    char state8(fixedL1045) ;
    char state9(fixedL1042) ;
    char state10(fixedL1042) ;
    char state11(fixedL1040) ;
    char state12(fixedL1040) ;
    char state13(fixedL1040) ;
    char state14(fixedL1040) ;
    char state15(fixedL1040) ;
    char state16(fixedL1040) ;
    char state17(fixedL935) ;
  } // group thermodynamic_states

group: unsampled_states {
  variables:
    char state0(fixedL1311858) ;
    char state1(fixedL1311862) ;
  } // group unsampled_states

group: mcmc_moves {
  variables:
    string move0(scalar) ;
    string move1(scalar) ;
    string move2(scalar) ;
    string move3(scalar) ;
    string move4(scalar) ;
    string move5(scalar) ;
    string move6(scalar) ;
    string move7(scalar) ;
    string move8(scalar) ;
    string move9(scalar) ;
    string move10(scalar) ;
    string move11(scalar) ;
    string move12(scalar) ;
    string move13(scalar) ;
    string move14(scalar) ;
    string move15(scalar) ;
    string move16(scalar) ;
    string move17(scalar) ;
  } // group mcmc_moves

group: online_analysis {
  dimensions:
    dim_size18 = 18 ;
    dim_size2 = 2 ;
    dim_size20 = 20 ;
  variables:
    double f_k(dim_size18) ;
    double free_energy(dim_size2) ;
    double f_k_history(iteration, dim_size18) ;
    double free_energy_history(iteration, dim_size2) ;
    double f_k_offline(dim_size20) ;
    double f_k_offline_history(iteration, dim_size20) ;
  } // group online_analysis
}
ijpulidos commented 1 year ago

@JenkeScheen I don't see anything terribly wrong with it, ~only that you might want to review the checkpoint interval; I can see it is set to 50, which might be a bit too frequent? We commonly use 250 for our benchmarks, which also run 5 ns/replica~.

@jchodera maybe you can spot something else here?

EDIT: Check next comment.

ijpulidos commented 1 year ago

@JenkeScheen Oh, actually, never mind that previous comment; I mixed up the files, so I am actually using 50 and you are using 250. That should be okay. Sorry for the noise.

ijpulidos commented 1 year ago

@JenkeScheen I was thinking about this again, and I think it makes sense that you have such big nc files, at least compared to what we commonly get when running benchmarks.

I don't know exactly how the information is stored in the netCDF format, but I'm going to guess that the fundamental types are standard IEEE-754 C types (that is, float is a 32-bit data type, for practical purposes). Here is a quick comparison:

| System | atoms | iterations | replicas | spatial | GBytes of info | nc file size |
|--------|-------|------------|----------|---------|----------------|--------------|
| Jenke  | 4784  | 4807       | 18       | 3       | 9.93           | 7.7G         |
| tyk2   | 4783  | 2238       | 12       | 3       | 3.08           | 2.4G         |

If we compute the ratio of the numbers in the GBytes of info column, we get 9.93/3.08 ≈ 3.22, which is very close to the ratio of the nc file sizes, 7.7/2.4 ≈ 3.21. To me this means that you are not storing any extra or undesired data compared to what we already store in the benchmarks. I hope this makes sense and helps.
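(For what it's worth, the GBytes numbers above appear to correspond to just the positions and velocities arrays counted as float32; a minimal sketch of that arithmetic, assuming 4 bytes per element and 1 GB = 1e9 bytes:)

```python
# Estimated raw size of the positions + velocities arrays, assuming float32
# elements; the other variables (energies, box vectors, states, ...) are
# comparatively tiny.
def traj_gbytes(iterations, replicas, atoms, spatial=3, bytes_per_elem=4):
    per_array = iterations * replicas * atoms * spatial * bytes_per_elem
    return 2 * per_array / 1e9  # positions + velocities

print(f"{traj_gbytes(4807, 18, 4784):.2f}")  # Jenke's system -> 9.93
print(f"{traj_gbytes(2238, 12, 4783):.2f}")  # tyk2 benchmark -> 3.08
```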


JenkeScheen commented 1 year ago

Thanks @ijpulidos. IIRC the .nc files are used for calculating energies - is the entire file needed for that, or would a truncated file be enough? I've dealt with similar file types before in other FEC codes, but they never really exceeded a few MB.
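(Side note: netCDF supports partial reads, so pulling just the energies for analysis does not require loading the large position/velocity blocks into memory. A small illustration with netCDF4; the path is a placeholder.)

```python
# Read only the reduced potentials used for free energy analysis; netCDF
# reads are lazy, so positions/velocities stay on disk.
import netCDF4

with netCDF4.Dataset("out-complex.nc", "r") as ds:  # placeholder path
    energies = ds.variables["energies"][:]  # (iteration, replica, state), in kT
    print(energies.shape)
```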

ijpulidos commented 1 year ago

From discussions in our dev syncs, what we want to do here for now is change the default to NOT store any special atom indices in the analysis .nc files; this should lower the size of the output by a considerable amount. This is done in the changes in #1185
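(For context, the relevant knob is the analysis_particle_indices argument of the openmmtools MultiStateReporter; the sketch below only illustrates storing no special atom subset, and is not the exact change made in #1185.)

```python
# Reporter configured to store no extra per-atom "analysis" data in the main
# .nc file; full coordinates are still written to the separate checkpoint file.
from openmmtools.multistate import MultiStateReporter

reporter = MultiStateReporter(
    "out-complex.nc",              # placeholder path
    checkpoint_interval=250,       # as in the header above
    analysis_particle_indices=(),  # no per-iteration atom subset stored
)
```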