JenkeScheen opened 1 year ago
@JenkeScheen The biggest chunk is probably the nc files; we would benefit from having the header information of these (output of `ncdump -h`). Can you provide an example of it for a large nc file?
Here's the header for an `out-complex.nc` of 7.7G at `/lila/data/chodera/asap-datasets/prospective/2023/03_mers_retro/3_afe_calcs/fauxalysis/perses/P0240-His41(0)-Cys145(0)-His163(0)-1-3`:
```
netcdf out-complex {
dimensions:
	scalar = 1 ;
	iteration = UNLIMITED ; // (4807 currently)
	spatial = 3 ;
	analysis_particles = 4784 ;
	fixedL2590814 = 2590814 ;
	fixedL1052 = 1052 ;
	fixedL1047 = 1047 ;
	fixedL1040 = 1040 ;
	fixedL1045 = 1045 ;
	fixedL1042 = 1042 ;
	fixedL935 = 935 ;
	fixedL1311858 = 1311858 ;
	fixedL1311862 = 1311862 ;
	fixedL3 = 3 ;
	atom = 4784 ;
	replica = 18 ;
	state = 18 ;
	unsampled = 2 ;
variables:
	int64 last_iteration(scalar) ;
	int64 analysis_particle_indices(analysis_particles) ;
		analysis_particle_indices:long_name = "analysis_particle_indices[analysis_particles] is the indices of the particles with extra information stored about them in theanalysis file." ;
	string options(scalar) ;
	char metadata(fixedL3) ;
	float positions(iteration, replica, atom, spatial) ;
		positions:units = "nm" ;
		positions:long_name = "positions[iteration][replica][atom][spatial] is position of coordinate \'spatial\' of atom \'atom\' from replica \'replica\' for iteration \'iteration\'." ;
	float velocities(iteration, replica, atom, spatial) ;
		velocities:units = "nm / ps" ;
		velocities:long_name = "velocities[iteration][replica][atom][spatial] is velocity of coordinate \'spatial\' of atom \'atom\' from replica \'replica\' for iteration \'iteration\'." ;
	float box_vectors(iteration, replica, spatial, spatial) ;
		box_vectors:units = "nm" ;
		box_vectors:long_name = "box_vectors[iteration][replica][i][j] is dimension j of box vector i for replica \'replica\' from iteration \'iteration-1\'." ;
	double volumes(iteration, replica) ;
		volumes:units = "nm**3" ;
		volumes:long_name = "volume[iteration][replica] is the box volume for replica \'replica\' from iteration \'iteration-1\'." ;
	int states(iteration, replica) ;
		states:units = "none" ;
		states:long_name = "states[iteration][replica] is the thermodynamic state index (0..n_states-1) of replica \'replica\' of iteration \'iteration\'." ;
	double energies(iteration, replica, state) ;
		energies:units = "kT" ;
		energies:long_name = "energies[iteration][replica][state] is the reduced (unitless) energy of replica \'replica\' from iteration \'iteration\' evaluated at the thermodynamic state \'state\'." ;
	byte neighborhoods(iteration, replica, state) ;
		neighborhoods:_FillValue = 1b ;
		neighborhoods:long_name = "neighborhoods[iteration][replica][state] is 1 if this energy was computed during this iteration." ;
	double unsampled_energies(iteration, replica, unsampled) ;
		unsampled_energies:units = "kT" ;
		unsampled_energies:long_name = "unsampled_energies[iteration][replica][state] is the reduced (unitless) energy of replica \'replica\' from iteration \'iteration\' evaluated at unsampled thermodynamic state \'state\'." ;
	int accepted(iteration, state, state) ;
		accepted:units = "none" ;
		accepted:long_name = "accepted[iteration][i][j] is the number of proposed transitions between states i and j from iteration \'iteration-1\'." ;
	int proposed(iteration, state, state) ;
		proposed:units = "none" ;
		proposed:long_name = "proposed[iteration][i][j] is the number of proposed transitions between states i and j from iteration \'iteration-1\'." ;
	string timestamp(iteration) ;

// global attributes:
		:UUID = "148f4bdc-0dce-40ae-b09c-a491c679d4fc" ;
		:application = "YANK" ;
		:program = "yank.py" ;
		:programVersion = "0.21.5" ;
		:Conventions = "ReplicaExchange" ;
		:ConventionVersion = "0.2" ;
		:DataUsedFor = "analysis" ;
		:CheckpointInterval = 250LL ;
		:title = "Replica-exchange sampler simulation created using ReplicaExchangeSampler class of openmmtools.multistate on Sat Mar 18 01:38:28 2023" ;

group: thermodynamic_states {
  variables:
	char state0(fixedL2590814) ;
	char state1(fixedL1052) ;
	char state2(fixedL1047) ;
	char state3(fixedL1040) ;
	char state4(fixedL1047) ;
	char state5(fixedL1045) ;
	char state6(fixedL1040) ;
	char state7(fixedL1040) ;
	char state8(fixedL1045) ;
	char state9(fixedL1042) ;
	char state10(fixedL1042) ;
	char state11(fixedL1040) ;
	char state12(fixedL1040) ;
	char state13(fixedL1040) ;
	char state14(fixedL1040) ;
	char state15(fixedL1040) ;
	char state16(fixedL1040) ;
	char state17(fixedL935) ;
  } // group thermodynamic_states

group: unsampled_states {
  variables:
	char state0(fixedL1311858) ;
	char state1(fixedL1311862) ;
  } // group unsampled_states

group: mcmc_moves {
  variables:
	string move0(scalar) ;
	string move1(scalar) ;
	string move2(scalar) ;
	string move3(scalar) ;
	string move4(scalar) ;
	string move5(scalar) ;
	string move6(scalar) ;
	string move7(scalar) ;
	string move8(scalar) ;
	string move9(scalar) ;
	string move10(scalar) ;
	string move11(scalar) ;
	string move12(scalar) ;
	string move13(scalar) ;
	string move14(scalar) ;
	string move15(scalar) ;
	string move16(scalar) ;
	string move17(scalar) ;
  } // group mcmc_moves

group: online_analysis {
  dimensions:
	dim_size18 = 18 ;
	dim_size2 = 2 ;
	dim_size20 = 20 ;
  variables:
	double f_k(dim_size18) ;
	double free_energy(dim_size2) ;
	double f_k_history(iteration, dim_size18) ;
	double free_energy_history(iteration, dim_size2) ;
	double f_k_offline(dim_size20) ;
	double f_k_offline_history(iteration, dim_size20) ;
  } // group online_analysis
}
```
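As a rough sanity check on where the bytes go, the dominant arrays can be sized directly from the dimensions in the header above. This is a pure-arithmetic sketch assuming dense, uncompressed storage (netCDF-4/HDF5 chunking or compression can bring the on-disk size below the raw estimate):

```python
# Per-variable storage estimate from the dimensions in the ncdump header above.
# Assumes dense, uncompressed storage of each array.
iteration, replica, atom, spatial, state = 4807, 18, 4784, 3, 18
FLOAT32, FLOAT64 = 4, 8  # bytes per element

sizes_gb = {
    "positions (float32)":  iteration * replica * atom * spatial * FLOAT32 / 1e9,
    "velocities (float32)": iteration * replica * atom * spatial * FLOAT32 / 1e9,
    "energies (float64)":   iteration * replica * state * FLOAT64 / 1e9,
}
for name, gb in sorted(sizes_gb.items(), key=lambda kv: -kv[1]):
    print(f"{name:<22} ~{gb:6.2f} GB")
# positions and velocities dominate (~4.97 GB each); energies are negligible (~0.01 GB)
```

Everything else in the file (box vectors, states, acceptance matrices, serialized thermodynamic states) is orders of magnitude smaller.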
@JenkeScheen I don't see anything terribly wrong with it, ~only that you might want to review the checkpoint interval; I can see it is set to 50, which might be a bit too frequent? We commonly use 250 for our benchmarks, which also run 5 ns/replica~.
@jchodera maybe you can spot something else here?
EDIT: Check next comment.
@JenkeScheen Oh actually, never mind that previous comment, I mixed up the files: I am actually the one using 50, and you are using 250. That should be okay. Sorry for the noise.
@JenkeScheen I was thinking again about this, and I think it makes sense that you have such big nc files, at least compared to what we commonly get running benchmarks.
I don't know exactly how the information is stored in the netCDF format, but I'm going to guess that the fundamental types are IEEE-754 standard C types (that is, `float` is a 32-bit data type, for practical purposes). Here is a quick comparison:
System | atoms | iterations | replicas | spatial | GBytes of info | nc file size |
---|---|---|---|---|---|---|
Jenke | 4784 | 4807 | 18 | 3 | 9.93 | 7.7G |
tyk2 | 4783 | 2238 | 12 | 3 | 3.08 | 2.4G |
If we compute the ratio of the numbers in the `GBytes of info` column, we get 9.93/3.08 = 3.22, which is very close to the ratio between the numbers in the `nc file size` column, 7.7/2.4 = 3.21. To me this means that you are not storing any extra or undesired data compared to what we already store in the benchmarks. I hope this makes sense and helps.
NOTES:
`GBytes of info` is computed as `2*iteration*replica*atom*spatial*32/8/1e9`: 2 for velocities and positions, 32 for 32 bits, 8 for 8 bits per byte, 1e9 for 1e9 bytes per GB.

thanks @ijpulidos, IIRC the `.nc` files are used for calculating energies - is the entire file needed for that or would a truncated file be enough? I've dealt with similar filetypes before in other FEC codes, but they never really exceeded a few MBs.
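The back-of-the-envelope estimate above can be reproduced in a few lines (pure arithmetic; the dimension values are taken from the comparison table):

```python
def traj_size_gb(iterations, replicas, atoms, spatial=3, bytes_per_float=4):
    """Raw size of positions + velocities (two float32 arrays), in GB."""
    return 2 * iterations * replicas * atoms * spatial * bytes_per_float / 1e9

jenke = traj_size_gb(4807, 18, 4784)  # ~9.93 GB
tyk2 = traj_size_gb(2238, 12, 4783)   # ~3.08 GB
print(f"ratio of estimates:  {jenke / tyk2:.2f}")  # ~3.22
print(f"ratio of file sizes: {7.7 / 2.4:.2f}")     # ~3.21
```

The close agreement between the two ratios is what suggests the big file is just more of the same data (more iterations, more replicas), not extra variables being stored.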
From discussions on our dev syncs, what we want to do here for now is change the default to NOT store any special atom indices in the analysis `.nc` files; this should lower the size of the output by a considerable amount. This is done in the changes in #1185.
When running `perses` on a protein-ligand system we've noticed that the `hybrid factory` of edges is huge. Instead of just serializing out this object, we need to come up with a data model that contains the essential information associated with a calculation.
Settings for the above calculation:
This is with `0.10.1 : pyha21a80b_1`.