choderalab / perses

Experiments with expanded ensembles to explore chemical space
http://perses.readthedocs.io
MIT License
178 stars 50 forks source link

`out-solvent.nc` is larger than `out-complex.nc` #1172

Closed JenkeScheen closed 1 year ago

JenkeScheen commented 1 year ago

When running perses on a protein-ligand system we've noticed that the out-solvent.nc is larger than out-complex.nc:

128M    P0240-His41(0)-Cys145(0)-His163(0)-3-5/out-solvent.nc
80M P0240-His41(0)-Cys145(0)-His163(0)-3-5/out-complex.nc

This seems like a bug which when resolved could save a lot of storage

JenkeScheen commented 1 year ago

(copying in some of @jchodera's notes):

It’s not obvious what object is the culprit:


$ ncdump -h out-solvent.nc
netcdf out-solvent {
dimensions:
scalar = 1 ;
iteration = UNLIMITED ; // (2536 currently)
spatial = 3 ;
analysis_particles = 31 ;
fixedL55052 = 55052 ;
fixedL1052 = 1052 ;
fixedL1047 = 1047 ;
fixedL1040 = 1040 ;
fixedL1045 = 1045 ;
fixedL1042 = 1042 ;
fixedL935 = 935 ;
fixedL24849 = 24849 ;
fixedL25035 = 25035 ;
fixedL3 = 3 ;
atom = 31 ;
replica = 18 ;
state = 18 ;
unsampled = 2 ;
variables:
int64 last_iteration(scalar) ;
int64 analysis_particle_indices(analysis_particles) ;
analysis_particle_indices:long_name = "analysis_particle_indices[analysis_particles] is the indices of the particles with extra information stored about them in theanalysis file." ;
string options(scalar) ;
char metadata(fixedL3) ;
float positions(iteration, replica, atom, spatial) ;
positions:units = "nm" ;
positions:long_name = "positions[iteration][replica][atom][spatial] is position of coordinate \'spatial\' of atom \'atom\' from replica \'replica\' for iteration \'iteration\'." ;
float velocities(iteration, replica, atom, spatial) ;
velocities:units = "nm / ps" ;
velocities:long_name = "velocities[iteration][replica][atom][spatial] is velocity of coordinate \'spatial\' of atom \'atom\' from replica \'replica\' for iteration \'iteration\'." ;
float box_vectors(iteration, replica, spatial, spatial) ;
box_vectors:units = "nm" ;
box_vectors:long_name = "box_vectors[iteration][replica][i][j] is dimension j of box vector i for replica \'replica\' from iteration \'iteration-1\'." ;
double volumes(iteration, replica) ;
volumes:units = "nm**3" ;
volumes:long_name = "volume[iteration][replica] is the box volume for replica \'replica\' from iteration \'iteration-1\'." ;
int states(iteration, replica) ;
states:units = "none" ;
states:long_name = "states[iteration][replica] is the thermodynamic state index (0..n_states-1) of replica \'replica\' of iteration \'iteration\'." ;
double energies(iteration, replica, state) ;
energies:units = "kT" ;
energies:long_name = "energies[iteration][replica][state] is the reduced (unitless) energy of replica \'replica\' from iteration \'iteration\' evaluated at the thermodynamic state \'state\'." ;
byte neighborhoods(iteration, replica, state) ;
neighborhoods:_FillValue = 1b ;
neighborhoods:long_name = "neighborhoods[iteration][replica][state] is 1 if this energy was computed during this iteration." ;
double unsampled_energies(iteration, replica, unsampled) ;
unsampled_energies:units = "kT" ;
unsampled_energies:long_name = "unsampled_energies[iteration][replica][state] is the reduced (unitless) energy of replica \'replica\' from iteration \'iteration\' evaluated at unsampled thermodynamic state \'state\'." ;
int accepted(iteration, state, state) ;
accepted:units = "none" ;
accepted:long_name = "accepted[iteration][i][j] is the number of proposed transitions between states i and j from iteration \'iteration-1\'." ;
int proposed(iteration, state, state) ;
proposed:units = "none" ;
proposed:long_name = "proposed[iteration][i][j] is the number of proposed transitions between states i and j from iteration \'iteration-1\'." ;
string timestamp(iteration) ;

// global attributes: :UUID = "9768ef57-c781-4aa3-bb6b-6c25f9de3a06" ; :application = "YANK" ; :program = "yank.py" ; :programVersion = "0.21.5" ; :Conventions = "ReplicaExchange" ; :ConventionVersion = "0.2" ; :DataUsedFor = "analysis" ; :CheckpointInterval = 250LL ; :title = "Replica-exchange sampler simulation created using ReplicaExchangeSampler class of openmmtools.multistate on Fri Mar 17 10:56:53 2023" ;

group: thermodynamic_states { variables: char state0(fixedL55052) ; char state1(fixedL1052) ; char state2(fixedL1047) ; char state3(fixedL1040) ; char state4(fixedL1047) ; char state5(fixedL1045) ; char state6(fixedL1040) ; char state7(fixedL1040) ; char state8(fixedL1045) ; char state9(fixedL1042) ; char state10(fixedL1042) ; char state11(fixedL1040) ; char state12(fixedL1040) ; char state13(fixedL1040) ; char state14(fixedL1040) ; char state15(fixedL1040) ; char state16(fixedL1040) ; char state17(fixedL935) ; } // group thermodynamic_states

group: unsampled_states { variables: char state0(fixedL24849) ; char state1(fixedL25035) ; } // group unsampled_states

group: mcmc_moves { variables: string move0(scalar) ; string move1(scalar) ; string move2(scalar) ; string move3(scalar) ; string move4(scalar) ; string move5(scalar) ; string move6(scalar) ; string move7(scalar) ; string move8(scalar) ; string move9(scalar) ; string move10(scalar) ; string move11(scalar) ; string move12(scalar) ; string move13(scalar) ; string move14(scalar) ; string move15(scalar) ; string move16(scalar) ; string move17(scalar) ; } // group mcmc_moves

group: online_analysis { dimensions: dim_size18 = 18 ; dim_size2 = 2 ; dim_size20 = 20 ; variables: double f_k(dim_size18) ; double free_energy(dim_size2) ; double f_k_history(iteration, dim_size18) ; double free_energy_history(iteration, dim_size2) ; double f_k_offline(dim_size20) ; double f_k_offline_history(iteration, dim_size20) ; } // group online_analysis }


>We might need to figure out whether `ncdump` or another tool can better inspect the contents. `ncdump` can also serialize to other formats---might need to do that and inspect whether there is a big blob there.

>It’s also possible it could be something in the `MultiStateReporter` that automatically sets the chunk size, and ends up making the file huge for no reason.
There’s stuff in there that sets `chunksize` to enlarge the NetCDF file to leave room whenever it grows a variable in the extensible dimension, e.g.
https://github.com/choderalab/openmmtools/blob/main/openmmtools/multistate/multistatereporter.py#L1358

>It’s possible we are setting this way too high for something in the solvent file and it’s literally expanding the NetCDF file with blank data.

>The chunksizes are supposed to be set just large enough to not incur a performance hit in reading and writing by continually having to resize the NetCDF file and add more “continued over here” sections
JenkeScheen commented 1 year ago

looking into the edges for this run a bit more closely this seems to have been a one-off behaviour, see /ifs/rtsia01/chodera/asap/asap/project_support/2023/03_jguven/perses/perses.