choderalab / openmmtools

A batteries-included toolkit for the GPU-accelerated OpenMM molecular simulation engine.
http://openmmtools.readthedocs.io
MIT License

Parallelize storage reading/writing #429

andrrizzi commented 5 years ago

I'm planning to make some changes in the Reporter to split the two monolithic netcdf files into more manageable chunks. This is what I'm thinking now:

I think splitting the data over multiple small files (whose directory structure is hidden by the Reporter class) will make reading operations faster. Moreover, we'll be able to parallelize writing to disk, which is currently a big bottleneck for multi-replica methods.
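
For illustration, something along these lines (class name and file layout are hypothetical, not a final design):

```python
import os

# Hypothetical sketch only: a reporter that hides a per-replica file layout,
# e.g. <directory>/trajectories/replica_0003.xtc, behind its own API.
class MultiFileReporter:
    def __init__(self, directory):
        self._directory = directory
        os.makedirs(os.path.join(directory, "trajectories"), exist_ok=True)

    def _replica_trajectory_path(self, replica_index):
        return os.path.join(self._directory, "trajectories",
                            f"replica_{replica_index:04d}.xtc")
```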

Question: Should we keep the old reporter around (maybe renamed to NetCDFReporter) so that data generated with previous versions can still be read, or do we anticipate that installing a previous version of openmmtools/yank will suffice for our needs?

sroet commented 5 years ago

I do not use this personally, so no strong opinion either way.

I did have a small remark on the proposal:

Checkpoint trajectory: One xtc file per replica.

You probably do not want to use a lossy format like xtc (which stores positions to roughly 1e-3 nm) for checkpoint trajectories.
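
For instance (a quick sketch with mdtraj and any small PDB file, here called system.pdb), a round trip through XTC shows the truncation:

```python
import numpy as np
import mdtraj as md

# Sketch of the XTC round-trip error: coordinates are stored as fixed-point
# integers (default precision 1000 per nm), so reloaded positions can differ
# from the originals by up to ~1e-3 nm. "system.pdb" is a placeholder input.
reference = md.load("system.pdb")
reference.save_xtc("roundtrip.xtc")
roundtrip = md.load("roundtrip.xtc", top=reference.topology)
print("max abs error (nm):", np.abs(roundtrip.xyz - reference.xyz).max())
```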

andrrizzi commented 5 years ago

You probably do not want to use a lossy format like xtc (which stores positions to roughly 1e-3 nm) for checkpoint trajectories.

Thank you so much for your input, @sroet. I wasn't aware of that, and I think it's a fair concern. We may want to save the checkpoint in DCD format instead, then.

We also need to think about how to handle the restraint reweighting, which relies on the solute trajectory. A solution that would let us store it at low precision would be to add an option to compute and store restraint distances and energies at runtime. This feature was actually transferred here from YANK with the multistate module, but maybe we want to handle that only in YANK to keep things simpler here.

kyleabeauchamp commented 5 years ago

1. Are there parallel backends for netCDF that would also allow parallelism without a refactor?
2. You might want to copy some ideas from xarray, which is now used by arviz as a backend for MCMC trace storage. It won't have exactly what you need, but the overall structure might be inspiring: https://arviz-devs.github.io/arviz/notebooks/XarrayforArviZ.html
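
For example (dimension and variable names below are purely illustrative, not an existing openmmtools schema):

```python
import numpy as np
import xarray as xr

# Illustrative only: replica-exchange-like data as labeled arrays that can be
# written to / read from netCDF through xarray.
n_iterations, n_replicas, n_states = 10, 4, 4
ds = xr.Dataset(
    {
        "energies": (("iteration", "replica", "state"),
                     np.zeros((n_iterations, n_replicas, n_states))),
        "replica_state_indices": (("iteration", "replica"),
                                  np.zeros((n_iterations, n_replicas), dtype=int)),
    },
    coords={"iteration": np.arange(n_iterations)},
)
ds.to_netcdf("multistate.nc")                               # netCDF4/HDF5 on disk
subset = xr.open_dataset("multistate.nc").isel(replica=0)   # lazy, labeled access
```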

jchodera commented 5 years ago

I agree with @sroet that it would be best not to use XTC for checkpoints.

Are the checkpoint files particularly problematic, or could we keep those in NetCDF format?

Also tagging @jaimergp for input.

jchodera commented 5 years ago

Question: Should we keep the old reporter around (maybe renamed to NetCDFReporter) so that data generated with previous versions can still be read, or do we anticipate that installing a previous version of openmmtools/yank will suffice for our needs?

That would be ideal if it doesn't become a huge pain to maintain. But you may find that you need to refactor the storage layer API, which would require more effort to update the NetCDFStorageLayer.

In particular, the current storage layer API has some issues: it either (1) exposes ultra-specific interfaces for particular variables rather than general ones, or (2) is missing interfaces for some variables entirely. It might be useful to first evaluate whether the storage layer API needs to be refactored before implementing a new storage class.
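
To make the distinction concrete (method names below are invented for illustration, not the current Reporter API):

```python
# Hypothetical illustration of the two API styles mentioned above.

class VariableSpecificStorage:
    # (1) one method per variable; anything without a dedicated method
    # (new observables, metadata, ...) simply cannot be stored.
    def write_energies(self, energies, iteration): ...
    def write_replica_state_indices(self, state_indices, iteration): ...

class GeneralStorage:
    # A more general interface: arbitrary named arrays and serialized objects.
    def write_array(self, name, array, iteration): ...
    def write_object(self, name, serialized): ...   # thermo states, MCMC moves, metadata
```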

jchodera commented 5 years ago

One more consideration: Ideally, for maximum performance, data to be written to disk would be offloaded to another thread to flush to disk asynchronously. That will allow the simulation to continue on the GPU while the disk write is pending. We want to do this in a way that allows us to robustly resume from the last checkpoint.
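
Something along these lines (a minimal sketch; the reporter object and its write_iteration() method are hypothetical):

```python
import queue
import threading

# Sketch only: the simulation thread enqueues data and keeps running on the
# GPU; a writer thread flushes to disk and records the last iteration that is
# safely on disk, which is what a robust resume would fall back to.
class AsyncWriter:
    def __init__(self, reporter):
        self._reporter = reporter              # hypothetical object with write_iteration()
        self._queue = queue.Queue(maxsize=2)   # bound memory held by pending writes
        self.last_flushed_iteration = None
        self._thread = threading.Thread(target=self._worker, daemon=True)
        self._thread.start()

    def submit(self, iteration, data):
        self._queue.put((iteration, data))     # returns quickly unless the queue is full

    def _worker(self):
        while True:
            iteration, data = self._queue.get()
            if iteration is None:              # sentinel: shut down
                return
            self._reporter.write_iteration(data)   # blocking disk write
            self.last_flushed_iteration = iteration

    def close(self):
        self._queue.put((None, None))
        self._thread.join()
```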

andrrizzi commented 5 years ago

Thank you very much for the tips, @kyleabeauchamp.

Are there parallel backends for netCDF that would also allow parallelism without a refactor?

Are the checkpoint files particularly problematic, or could we keep those in NetCDF format?

That's a good point. I think I've seen netCDF APIs that allow writing in parallel; I'll check whether they present extra difficulties in terms of deployment/maintenance. This may help greatly with the trajectory I/O. For the data that doesn't fit a numerical array model (such as the object serialization of thermodynamic states, MCMC moves, and metadata), I'd still suggest a refactoring: our current strategy for serializing generic types to netcdf is quite limited and cumbersome.

One of the advantages I see in splitting the trajectory over multiple files in a common format (including AMBER NetCDF) is that people will be able to just load the replica trajectories in VMD or PyMOL instead of having to go through the expensive extract_trajectory() function. "How do I check the trajectory?" is one of the most common questions I hear. I think AMBER NetCDF has a remd_dimension, but I'm not sure common visualization software supports it.

andrrizzi commented 5 years ago

It looks like conda-forge provides HDF5 libraries compiled with MPI enabled as of Nov 2018 (conda-forge/hdf5-feedstock#51) so netcdf with parallel writing should be a feasible option if we decide to go for it.

The only advantage of individual trajectory files at this point would be easier visualization.

For reference: https://unidata.github.io/netcdf4-python/netCDF4/index.html#section13
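
For context, a rough sketch of what MPI-parallel writes look like with netCDF4-python (following the parallel I/O section linked above; dimension and variable names are placeholders, and it requires netCDF4/HDF5 built against MPI):

```python
import numpy as np
from mpi4py import MPI
from netCDF4 import Dataset

# Each MPI rank writes its own, non-overlapping replica slice of the same file.
comm = MPI.COMM_WORLD
nc = Dataset("positions.nc", "w", parallel=True, comm=comm, info=MPI.Info())
nc.createDimension("replica", comm.size)
nc.createDimension("atom", 100)
nc.createDimension("spatial", 3)
positions = nc.createVariable("positions", "f4", ("replica", "atom", "spatial"))
positions[comm.rank, :, :] = np.random.rand(100, 3)
nc.close()
```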

andrrizzi commented 5 years ago

Ah! But conda-forge/libnetcdf is built only with openmpi, not mpich (and only since a couple of months ago), so it may be more complicated than I thought (conda-forge/netcdf4-feedstock#43). It might be worth checking whether they can build an mpich variant as well.

andrrizzi commented 5 years ago

I took a look at what it will take to get an mpich variant built on the netcdf4 feedstock. It looks quite easy, and they seem open to external contributions, so I'm thinking that for now we just leave the format untouched and try parallel writing plus tuning the chunk size. I think this will give us the biggest speedup for the least effort right now.
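
For example, the chunking knob could look something like this (dimension and variable names are made up, not the actual Reporter schema):

```python
import netCDF4

# Sketch: chunk the positions variable one replica-frame at a time, so reading
# a single frame, a single replica, or a single state touches few chunks.
nc = netCDF4.Dataset("example.nc", "w")
nc.createDimension("iteration", None)            # unlimited
nc.createDimension("replica", 24)
nc.createDimension("atom", 5000)
nc.createDimension("spatial", 3)
nc.createVariable(
    "positions", "f4", ("iteration", "replica", "atom", "spatial"),
    chunksizes=(1, 1, 5000, 3),                  # one replica-frame per chunk
    zlib=True,                                   # compression (not available with parallel writes)
)
nc.close()
```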

If there are no objections, I'd close this issue and open a new one about implementing parallel netcdf4. We can always re-open if it turns out we can improve things from there with one of these strategies.

jchodera commented 5 years ago

If there are no objections, I'd close this issue and open a new one about implementing parallel netcdf4. We can always re-open if it turns out we can improve things from there with one of these strategies.

Can we discuss Monday? I'm not sure parallel NetCDF4 will solve all our problems, and especially not the problems of processing NetCDF files afterwards to extract trajectories or of reducing file sizes. It may not solve our asynchronous-write-thread issue either. And it will introduce problems down the road if we move away from mpich as our sole parallelization strategy.

andrrizzi commented 5 years ago

Can we discuss Monday?

Sure! I just found out that parallel netcdf supports neither compression nor chunk caching. This might be OK for the checkpoint trajectory, but it might degrade performance for the solute trajectory if only a few MPI processes are used.

Because we already know replicas are not going to overwrite each other, splitting the solute trajectory into multiple files (xtc, or netcdf files with lossy compression) might be a good strategy after all. It would also make it possible to read/write with pure Python threads without running into HDF5 locking problems.
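
Something like this, roughly (one XTC file per replica written from Python threads via mdtraj's XTCTrajectoryFile; replica/atom counts and file names are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np
from mdtraj.formats import XTCTrajectoryFile

# Sketch only: each replica streams frames to its own XTC file from a thread,
# so no HDF5 handle is shared between threads. Handles stay open for the run.
n_replicas, n_atoms, n_iterations = 4, 100, 10
handles = [XTCTrajectoryFile(f"replica_{i:04d}.xtc", "w") for i in range(n_replicas)]

def write_frame(replica_index, xyz):
    handles[replica_index].write(xyz.reshape(1, -1, 3))  # one frame, positions in nm

with ThreadPoolExecutor(max_workers=n_replicas) as pool:
    for _ in range(n_iterations):
        frame = np.random.rand(n_replicas, n_atoms, 3)   # placeholder positions
        list(pool.map(write_frame, range(n_replicas), frame))

for handle in handles:
    handle.close()
```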

jaimergp commented 5 years ago

From the usability point of view, one of the comments I have received in my previous projects is that the number of files generated by the software was too high. This is particularly true with new users: they see lots of files and get overwhelmed, because they want to find "their results" easily and (at first) do not care about contextual data meant for reproducibility.

If we anticipate this being a problem here (for example, one YAML file per MC move), I have found a combination of solutions that mitigate this "shock" factor. In decreasing relevance:

I am aware that this might not be a problem for established users willing to devote some time to understanding what they are doing, but it could help those who just want to try out whether this thing works.

andrrizzi commented 5 years ago

@jaimergp and I had a brief chat. If there are no objections, the current plan is to have @jaimergp implement splitting the solute-only trajectory into multiple xtc files, and to re-evaluate the other possible solutions after this first pass.

EDIT: By "re-evalutating the other possible solutions" I mean re-evaluating whether splitting the objects serialization of, for example, thermo states and mcmc move in multiple files is worth the effort, and whether it is a good idea to use parallel netcdf for the checkpoint trajectory and the other netcdf-stored arrays.

andrrizzi commented 5 years ago

For the record, we are going to merge #434, which splits solute-only trajectories out of the main netcdf file and saves them to separate xtc files, into a parallel-writing branch. Right now that PR simply changes the format; we'll implement the parallelization separately after we have verified that single-MPI-process performance is not degraded on the cluster as well. We think we can further speed up writing by implementing an "append" mode in MDTraj, since right now we're simply writing the frames twice (and still obtaining essentially identical performance w.r.t. netcdf).

Olllom commented 4 years ago

Are you still working on this? Without parallel NetCDF, the storage file is such a pain in the neck.

@jaimergp I can see how storing everything in the same file can be advantageous. At the same time, it complicates more advanced analyses.

In general, access to the storage file is too slow. Just a few examples from our recent Hamiltonian replica exchange simulations of membranes.

In my opinion, a lot of things would be much easier if each replica wrote its own trajectory file. The energies, mixing statistics, thermodynamic states, and (maybe) checkpoints could still live in a (much smaller) netcdf file. This would allow easier access to all the information. I would also not mind implementing some of the refactoring, if you agree that it would make things better.

Or am I missing something obvious?

jchodera commented 4 years ago

I totally agree about the current pain of using the single NetCDF file, and am hoping we can split both the checkpoint files and solute-only files into separate XTC or DCD files, leaving only the smaller numpy arrays in the NetCDF file, without too much pain.

Longer term, we would love to switch to some sort of distributed database that can handle multiple calculations streaming to it at once, but we haven't started to design this yet.

andrrizzi commented 4 years ago

@jaimergp implemented the parallel XTC files, and we have now merged them into a separate feature branch (parallel-writing). If I remember correctly, it's not in master right now because we didn't observe a substantial speedup over netcdf (see Jaime's timings in #434), although we still need to test it on real calculations and there is a lot of room for improvement. For example, I think we are essentially writing to disk twice right now because MDTraj's XTC file object does not support appending.

If you want to try the current state, let me know and I can update it with the new code from master.

Extracting trajectory information from storage file takes hours.

For this, the bulk of the computation is in imaging the trajectory, I believe. Having per-replica xtc files means we'll have to penalize reading the trajectory along a state in favor of replica trajectories. The netcdf file instead allows blocking the file by frame, which means reading state or replica trajectories is roughly equally expensive, so this issue may turn out to be quite complicated performance-wise.

jaimergp commented 4 years ago

I will write here some of the ideas we got after our talk with @Olllom.

There are performance issues when resuming calculations: all of the replicas accessing the monolithic NetCDF file at the same time becomes a bottleneck. It's not clear whether this is due to (A) read operations being blocking, or (B) the machine's IO bandwidth being saturated. @Olllom, could you check this?
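
Not sure if it helps, but here is a rough way to tell the two apart (file path and variable name are placeholders): time the same read from every MPI rank and look at the aggregate throughput. If it plateaus near the single-rank value and far below the disk's sequential bandwidth, the reads are likely serializing (A); if it reaches the disk's limit, the bandwidth is saturated (B).

```python
import time
import netCDF4
from mpi4py import MPI

# Diagnostic sketch: every rank reads the same slice and reports its throughput.
comm = MPI.COMM_WORLD
comm.Barrier()
start = time.time()
with netCDF4.Dataset("multistate.nc", "r") as nc:
    data = nc.variables["positions"][-100:]      # last 100 iterations
elapsed = time.time() - start
nbytes = data.size * data.dtype.itemsize
print(f"rank {comm.rank}: {nbytes / elapsed / 1e6:.1f} MB/s over {elapsed:.2f} s")
```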

If option (A) ends up being the problem, we could devise a quick short-term "fix" before the DAG-aware refactoring. This could consist of an optional keyword (disabled by default) that writes all the data needed for resuming calculations into separate files (format to be determined), in addition to the NetCDF file. We would make sure that there are no performance problems with this fast-access alternative (e.g. multiple files, memory cache, etc.). Would you be OK with this approach, @Olllom?

If the IO bandwidth is being saturated, well, I don't think there's much else we could do...