MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org

Implement HDF5 parallel features for H5MDReader #2865

Open edisj opened 3 years ago

edisj commented 3 years ago

Is your feature request related to a problem?

To use h5py's parallel features, you need to pass the driver and comm arguments when you open a file, like this:

f = h5py.File('parallel_test.hdf5', 'w', driver='mpio', comm=MPI.COMM_WORLD)

We'd like to add the ability to use these arguments with the H5MDReader (see PR #2787), but some methods (below) could be a problem because they reopen the stream without passing driver and comm again.

Updated: Following up on issue #2890, to pickle H5MD files opened with driver="mpio" and comm=MPI.COMM_WORLD, we need a way to store the MPI.Comm object used to open the file.

Describe the solution you'd like

I would pull the keyword arguments, driver and comm, out of the mda.Universe arguments and store them as self.driver and self.comm prior to this line: https://github.com/MDAnalysis/mdanalysis/blob/618f7647d3c6c4657945416490191d87adc10fe5/package/MDAnalysis/coordinates/H5MD.py#L316

Then, perhaps _reopen and open_trajectory could perform checks to see if the arguments are passed and instead of closing the file, it rewinds it to the first frame.
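To make the idea concrete, here is a minimal sketch of how a reader could keep driver and comm and then rewind instead of closing. It is not the actual H5MDReader code: SketchReader, FakeFile, and the frame = -1 "before the first frame" convention are hypothetical stand-ins so the example runs without h5py or MPI.

```python
class FakeFile:
    """Stand-in for h5py.File (hypothetical) so the sketch runs without h5py."""

    def __init__(self, filename, mode="r", **kwargs):
        self.filename = filename
        self.mode = mode
        self.kwargs = kwargs  # e.g. {"driver": "mpio", "comm": ...}
        self.closed = False

    def close(self):
        self.closed = True


class SketchReader:
    """Hypothetical reader illustrating the proposal, not H5MDReader itself."""

    def __init__(self, filename, open_file=FakeFile, **kwargs):
        self.filename = filename
        self._open_file = open_file
        # Pull the parallel options out of the keyword arguments and keep them.
        self.driver = kwargs.pop("driver", None)
        self.comm = kwargs.pop("comm", None)
        self._file = None
        self.frame = -1
        self.open_trajectory()

    def _parallel_kwargs(self):
        # Only forward the options that were actually given.
        opts = {}
        if self.driver is not None:
            opts["driver"] = self.driver
        if self.comm is not None:
            opts["comm"] = self.comm
        return opts

    def open_trajectory(self):
        # Open with the stored options so the handle is created the same way.
        self._file = self._open_file(self.filename, "r", **self._parallel_kwargs())
        self.frame = -1  # assumed convention: positioned before the first frame

    def _reopen(self):
        if self.driver is not None or self.comm is not None:
            # Parallel case: keep the handle open and only rewind.
            self.frame = -1
        else:
            # Serial case: close and open again.
            self._file.close()
            self.open_trajectory()
```

With this layout, a reader created with driver and comm keeps the same file handle across _reopen, while a reader created without them still closes and reopens.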

Updated: Store comm as an argument in an __init__() method when calling H5PYPicklable, and use some function that can pickle the MPI.Comm object, similar to https://bitbucket.org/mpi4py/mpi4py/issues/104/pickling-of-mpi-comm
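One possible shape for this, as a rough sketch only: instead of pickling the communicator object itself, store a name for it and look the communicator up again when unpickling. Everything here is an assumption for illustration: PicklableHandle is not H5PYPicklable, and KNOWN_COMMS is a stand-in registry (with mpi4py it might map "COMM_WORLD" to MPI.COMM_WORLD, which is not tested here).

```python
# Stand-in registry of predefined communicators (hypothetical).
# Plain objects are used so the sketch runs without mpi4py.
KNOWN_COMMS = {"COMM_WORLD": object(), "COMM_SELF": object()}


class PicklableHandle:
    """Hypothetical file handle that remembers how it was opened."""

    def __init__(self, filename, driver=None, comm_name=None):
        self.filename = filename
        self.driver = driver
        self.comm_name = comm_name
        # Resolve the name to the live communicator object.
        self._comm = KNOWN_COMMS[comm_name] if comm_name is not None else None

    def __getstate__(self):
        # Store only plain, picklable values; the communicator is left out.
        return {
            "filename": self.filename,
            "driver": self.driver,
            "comm_name": self.comm_name,
        }

    def __setstate__(self, state):
        # Rebuild the handle, looking the communicator up by name again.
        self.__init__(**state)
```

With these two methods, pickle.loads(pickle.dumps(handle)) would give a new handle that points at the same registered communicator. This only covers communicators that can be identified by name; communicators created at runtime would need a different mechanism.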

Describe alternatives you've considered

Can't think of any other way at the moment

Additional context

H5MD format, pyh5md package, h5py documentation


EDIT: Updated issue text after issue #2890

orbeckst commented 3 years ago

Do we have a way to examine the h5py.File object and know that it has a driver and/or a comm set? If so, then we could do what you suggest and only do a seek to the beginning instead of close/open if we know that we don't have enough information to reopen in the same way.

We almost certainly also need to think about this in the context of @yuxuanzhuang's picklable/serializable readers, see PR #2723. (At the moment I don't know if it will be necessary to serialize a reader that's already using an MPI communicator. Normally we would launch multiple copies of the same script with mpirun, and we would not require a serialization mechanism unless we want to mix, say, Dask with MPI via dask-mpi.)

edisj commented 3 years ago

That's a good idea. I'll be able to check once I play around with mpi4py. I've managed to build parallel hdf5 and have parallel h5py and mpi4py installed on the workstation. Just trying to copy over my branch's mdanalysis so it should be up and running soon.

edisj commented 3 years ago

So from what I can find so far, there are a couple of ways to see if the file has been opened with parallel drivers:

The nice way is to do something like

f = h5py.File('filename.h5md', 'r')
f.driver 

which will spit out 'mpio' if the file was opened with the parallel driver. I think this is a convenient way to check. I think all files opened with h5py.File have a driver attribute. The h5py documentation has a list of the available drivers.

The other way is to do f.atomic which raises an error if the driver isn't 'mpio' (I'm not sure how it works with other drivers though). But in any case I don't think we'd use that to check

I think just checking f.driver should work nicely. What do you think?
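As a small sketch of that check: the helper names opened_with_mpio and reopen_strategy are made up for illustration, and SimpleNamespace objects stand in for h5py.File so the example runs without parallel HDF5. The 'sec2' value is what I'd expect as the default driver on Unix, but treat it as an assumption.

```python
from types import SimpleNamespace


def opened_with_mpio(h5file):
    """Return True if the file object reports the 'mpio' driver."""
    # getattr with a default keeps this safe for objects without .driver
    return getattr(h5file, "driver", None) == "mpio"


def reopen_strategy(h5file):
    """Pick how to reset the file: rewind for parallel, reopen otherwise."""
    return "rewind" if opened_with_mpio(h5file) else "close_and_reopen"


# Stand-ins for h5py.File objects (hypothetical, no h5py needed).
serial_file = SimpleNamespace(driver="sec2")
parallel_file = SimpleNamespace(driver="mpio")

print(reopen_strategy(serial_file))
print(reopen_strategy(parallel_file))
```

The reader could call something like reopen_strategy on its open file handle and only seek back to the beginning when the parallel driver is in use.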

orbeckst commented 3 years ago

We will want to check whether one can serialize a parallel H5MDReader, i.e. whether MPI.Comm can be serialized. This will determine how we can use parallel reading. See also #2890.