MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org

Downloading / Streaming Trajectories from MDDB #4603

Open BradyAJohnston opened 4 months ago

BradyAJohnston commented 4 months ago

It's very early days, but the Molecular Dynamics Database (MDDB) is starting to do some very initial hosting of datasets. (https://mmb.mddbr.eu/#/browse)

I don't suppose integrating with it would be a priority anytime soon, but in the future the ability to download / stream topologies & trajectories would be interesting functionality.

Would this be something within the scope of MDAnalysis? There is an initial REST API, but I believe it's all pretty subject to change over the coming months / years as the project matures a bit more.

IAlibay commented 4 months ago

cc @philbiggin - probably worth checking if this is something already planned within the consortia / something someone is looking to do.

philbiggin commented 4 months ago

I don't believe it's on the near-horizon window if I can put it like that, but certainly worth reminding folk about. Yes - the API is probably susceptible to change as we already know some things that need addressing (although @adamhospital and @d-beltran can comment further for sure!)

d-beltran commented 4 months ago

Hi everyone and thank you for your interest in the MDDB.

You are right, things may change in the long term since this is still a prototype. We could try to be as backward-compatible as possible to support some early integration in MDAnalysis. If you need any assistance to do this, please reach out!

orbeckst commented 4 months ago

ping @hmacdope @ljwoods2

hmacdope commented 2 months ago

@ljwoods2 would you be able to detail our approach here? We have a prototype of H5MD streaming working IIRC (H5MD is slated as a future format for GMX: https://gitlab.com/gromacs/gromacs/-/issues/5016 / MDDB: https://gitlab.com/groups/gromacs/-/epics/5 )

ljwoods2 commented 2 months ago

Yes, I'm working on this for my GSOC project!

The approach right now is to make H5MD files streamable from cloud services in two steps: first, kerchunk reformats the metadata of the h5 file into a form Zarr can parse; then this metadata (containing the byte ranges of the datasets in the h5 file) is passed to fsspec to create a "reference" filesystem that can be opened by Zarr. I've found this site to have the best description of how kerchunk works and how to use it.
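
To make the byte-range idea concrete, here is a toy, stdlib-only sketch. It is not kerchunk's or fsspec's actual API: `write_blob` and `read_ref` are hypothetical helpers, and the `{key: [path, offset, length]}` map is only loosely modelled on the kind of reference entries described above. It shows how a reader can fetch one chunk from a large file given nothing but a small reference map.

```python
import os
import tempfile


def write_blob(path, datasets):
    """Concatenate named byte strings into one file and return a reference map.

    The map has the shape {key: [path, offset, length]}.
    """
    refs = {}
    offset = 0
    with open(path, "wb") as fh:
        for key, payload in datasets.items():
            fh.write(payload)
            refs[key] = [path, offset, len(payload)]
            offset += len(payload)
    return refs


def read_ref(refs, key):
    """Fetch only the bytes for `key`, without reading the rest of the file."""
    path, offset, length = refs[key]
    with open(path, "rb") as fh:
        fh.seek(offset)
        return fh.read(length)


# Demo: two fake "chunks" stored back to back in one local file.
tmpdir = tempfile.mkdtemp()
blob_path = os.path.join(tmpdir, "toy.bin")
refs = write_blob(blob_path, {"position/0": b"AAAA", "position/1": b"BBBBBB"})
chunk = read_ref(refs, "position/1")
```

Against a remote object store, the `seek` plus `read` pair would presumably become a ranged request for the same offset and length, which is what lets a single frame be pulled without downloading the whole file.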

So far, this is working for reading h5md files from s3, but we haven't tested other cloud services yet. This approach also doesn't allow writing h5md files to cloud services; that would require doing something different, like passing an s3fs object to h5py.

You can save the kerchunk-translated h5 metadata to JSON and reuse it later, so that you can access the same remote h5 file again via zarr without the added overhead of converting the byte ranges, compression, etc. a second time. We aren't currently using this in the initial prototype, and I'm not sure whether it would be helpful for something like MDDB.
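
A minimal sketch of that caching pattern, using only the standard library. The names `load_or_build_refs` and `expensive_scan`, and the `s3://bucket/traj.h5md` URL, are hypothetical placeholders rather than anything from kerchunk or the prototype. The point is only that the reference map is plain JSON, so it can be built once and reloaded on later runs.

```python
import json
import os
import tempfile


def load_or_build_refs(cache_path, build):
    """Return cached references if present, otherwise build and cache them."""
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            return json.load(fh)
    refs = build()
    with open(cache_path, "w") as fh:
        json.dump(refs, fh)
    return refs


calls = []


def expensive_scan():
    # Hypothetical stand-in for scanning a remote h5 file's metadata.
    calls.append(1)
    return {"position/0": ["s3://bucket/traj.h5md", 0, 4096]}


cache = os.path.join(tempfile.mkdtemp(), "refs.json")
first = load_or_build_refs(cache, expensive_scan)   # scans and writes the cache
second = load_or_build_refs(cache, expensive_scan)  # reads the cache only
```

A database hosting many trajectories could in principle publish such a JSON file next to each one, so that clients skip the scan entirely; whether that fits MDDB is an open question in this thread.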

Finally, one interesting thing is that since Zarr-python has intentionally made their api (mostly) identical to h5py, and since zarr includes a directory-like layout, groups, datasets, and attributes just like h5, we've been able to easily convert h5md files into zarr files that use the h5md format/directory layout and treat them the same as we would an actual h5-backed h5md file in the file reader. The only caveat for this is that Zarr doesn't yet support linking datasets like h5 does, so the format does not translate perfectly, but in every other way, including api interactions with the file, it is the same. It's not yet clear if zarr will ever support links AFAIK.
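
As a rough illustration of why the shared API matters, the sketch below reads one frame through path-style indexing only, which both h5py and zarr groups support. `read_frame` and `DictGroup` are hypothetical names, `DictGroup` is just a stand-in so the example runs without either library, and the group name `trajectory` is an assumption. The `particles/<group>/position/{step,value}` layout follows the H5MD convention.

```python
class DictGroup:
    """Tiny stand-in exposing the subset of the group API used below."""

    def __init__(self, tree, attrs=None):
        self._tree = tree
        self.attrs = attrs or {}

    def __getitem__(self, path):
        # Resolve "a/b/c" style paths against nested dicts.
        node = self._tree
        for part in path.strip("/").split("/"):
            node = node[part]
        return node


def read_frame(root, group, frame):
    """Return (step, positions) for one frame of an H5MD-style hierarchy."""
    base = f"particles/{group}/position"
    return root[f"{base}/step"][frame], root[f"{base}/value"][frame]


# Demo with two frames of a single particle.
root = DictGroup({"particles": {"trajectory": {"position": {
    "step": [0, 10],
    "value": [[[0.0, 0.0, 0.0]], [[1.0, 2.0, 3.0]]],
}}}})
step, coords = read_frame(root, "trajectory", 1)
```

Because `read_frame` never checks the type of `root`, the same function should work whether `root` is an open h5py file or a zarr group, provided no linked datasets are involved.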