Implement GSD output file for Cassandra

rsdefever commented 3 years ago

Is your feature request related to a problem? Please describe.

Cassandra currently saves the trajectory output to an xyz file. Though this is a general format, it contains minimal information about the trajectory, and contains no box information.

Describe the solution you'd like

An alternative output format is General Simulation Data (GSD), the native output format for HOOMD-blue. In GSD, particle and topology properties can vary from one frame to the next. It has a Python API, can be read by tools like freud and OVITO, and is a binary format that supports random frame access.

There is a C API, so I think with Fortran-C interoperability we could use the format without too much difficulty.

I'd like to get some input from the other developers on this idea.

ejmaginn commented 3 years ago

This sounds like a very good idea. Can we discuss this next time we meet? One question is should there be a single output format and then we have converters/translators from this? Does this already exist?

emarinri commented 3 years ago

Sounds like an interesting idea. A few questions:

1) Does it handle frames with changing number of particles? (i.e. GCMC, RxMC)
2) Does it handle frames with multiple boxes (i.e. GEMC, > 2 box simulations)?
3) What are its dependencies? Would it affect how we build Cassandra at the moment?
4) Does the format change a lot as it is being developed or is it stable already?
5) Beyond Python, are there APIs for other languages?
6) I haven't played with it yet. How do you like its documentation?
7) How does it compare with the TNG format in GROMACS (see https://pubmed.ncbi.nlm.nih.gov/24258850/)?

rsdefever commented 3 years ago

Great questions. I'll answer what I can for now:

> One question is should there be a single output format and then we have converters/translators from this? Does this already exist?

I think this is a reasonable idea. Especially to maintain xyz support. GSD->xyz would be trivial. I don't know to what degree other converters already exist.
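As a rough illustration of why GSD->xyz should be simple, here is a minimal Python sketch. It assumes frames that follow the HOOMD schema as exposed by the `gsd.hoomd` Python module (`particles.N`, `particles.position`, `particles.types`, `particles.typeid`, `configuration.box`). The helper name `frames_to_xyz` is hypothetical, and putting the box on the xyz comment line is an arbitrary choice, not a standard. Stand-in frames are used so the sketch runs without gsd installed.

```python
import io
from types import SimpleNamespace


def frames_to_xyz(frames, out):
    """Write frames to an open text stream in xyz format.

    Each frame is assumed to look like a gsd.hoomd frame: it has
    particles.N, particles.position, particles.types, particles.typeid
    and configuration.box = [Lx, Ly, Lz, xy, xz, yz].
    """
    for frame in frames:
        n = int(frame.particles.N)
        out.write(f"{n}\n")
        # xyz has no box field, so the box goes on the free-form comment line.
        box = " ".join(f"{float(v):g}" for v in frame.configuration.box)
        out.write(f"box: {box}\n")
        for i in range(n):
            name = frame.particles.types[frame.particles.typeid[i]]
            x, y, z = (float(c) for c in frame.particles.position[i])
            out.write(f"{name} {x:.6f} {y:.6f} {z:.6f}\n")


def _demo_frame(positions, typeid):
    # Stand-in for a gsd.hoomd frame (assumption: same attribute names).
    return SimpleNamespace(
        particles=SimpleNamespace(
            N=len(positions), position=positions,
            types=["C", "O"], typeid=typeid),
        configuration=SimpleNamespace(box=[10.0, 10.0, 10.0, 0.0, 0.0, 0.0]),
    )


# Two frames with different particle counts, as in a GCMC run.
demo = [
    _demo_frame([[0.0, 0.0, 0.0]], [0]),
    _demo_frame([[0.0, 0.0, 0.0], [1.0, 0.5, -2.0]], [0, 1]),
]
buf = io.StringIO()
frames_to_xyz(demo, buf)
xyz_text = buf.getvalue()
```

With the gsd package installed, the same function should work on a trajectory returned by `gsd.hoomd.open(path)`, since it only iterates over frames and reads those attributes.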

> Does it handle frames with changing number of particles? (i.e. GCMC, RxMC)

Yes. From their docs: "GSD allows all particle and topology properties to vary from one frame to the next."

> Does it handle frames with multiple boxes (i.e. GEMC, > 2 box simulations)?

Overall it's a flexible format but I'm not sure about this. Also, would we want to write two boxes into a single trajectory file? I'm not sure if that will cause problems with visualization software...or analysis.

> What are its dependencies? Would it affect how we build Cassandra at the moment?

I don't think there are any dependencies. See here. In terms of compilation I imagine we will have to make a few minor changes to support the Fortran/C interop.

> Does the format change a lot as it is being developed or is it stable already?

It is currently on the 2.x releases, AFAIK it should be stable and predictable.

> Beyond Python, are there APIs for other languages?

Python and C APIs are available.

> I haven't played with it yet. How do you like its documentation?

The docs I sent are pretty low-level, because they are docs for the GSD format, which is flexible. I think they are pretty complete. We would want to add some details into the Cassandra documentation in terms of "how to use GSD with Cassandra".

> How does it compare with the TNG format in GROMACS (see https://pubmed.ncbi.nlm.nih.gov/24258850/)?

No idea. Great question though.

@joaander or @bdice can you provide some feedback on these questions when you have some free time? Thanks 👍

joaander commented 3 years ago

> > Does it handle frames with changing number of particles? (i.e. GCMC, RxMC)

> Yes. From their docs: "GSD allows all particle and topology properties to vary from one frame to the next."

Yes, this feature was one of the main reasons I implemented GSD.

> > Does it handle frames with multiple boxes (i.e. GEMC, > 2 box simulations)?

> Overall it's a flexible format but I'm not sure about this. Also, would we want to write two boxes into a single trajectory file? I'm not sure if that will cause problems with visualization software...or analysis.

GSD does not allow storing multiple boxes in a single file. As @rsdefever says, this is not something most visualization tools have native support for. You can always write multiple trajectory files.
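One way to picture the "multiple trajectory files" approach for a multi-box run such as GEMC is one file per box, read back in lockstep. This is only a sketch: the file naming pattern and both helper names are hypothetical, not Cassandra's actual convention, and plain lists stand in for the per-box trajectories.

```python
def box_trajectory_paths(run_name, n_boxes):
    """Return one trajectory path per simulation box (hypothetical naming)."""
    return [f"{run_name}.box{i + 1}.gsd" for i in range(n_boxes)]


def iter_boxes_in_lockstep(trajectories):
    """Yield tuples holding the same frame index from every box trajectory.

    Each trajectory can be any iterable of frames; iteration stops at the
    shortest one, so a truncated file does not raise.
    """
    for frames in zip(*trajectories):
        yield frames


paths = box_trajectory_paths("gemc_run", 2)

# Stand-in data: per-box particle counts for three saved frames.
box1 = [100, 98, 101]
box2 = [50, 52, 49]
totals = [sum(frames) for frames in iter_boxes_in_lockstep([box1, box2])]
```

Reading the boxes together like this keeps per-frame checks, such as the total particle count across boxes, straightforward even though each file holds a single box.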

> > What are its dependencies? Would it affect how we build Cassandra at the moment?

> I don't think there are any dependencies. See here. In terms of compilation I imagine we will have to make a few minor changes to support the Fortran/C interop.

GSD's C API is implemented in a single .h and single .c file using only C standard libraries. It builds on Linux, Mac, and Windows. I developed it with a pure C API to enable interoperability with as many tools as possible.

> > Does the format change a lot as it is being developed or is it stable already?

> It is currently on the 2.x releases, AFAIK it should be stable and predictable.

Yes, it is very stable. 1.x was released in 2016. The 2.0.0 release (2020-02) made a slight tweak to the binary index representation to improve performance with many small data chunks. It is fully backwards compatible with 1.x files (it can read and write 1.x files).

> > I haven't played with it yet. How do you like its documentation?

> The docs I sent are pretty low-level, because they are docs for the GSD format, which is flexible. I think they are pretty complete. We would want to add some details into the Cassandra documentation in terms of "how to use GSD with Cassandra".

One of the things you will need to document is units. HOOMD assumes an undefined, but self-consistent set of units. Let me know if you have any questions or if there are areas that docs could be improved.

> > How does it compare with the TNG format in GROMACS (see https://pubmed.ncbi.nlm.nih.gov/24258850/)?

> No idea. Great question though.

I evaluated TNG before developing GSD. It is a much larger package with a more complex API and is primarily focused on compressing multiple frames together to reduce file size when the number of timesteps between frames is small. This assumes particles move very little from one step to the next. I don't recall all the considerations that went into my decision, but TNG (like most molecular file formats) lacks several features that we need for hoomd (particle orientations, angular momentum). I don't know if it supports variable numbers of particles.

I also looked into using HDF5 before writing GSD. HDF5 is a generic container format that can store dense data arrays by name. The HOOMD schema for GSD is based on this work. Ultimately, I chose not to go with HDF5 because it is a massive (many millions of lines of code) dependency that is not easy to install and there is no straightforward way to express variable numbers of particles per frame.

emarinri commented 3 years ago

Thanks for the replies @rsdefever @joaander! I think it sounds like a good idea. I have been thinking of using/creating a format like GSD for Cassandra to facilitate analysis from Python.

Sounds like users could select GSD or XYZ as an output format. An advantage of using GSD is that users could use the libraries MDAnalysis or Freud for analysis?

@rsdefever Let me know what you and Ed decide. Hopefully @ShahResearchGroup could provide some input too.

bdice commented 3 years ago

Adding to @emarinri’s comment:

> An advantage of using GSD is that users could use the libraries MDAnalysis or Freud for analysis?

Yes, and the GSD Python API makes it straightforward to integrate with other applications through NumPy arrays.

ejmaginn commented 3 years ago

This sounds like a very good idea, as it would make analysis much easier.

Ed


rsdefever commented 3 years ago

@bdice @joaander thanks for the input. Very helpful.

Any thoughts on having our own schema vs. using HOOMD schema?

I need to dig back and see if I can find some old code...I started looking at this idea a year or so ago.

joaander commented 3 years ago

With your own schema, you could support multiple boxes, define units in the schema, and otherwise store all and only the data you want to in the way you want to. However, this comes at the cost of the time to develop the schema and it will not work out of the box with OVITO, MDAnalysis, VMD, and other packages that support the HOOMD schema. As long as you implement a reader for your schema in Python, it will work with Freud which is file-format agnostic. After you have a documented and tested schema, you will be able to work with tool developers to add support for it.
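To make the tradeoff concrete, here is a purely hypothetical sketch of what a multi-box layout could look like. GSD stores named data chunks per frame, and a schema is essentially a documented naming convention for those chunks. The `box<i>/` prefix, the `units/` names, `configuration/n_boxes`, and the helper `frame_to_chunks` are all invented for illustration. Only the unprefixed names (`configuration/step`, `configuration/box`, `particles/N`, `particles/position`) come from the HOOMD schema, and the unit value is just an example.

```python
def frame_to_chunks(step, boxes, units):
    """Flatten a multi-box frame into {chunk_name: data} (hypothetical layout).

    boxes: list of dicts with "box" ([Lx, Ly, Lz, xy, xz, yz]) and
           "position" (list of [x, y, z]) for each simulation box.
    units: dict such as {"length": "angstrom"}, stored once per frame.
    """
    chunks = {
        "configuration/step": step,
        "configuration/n_boxes": len(boxes),  # invented chunk name
    }
    for name, value in units.items():
        chunks[f"units/{name}"] = value  # invented chunk names
    for i, box in enumerate(boxes):
        prefix = f"box{i}"  # invented per-box prefix
        chunks[f"{prefix}/configuration/box"] = list(box["box"])
        chunks[f"{prefix}/particles/N"] = len(box["position"])
        chunks[f"{prefix}/particles/position"] = [list(p) for p in box["position"]]
    return chunks


chunks = frame_to_chunks(
    step=1000,
    boxes=[
        {"box": [30.0, 30.0, 30.0, 0.0, 0.0, 0.0], "position": [[0.0, 0.0, 0.0]]},
        {"box": [60.0, 60.0, 60.0, 0.0, 0.0, 0.0], "position": []},
    ],
    units={"length": "angstrom"},
)
```

A file laid out this way would not be readable by tools that expect the HOOMD schema, which is exactly the cost described above.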

That being said, I did design GSD with the ability to support multiple schemas. If the tradeoffs are acceptable to you, I would welcome a pull request with documentation that defines your schema and, if you want, a Python API to read and/or write it.

jshahOSU commented 3 years ago

I think it is a good idea. I believe the output format being referred to here is that for the trajectory file. If not, then the checkpoint file writing and reading would need to be rewritten, which will break backward compatibility. Not sure if the read_old option is still being supported. If so, we need to write a parser that extracts the last configuration from the GSD output.
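A parser for the last configuration could be quite small if the trajectory follows the HOOMD schema. The sketch below assumes frames with the attribute names used by the `gsd.hoomd` Python module and a trajectory object that supports `len()` and integer indexing. `last_configuration` is a hypothetical helper name, and stand-in frames are used so it runs without gsd installed.

```python
from types import SimpleNamespace


def last_configuration(trajectory):
    """Return (box, type_names, positions) for the final frame.

    trajectory: any sequence of HOOMD-schema-like frames that supports
    len() and integer indexing.
    """
    if len(trajectory) == 0:
        raise ValueError("trajectory contains no frames")
    frame = trajectory[len(trajectory) - 1]
    names = [frame.particles.types[t] for t in frame.particles.typeid]
    positions = [[float(c) for c in p] for p in frame.particles.position]
    box = [float(v) for v in frame.configuration.box]
    return box, names, positions


def _frame(positions, typeid, length):
    # Stand-in for a gsd.hoomd frame (assumption: same attribute names).
    return SimpleNamespace(
        particles=SimpleNamespace(
            N=len(positions), position=positions,
            types=["A", "B"], typeid=typeid),
        configuration=SimpleNamespace(
            box=[length, length, length, 0.0, 0.0, 0.0]),
    )


traj = [
    _frame([[0.0, 0.0, 0.0]], [0], 20.0),
    _frame([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], [1, 0], 21.5),
]
box, names, positions = last_configuration(traj)
```

This only recovers the box, particle types, and coordinates. A real restart would also need whatever else the checkpoint file stores.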

Does GSD contain box information? That would be useful for NPT and GEMC-NPT simulations. Are there converters to other file formats such as PDB, xtc (GROMACS), etc.?
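On the box question: in the HOOMD schema each frame carries `configuration/box` as six numbers `[Lx, Ly, Lz, xy, xz, yz]`, so a per-frame volume for NPT or GEMC-NPT analysis follows directly. A minimal sketch, with `box_vectors` and `box_volume` as hypothetical helper names:

```python
def box_vectors(box):
    """Return the three cell vectors for a HOOMD-style box.

    box = [Lx, Ly, Lz, xy, xz, yz], where xy, xz, yz are tilt factors.
    """
    Lx, Ly, Lz, xy, xz, yz = (float(v) for v in box)
    return [
        [Lx, 0.0, 0.0],
        [xy * Ly, Ly, 0.0],
        [xz * Lz, yz * Lz, Lz],
    ]


def box_volume(box):
    """Cell volume; the cell matrix is triangular, so this is Lx * Ly * Lz."""
    a1, a2, a3 = box_vectors(box)
    return a1[0] * a2[1] * a3[2]


# Stand-in box values for two frames of a fluctuating-volume run.
frame_boxes = [
    [10.0, 10.0, 10.0, 0.0, 0.0, 0.0],
    [12.0, 10.0, 10.0, 0.5, 0.0, 0.0],
]
volumes = [box_volume(b) for b in frame_boxes]
```

The tilt factors change the shape of the cell but not its volume, which is why only the three lengths enter the result.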


MaginnGroup / Cassandra

Implement GSD output file for Cassandra #100
