alchemistry / fileformat

File formats for free energy calculations, molecular simulations, etc.

Write a justification for why we need a format? #12

Open davidlmobley opened 7 years ago

davidlmobley commented 7 years ago

Maybe we should update the README.md to briefly explain why we're talking about introducing a new file format in order to make it easier to bring the community in for discussion. Right now, people coming here may not immediately see the reason for this or the real need.

Also, of course, people may just think of this XKCD and leave: https://xkcd.com/927/

Probably, @avirshup , you already have a lot of good arguments as to why we need one, so it might just be a matter of summarizing some of them. :)

davidlmobley commented 7 years ago

For example, it would probably be good to get the people connected with the NSF MolSci initiative on here (i.e. Shantenu Jha) but it would be nice to have something on the front page talking about "why" before we do so.

pdebuyl commented 7 years ago

Hello. Speaking of file formats, I'd like to point out two existing initiatives that can help.

  1. Mosaic by @khinsen "is a modular set of data models and file formats for molecular simulation." It was designed with biomolecular systems in mind and has already been used in research projects (see here for papers citing the original Mosaic paper).
  2. H5MD "is a file format specification, based upon HDF5, aimed at the efficient and portable storage of molecular data (e.g. simulation trajectories, molecular structures, …)."

H5MD is being implemented in ESPResSo and ESPResSo++, and has a basic implementation in LAMMPS. It lacks specific support for biomolecular systems, but it can be used as a storage layer in Mosaic; alternatively, a dedicated H5MD module could be created for this purpose.

A strong and specific motivation for building this new format is very important, both for your project and for others wishing to help here :-) H5MD has a mailing list if you wish to discuss this.

Disclaimer: I am an author of H5MD.

Regards, Pierre

pdebuyl commented 7 years ago

PS: H5MD includes connectivity information that allows pairs, triplets, etc. to be defined. The intended use is for force fields and/or polymer structure.

khinsen commented 7 years ago

I just discovered this project via Twitter. Not knowing anything about the background, my first question is not so much "why a new format" but "which application domain(s) will be covered?" Whether or not it is a good idea to come up with a new format depends essentially on what it is meant to be used for. This includes the type(s) of molecular system, the type(s) of simulations/analysis, the type(s) of storage (temporary, archival, ...) and the type(s) of machines/platforms that matter most.

@pdebuyl already mentioned my MOSAIC effort. I suggest everyone seriously interested in file formats should read the paper that documents MOSAIC, because it explains the reasoning behind various choices. You may disagree with my specific choices, but I think the criteria I discuss are of interest nevertheless.

Perhaps the most important design decision in MOSAIC is a two-layer approach, with a data model and several concrete implementations. Low-level file formats need to be designed for efficiency, leading to different choices for different scenarios. A visualization Web service works best with JSON or XML data, which however would be a terrible choice for storing large trajectories. The MOSAIC two-level approach permits any number of file formats with the guarantee of simple and lossless conversion between them whenever necessary.
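
To make the two-layer idea concrete, here is a minimal sketch in Python. It is not MOSAIC's actual data model or API: the `Configuration` class, the four converter functions, the toy binary layout, and the coordinates are all hypothetical and chosen only for illustration. The point is that both serializations are defined against the same in-memory model, so converting between them goes through that model and loses nothing.

```python
import json
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Configuration:
    """Hypothetical format-independent data model (layer 1)."""
    symbols: tuple      # one element symbol per atom, e.g. ("O", "H", "H")
    positions: tuple    # one (x, y, z) tuple of floats per atom


# Layer 2a: a portable text format, convenient for web services.
def to_json(conf: Configuration) -> str:
    return json.dumps({
        "symbols": list(conf.symbols),
        "positions": [list(p) for p in conf.positions],
    })


def from_json(text: str) -> Configuration:
    raw = json.loads(text)
    return Configuration(
        tuple(raw["symbols"]),
        tuple(tuple(p) for p in raw["positions"]),
    )


# Layer 2b: a toy compact binary format (a stand-in for something like HDF5).
# Layout: uint32 atom count, then length-prefixed ASCII symbols,
# then three little-endian float64 values per atom.
def to_binary(conf: Configuration) -> bytes:
    chunks = [struct.pack("<I", len(conf.symbols))]
    for sym in conf.symbols:
        encoded = sym.encode("ascii")
        chunks.append(struct.pack("<B", len(encoded)) + encoded)
    for pos in conf.positions:
        chunks.append(struct.pack("<3d", *pos))
    return b"".join(chunks)


def from_binary(blob: bytes) -> Configuration:
    (n_atoms,) = struct.unpack_from("<I", blob, 0)
    offset = 4
    symbols = []
    for _ in range(n_atoms):
        (length,) = struct.unpack_from("<B", blob, offset)
        offset += 1
        symbols.append(blob[offset:offset + length].decode("ascii"))
        offset += length
    positions = []
    for _ in range(n_atoms):
        positions.append(struct.unpack_from("<3d", blob, offset))
        offset += 24  # three 8-byte floats
    return Configuration(tuple(symbols), tuple(positions))


# Illustrative coordinates only; the values carry no physical claim.
water = Configuration(
    ("O", "H", "H"),
    ((0.0, 0.0, 0.0), (0.0957, 0.0, 0.0), (-0.024, 0.0927, 0.0)),
)

# Converting JSON -> model -> binary -> model gives back the same object.
assert from_binary(to_binary(from_json(to_json(water)))) == water
```

Adding a third format in this scheme means writing one more pair of functions against `Configuration`, rather than a converter to and from every existing format.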

avirshup commented 7 years ago

Thank you @khinsen and @pdebuyl for these extremely helpful comments. It's very heartening to see others working on this same problem, and even more heartening to see that they may have actually solved a lot of it already! So, before I do anything else for this project, I'm going to put some more work into #13.

We have begun trying to nail down the application domains in issues #1 and #10. In terms of file formats, the only issue I personally insist on is the ability to read/write files as JSON. For that reason, I really like the "one data model, many formats" idea in MOSAIC, so that users can in principle get the portability of JSON but the performance of HDF5.