alchemistry / fileformat

File formats for free energy calculations, molecular simulations, etc.

Evaluate alternative formats #13

Open avirshup opened 7 years ago

avirshup commented 7 years ago

Speaking personally, the best outcome of this would be to find that someone has already solved the problems we're thinking about, or at least has a solution that can be extended to cover this project's specific application focuses (#1 and #10).

Below is a continuously updated list of other projects. Everything here should be taken with a grain of salt, as it's an attempt to glean information from many different specifications :)

MOSAIC
Formats: XML, HDF5 (with straightforward extensions based on the data model)
License: Creative Commons 3.0
Units: List of supported units in spec
Design criteria: https://mosaic-data-model.github.io/design_criteria.html
Data stored: topology, CG info, selections (i.e., subsets of the file's data), references to other "universes"; "properties" (unclear if these are whole-system properties or atomic properties?)
Specification: https://mosaic-data-model.github.io/

Rich molecule format todo

H5MD
Type: Binary (HDF5)
Self-describing: yes
Domain: molecular dynamics
Flexible units: yes
Human readable: not without HDF5 viewer
License: GPL (need to understand copyleft implications here)
Data stored: MD state data; atoms and their connectivity. Arbitrary atom lists/groups can be defined (nothing specific for chains/residues/etc.)
Specification: http://nongnu.org/h5md/

Amber NetCDF
Type: Binary (NetCDF)
Self-describing: yes
Human readable: not directly (GUI viewers available)
Flexible units: yes
Domain: biomolecular dynamics
Data stored: Trajectory (no topology)
Specification: http://ambermd.org/netcdf/nctraj.xhtml

MDTraj HDF5
Type: HDF5
Self-describing: yes
Human readable: no
Data stored: Trajectory (+ topology as a JSON string)
Flexible units: yes
Domain: biomolecular dynamics
Specification: https://github.com/mdtraj/mdtraj/wiki/HDF5-Trajectory-Format
Notes: This is an extension of the Amber NetCDF format. FF-focused topology storage (as JSON).

Chemical Markup Language (CML)
Type: XML
Human readable: yes
Self-describing: sort of - must adhere to a schema
Specification: http://www.xml-cml.org/
Flexible units: yes
Domain: small molecule modeling
Data stored: coordinates, molecular properties, topology w/ stereochemistry, calculation parameters, electronic wavefunctions, computational metadata (i.e. hostname, programVersion, etc.). No support for biomolecules or trajectories.
Notes: I like this project's aims, but there's a LOT of conceptual overhead for understanding XML schema. I don't think I've ever used software that supports CML.

PDBx/mmCIF
Type: CIF (text-based, see http://www.iucr.org/resources/cif/spec/version1.1/cifsyntax)
Self-describing: yes
Flexible units: no
Domain: Crystallography / NMR
Specification: http://mmcif.wwpdb.org/docs/tutorials/content/atomic-description.html http://mmcif.wwpdb.org/docs/tutorials/content/molecular-entities.html
Data stored: everything you'd expect in a PDB file: topology + coordinates + crystallographic metadata.
Notes: Vast improvement over the original PDB format. Medium-to-high conceptual overhead. Parsers are still hard to come by.

Chemical JSON
Type: JSON (text-based)
Self-describing: yes
Specification: http://wiki.openchemistry.org/Chemical_JSON
Notes: I think this is more of a proof-of-principle (implemented in Avogadro) than a mature spec, but interesting nonetheless. JSON is by far the easiest language here to read and write, both with machines and by hand.
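To illustrate the point about JSON being easy to read and write by machine, here is a minimal Python sketch that round-trips a small molecule record through the standard `json` module. The key names (`atoms`, `elements`, `coords`, `bonds`, `pairs`, `orders`) and the water geometry are illustrative assumptions; they are not copied from the Chemical JSON specification.

```python
import json

# Hypothetical water molecule record. The key names are assumptions made
# for illustration only, not the layout defined by Chemical JSON.
molecule = {
    "name": "water",
    "atoms": {
        "elements": [8, 1, 1],  # atomic numbers
        "coords": [             # Cartesian coordinates, assumed angstrom
            [0.000, 0.000, 0.117],
            [0.000, 0.757, -0.471],
            [0.000, -0.757, -0.471],
        ],
    },
    "bonds": {"pairs": [[0, 1], [0, 2]], "orders": [1, 1]},
}

# Writing: one call produces text a person can read and edit by hand.
text = json.dumps(molecule, indent=2)

# Reading: one call restores the same nested structure.
restored = json.loads(text)
n_atoms = len(restored["atoms"]["elements"])
```

No schema or third-party parser is involved, which is the convenience being described; the flip side is that nothing here checks the record against a specification.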

khinsen commented 7 years ago

@avirshup Two comments on your summary of Mosaic:

1) Mosaic properties are per-atom or per-site properties. Think masses, charges, force-field atom types (perhaps better served by Mosaic labels), etc.

2) The list of units in the Mosaic spec can easily be extended if necessary. The point of having it is not to limit the list of units, but to ensure a unique spelling for each one.

That said, my experience is that units in a file format are a mixed blessing. The more flexibility a format provides for units, the easier it is to write data, but at the same time it becomes harder to read data, because the reader must know all the units with their conversion factors and apply them properly. If I were to design a closed file format (i.e. fully specified without any possibility for extensions), I'd prescribe a single unit for everything. Google for "convention over configuration" for discussions of the advantages of such an approach. Since Mosaic is open-ended (properties, for example, can be anything and thus have unforeseeable units), that was not an option.
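The trade-off can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table `LENGTH_TO_NM`, the unit spellings in it, and the functions `read_flexible` and `read_fixed` are hypothetical and are not taken from Mosaic or any other format discussed here.

```python
# Hypothetical conversion table: every unit spelling a reader accepts must
# be listed here together with its factor to the reader's internal unit (nm).
LENGTH_TO_NM = {"nm": 1.0, "angstrom": 0.1, "pm": 0.001}


def read_flexible(values, unit):
    """Reader for a format that lets the writer choose the length unit."""
    try:
        factor = LENGTH_TO_NM[unit]
    except KeyError:
        # An unknown unit or an unexpected spelling makes the data unreadable.
        raise ValueError(f"unknown length unit: {unit!r}") from None
    return [v * factor for v in values]


def read_fixed(values):
    """Reader for a format that prescribes nanometres: nothing to convert."""
    return list(values)


lengths_nm = read_flexible([10.0, 20.0], "angstrom")
```

A writer that spells the unit `"Angstrom"` or `"Å"` would be rejected by `read_flexible`, which is why a format with flexible units benefits from a list that fixes one spelling per unit.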

khinsen commented 7 years ago

Also worth looking at is the CCPN data model, developed for describing NMR data on biomolecular systems. Like the Rich Molecule Format, it is designed to be used as a software API rather than as a file format, but once you get used to the concept of a data model, that becomes an implementation detail.

khinsen commented 7 years ago

An important evaluation criterion missing from the above summary is openness to extensions. As an illustration of the utility of openness, Mosaic and H5MD were designed completely independently, but both were made open to extensions. Gluing the two together was almost trivial, as the very short spec of the interface demonstrates.

CML is as open as any XML-based format, meaning that you can embed it into a higher-level format, or define a derived schema which is no longer CML but shares features with it. CIF would be easy to make open from a technical point of view, but wwPDB retains full control over its evolution (which is probably a good thing in its specific environment). All the other formats are not open as far as I know, though I didn't look at each of them in much detail.

arose commented 7 years ago

I would like to add a format we recently put into production on http://www.rcsb.org, see http://mmtf.rcsb.org/

MMTF (Macromolecular Transmission Format)
Type: MessagePack/JSON
Self-describing:
Human readable: no
Data stored: coordinates, topology, some metadata
Flexible units: no
Domain: efficient transmission and parsing of macromolecules
Specification: https://github.com/rcsb/mmtf/blob/master/spec.md

arose commented 7 years ago

Related RDKit discussion: https://github.com/rdkit/rdkit/issues/1137

avirshup commented 7 years ago

Thanks @arose! Also mentioned in the RDKit discussion you linked to is http://stuchalk.github.io/scidata/, which seems extremely relevant, particularly http://stuchalk.github.io/scidata/contexts/scidata_molsystem.jsonld