alchemistry / fileformat

File formats for free energy calculations, molecular simulations, etc.

Design decision: JSON object design #4

Open avirshup opened 7 years ago

avirshup commented 7 years ago

This is the big question: how is data laid out inside the JSON structure?

The best working example currently comes from @jchodera on slack:

{name:'molecule name',
 type:'molecule',
 provenance:{
    rcsbid:'A123'
 },
 topology:{
   [TOPOLOGY BLOCK containing atoms, bonds, residues, chains, etc.]
 },
 forcefield:{
   [OPTIONAL FORCEFIELD BLOCK]
 },
 states:{
   [ONE OR MORE DYNAMICAL STATE BLOCKS WITH PROPERTIES ATTACHED]
 }
}
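
To make the layout above concrete, here is a minimal Python sketch that fills it in as valid JSON. Only the top-level keys (name, type, provenance, topology, forcefield, states) come from the example; everything inside the topology and states blocks (`atoms`, `bonds`, `positions`, `properties`, the `state0` key) is a hypothetical placeholder, since the thread has not specified those blocks yet.

```python
import json

# Top-level keys follow the layout above. The contents of the
# topology / forcefield / states blocks are hypothetical placeholders.
molecule = {
    "name": "molecule name",
    "type": "molecule",
    "provenance": {"rcsbid": "A123"},
    "topology": {
        "atoms": [
            {"index": 0, "element": "O"},
            {"index": 1, "element": "H"},
            {"index": 2, "element": "H"},
        ],
        "bonds": [[0, 1], [0, 2]],
    },
    "forcefield": {},  # optional block, left empty here
    "states": {
        "state0": {
            "positions": [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]],
            "properties": {},
        }
    },
}

# Round-trip through a JSON string to confirm the structure serializes cleanly.
text = json.dumps(molecule, indent=2)
restored = json.loads(text)
```

Unlike the shorthand above, strict JSON requires double-quoted keys and strings and a comma between every pair of sibling members, which `json.dumps` takes care of.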

avirshup commented 7 years ago

From @jchodera on slack:

I think we need to figure out what top-level structures "everybody" can agree on are important.

The list you provide (topology, properties, forcefields, wavefunctions, geometry/dynamics) seems a bit too heterogeneous in its level of abstraction. Instead, I would suggest something simpler, like:

  • topology (anything static)
  • state information (anything dynamical)
  • tool definitions and input parameters (which could include forcefields, QM levels of theory)
  • tool-computed input properties for specific states (which could include wavefunctions as well); these could even be contained within a state definition, since they are associated with a given state
  • tool-computed input properties for the topology (which would include cheminformatics stuff, or stuff that doesn't depend on a specific state)
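
One way to picture this split is as a small set of containers, one per bullet. The following is only a sketch under stated assumptions: the class and field names (`Record`, `State`, `topology_properties`, `tools`) and the example values are hypothetical and are not part of any agreed format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class State:
    # Anything dynamical, plus the properties computed for this specific state.
    data: Dict[str, Any] = field(default_factory=dict)
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Record:
    # Anything static.
    topology: Dict[str, Any] = field(default_factory=dict)
    # Properties that do not depend on a specific state.
    topology_properties: Dict[str, Any] = field(default_factory=dict)
    # Tool definitions and input parameters (force fields, QM levels of theory).
    tools: Dict[str, Any] = field(default_factory=dict)
    # One or more dynamical states.
    states: List[State] = field(default_factory=list)


record = Record(
    topology={"atoms": ["O", "H", "H"]},
    topology_properties={"formula": "H2O"},
    tools={"forcefield": {"name": "example-ff"}},
    states=[State(data={"positions": [[0.0, 0.0, 0.0]]}, properties={"energy": None})],
)

# asdict() recurses into nested dataclasses, so the result is plain JSON data.
as_json = json.dumps(asdict(record), indent=2)
```

Nesting the state-specific properties inside each `State` follows the suggestion in the fourth bullet; keeping them in a separate top-level block keyed by state would be an equally valid reading.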

egonw commented 7 years ago

You can take advantage of standardization that has been done in the past. What about a JSON format based on the Chemical Markup Language specification? If you go JSON-LD then you have full semantics and full interoperability.
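
As a rough illustration of what JSON-LD adds: a document carries an `@context` that maps its short keys to IRIs in a shared vocabulary, so two tools can agree on what `name` or `topology` means. The sketch below assumes a placeholder vocabulary at `http://example.org/chem-vocab#`, which is not a real CML or chemistry ontology, and `expand_keys` is a deliberately naive stand-in for a real JSON-LD processor, not an implementation of the JSON-LD expansion algorithm.

```python
import json

# Placeholder vocabulary IRI (hypothetical); a real document would point
# at an agreed, published vocabulary instead.
VOCAB = "http://example.org/chem-vocab#"

doc = {
    "@context": {
        "name": VOCAB + "name",
        "rcsbid": VOCAB + "rcsbid",
        "topology": VOCAB + "topology",
    },
    "@type": VOCAB + "Molecule",
    "name": "molecule name",
    "rcsbid": "A123",
    "topology": {},
}


def expand_keys(document):
    """Naive illustration: swap each top-level term for its IRI from @context."""
    context = document["@context"]
    return {context.get(key, key): value
            for key, value in document.items()
            if key != "@context"}


expanded = expand_keys(doc)
print(json.dumps(expanded, indent=2))
```

Keys that are not listed in the context, such as the `@type` keyword, pass through unchanged in this sketch.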

avirshup commented 7 years ago

I like the idea of incorporating parts of CML. For instance, the CompChem dictionary has a lot of good descriptive fields for QM-computed properties.

@egonw - Thanks for pointing out JSON-LD; that actually seems like the solution to a problem that we haven't created an issue for yet.

Also, would you mind pointing to some use cases for CML? I've been aware of it for a while but haven't really done anything with it, so it would be great to get a feel for how it is currently used.

egonw commented 7 years ago

It's used in Bioclipse as it is the most verbose (explicit) file format, allowing us to store information we cannot store in other formats (like atom type info, which may be particularly useful when using new or custom force fields!).

Because the original CML is XML, it is also really easy to embed in other XML documents (using the XML namespace standards), e.g. with CMLRSS (doi:10.1021/ci034244p, green OA version), although this never really picked up momentum.