OpenChemistry / chemicaljson

Development of the Chemical JSON data representation

Specification not in sync with avogadrolibs implementation #15

Open awvwgk opened 2 years ago

awvwgk commented 2 years ago

The specification provided for chemical JSON in this repository seems to be behind the implementation actually supported in https://github.com/OpenChemistry/avogadrolibs/blob/master/avogadro/io/cjsonformat.cpp.

Where can implementors of chemical JSON find an up-to-date specification, or is the implementation in avogadrolibs the source of truth for this format?

ghutchis commented 2 years ago

Great question - I saw your earlier tag. Partly, this has been due to lack of time. (I'm behind on writing documentation for Avogadro 2 as well. 😄 ) Partly, it's been not knowing the best mechanism to encode the spec (e.g., using something formal like Pydantic.)

A somewhat more readable "ground truth" is in https://github.com/OpenChemistry/avogadro-cclib/blob/main/cclibScript.py, which I'm working hard to keep up-to-date and merge into cclib as well.

I've also been discussing other attributes with the cclib team.

A few things that are definitely missing:

Right now, it supports IR / Raman, electronic, and NMR spectra, but only in a limited way because plotting has been somewhat complicated in Avogadro 2.

Suggestions / discussion are definitely welcome.

ghutchis commented 2 years ago

The format is used by other code, particularly the MongoChem web components, but these are mostly extensions or reverse-engineering the work done for the web components.

awvwgk commented 2 years ago

I would prefer a spec to implement against instead of another implementation, but I'll work with what is available.

total charge / total spin

Only half of the formats xtb supports via mctc-lib have this (Turbomole coords, CTFiles like mol/sdf, QCSchema JSON, QChem molfiles, Gaussian external inputs), which is why we have overrides via command-line arguments or .CHRG / .UHF files, so it's not a big issue. Of course, we lose the information when writing in the same format, since xtb preserves the input format.

berquist commented 2 years ago

using something formal like Pydantic

JSON Schema (either from the spec or as produced by Pydantic) has some annoying limitations, but overall my experience generating schemas and JSON blobs via Pydantic is great, and certainly better than hand-writing JSON.

ghutchis commented 2 years ago

I would prefer a spec to implement against instead of another implementation, but I'll work with what is available.

In my experience, both a spec, and an implementation and/or live examples are critical, but I understand completely.

I just discovered https://jsontopydantic.com which should help writing a more formal spec and Pydantic model. Given my workload this week and a conference next week, this may take a while, but I'm happy to answer questions in the meantime.

awvwgk commented 2 years ago

I wonder what is supposed to happen if a chemical JSON such as

{"chemical json": 2, "chemicalJson": 1}

is provided: should the reader reject it, or accept it as Avogadro 2 currently does? This behaviour is somewhat inconsistent with

{"chemical json": 1, "chemicalJson": 2}

which is going to fail due to the version number specified.

The spec doesn't mention that there are alternative keys, or how an implementation should deal with duplicated entries.

ghutchis commented 2 years ago

In general in JSON, non-matching keys are ignored. So a key of "chemical json" is irrelevant, because it's not "chemicalJson".

At the moment, there are not alternative keys and I'm not sure that would be a good idea.

awvwgk commented 2 years ago

At the moment, there are not alternative keys and I'm not sure that would be a good idea.

This is not true if I check the actual implementation at

https://github.com/OpenChemistry/avogadrolibs/blob/55dec1df36c574a456d18735032b8e29928a28b0/avogadro/io/cjsonformat.cpp#L95-L101

which allows both "chemical json" and "chemicalJson" as the key for the version number; the same holds for several other keys.
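
The different outcomes for the two examples above are what one would expect if the reader looks up the camelCase key first and only falls back to the spaced key when it is absent. The sketch below illustrates that lookup order in Python. It is not a transcription of the C++ code; the function names and the set of accepted version numbers are assumptions made for the example.

```python
import json

# Assumption for illustration: the versions this hypothetical reader accepts.
SUPPORTED_VERSIONS = {0, 1}


def find_version(doc):
    """Return the version entry, preferring the camelCase key."""
    if "chemicalJson" in doc:
        return doc["chemicalJson"]
    if "chemical json" in doc:  # legacy spelling, used only as a fallback
        return doc["chemical json"]
    raise ValueError('no "chemicalJson" key found')


def check_version(text):
    """Parse a JSON string and validate its Chemical JSON version."""
    version = find_version(json.loads(text))
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported Chemical JSON version: {version}")
    return version
```

With this lookup order, the first document passes because "chemicalJson": 1 wins and the spaced key is never inspected, while the second fails on "chemicalJson": 2 even though the spaced key holds a supported version.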

ghutchis commented 2 years ago

Dealing with spaces in JSON keys is more cumbersome in JavaScript as well: https://stackoverflow.com/a/4968448/131896

molecule.atoms vs molecule["vibrational modes"]

ghutchis commented 2 years ago

This is not true if I check the actual implementation

I think that's due to some unfortunate backwards compatibility. It's a bad idea and I don't recommend it.

As I mentioned, Avogadro now always writes Cartesian coordinates and cell vectors even with periodic systems. (Very old versions would convert to fractional and write parameters .. but you drop useful information.)

Yes, it's why a spec is needed vs. an implementation that tries to ensure backwards compatibility with not-quite compatible files.

awvwgk commented 2 years ago

Yes, it's why a spec is needed vs. an implementation that tries to ensure backwards compatibility with not-quite compatible files.

This is good to know, since I got the impression that, as an implementor, I have the choice of which key to write but always have to support both variants when reading. So the camelCase ones are the correct ones? (I should then go and change this in my implementation.)

ghutchis commented 2 years ago

My general view on parsers is flexible on read, but correct on write. (This is motivated particularly by "PDB" files. 🙄)

Let's also agree on properties => "totalCharge" and properties => "totalSpinMultiplicity" and I promise I'll get that into avogadrolibs tonight.
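
A minimal sketch of a writer that follows this agreement, in Python. The "totalCharge" and "totalSpinMultiplicity" keys under "properties" are the ones proposed above; the helper name make_cjson and the atoms layout (atomic numbers plus a flat list of Cartesian coordinates) are assumptions for illustration and should be checked against the spec.

```python
import json


def make_cjson(numbers, coords, total_charge=0, total_spin_multiplicity=1):
    """Build a small Chemical JSON-style dict (hypothetical helper)."""
    if total_spin_multiplicity < 1:
        raise ValueError("spin multiplicity must be at least 1")
    return {
        "chemicalJson": 1,
        "atoms": {
            # Assumed layout: atomic numbers and flattened x, y, z values.
            "elements": {"number": list(numbers)},
            "coords": {"3d": [x for xyz in coords for x in xyz]},
        },
        "properties": {
            "totalCharge": total_charge,
            "totalSpinMultiplicity": total_spin_multiplicity,
        },
    }


# Hydroxide anion: charge -1, singlet.
hydroxide = make_cjson([8, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 0.96)], total_charge=-1)
text = json.dumps(hydroxide, indent=2)
```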

awvwgk commented 2 years ago

Let's also agree on properties => "totalCharge" and properties => "totalSpinMultiplicity" and I promise I'll get that into avogadrolibs tonight.

totalCharge and totalSpinMultiplicity are fine with me.

My general view on parsers is flexible on read, but correct on write.

I have to disagree on this one: every flexibility allowed in the parser will get used at some point by somebody and is then expected to be supported indefinitely. Even worse, future implementations are expected to support the same extensions (like this SDF stuff https://github.com/grimme-lab/mctc-lib/issues/49). A parser should enforce the correctness of the format, rigorously.
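
A sketch of what rigorous reading could look like, in Python. It rejects duplicated keys, which json.loads would otherwise resolve silently by keeping the last value, and it rejects the spaced legacy spelling of the version key. The names read_strict and LEGACY_KEYS are hypothetical, and which keys count as legacy beyond "chemical json" is left open here.

```python
import json

# Hypothetical table of legacy spellings and their canonical replacements.
LEGACY_KEYS = {"chemical json": "chemicalJson"}


def _reject_duplicates(pairs):
    """object_pairs_hook that fails on a repeated key within one object."""
    obj = {}
    for key, value in pairs:
        if key in obj:
            raise ValueError(f"duplicated key: {key!r}")
        obj[key] = value
    return obj


def read_strict(text):
    """Parse a document, refusing duplicates and legacy key spellings."""
    doc = json.loads(text, object_pairs_hook=_reject_duplicates)
    if not isinstance(doc, dict):
        raise ValueError("top-level value must be a JSON object")
    for legacy, canonical in LEGACY_KEYS.items():
        if legacy in doc:
            raise ValueError(
                f"legacy key {legacy!r} is not accepted; use {canonical!r}"
            )
    if "chemicalJson" not in doc:
        raise ValueError('missing required key "chemicalJson"')
    return doc
```

Under this reader, both example documents with mixed "chemical json" and "chemicalJson" keys are rejected, so the outcome no longer depends on which of the two keys holds the supported version number.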

awvwgk commented 2 years ago

Thanks for https://github.com/OpenChemistry/avogadrolibs/pull/1058; however, this doesn't solve the issue that avogadrolibs implements something other than what is documented here.

ghutchis commented 2 years ago

I'm aware - I'm at a conference this week, so my time to write a spec is limited. I will certainly write up a more complete current spec in the next few weeks.

ghutchis commented 2 years ago

Fortunately, it was a bit easier to push things into Pydantic than I had feared. Here's a start, which will turn into a pull request: https://gist.github.com/ghutchis/3a31e5b597dafb15de9ea048cad35268

Yes, some of these examples are really old and need to be removed.