ordering of lists in Molecule schema

loriab commented 5 years ago

If the rules are:

order of atoms in topology schema is absolute and may not be reshuffled and
lists are inherently ordered

then array fields like "masses" are fine b/c atoms are in order.

What of bonds and fragments, then? https://github.com/MolSSI/QCSchema/blob/415340821abfbb5fc72b4949576fe08762474dcb/qcschema/dev/molecule.py#L81-L106 I very much recognize that fragmenters and n-body drivers will have their own systems for defining and indexing fragments (and that fragment_multiplicities and fragment_charges must follow along) that must not be disturbed, but is there any reason not to require this field be sorted (e.g., [[5, 0], [4, 1, 3], [2]]) for ease of comparison? Same (and stronger, imo) case for sorting "connectivity" field.

Maybe beyond QC there's a good reason to leave these free-ordered? Should two json molecules whose schema differ by only [[5, 0], [4, 1, 3], [2]] vs. [[2], [5, 0], [4, 1, 3]] resolve to the same hash?

dgasmith commented 5 years ago

I guess on the flip side is there a reason to require ordering here and who does that enforcing since JSON Schema specs do not enforce it. I am hesitant to enforce requirements above and beyond what the base schema provides as it increases implementation details for adopters.

I think molecule hashing is a question for QCFractal itself and not particularly specific to the Schema itself.

langner commented 5 years ago

I don't quite understand the problem. If the ordering is unconstrained but fully specified, isn't possible to re-order whenever needed, for hashing of whatever else?

loriab commented 5 years ago

Taking a step back, here's three choices. Hope this makes more sense.

For the examples, consider a H2/He system where "QC" is a quantum chemistry program like psi that can't handle non-contiguous fragments and "FQC" is a fancy program that can.

(a) Fragment order doesn't matter to mol identity, fragments sorted for Schema
- this is what I was pushing for in original post, with the assumption that (a) vs. (b) was the choice. this (a) has some aesthetic advantages, but they're not pressing. happy to give up on (a).
- easy at-a-glance comparison of schema instances
- no need for schema processors to reorder frags
- visualizers will always show fragment A or bond B0 as same color
- examples
- [[0, 1], [2]] — Schema happy. QC happy. FQC happy. Both: {A: H2, B: He}
- [[2], [0, 1]] — Schema violated (though unenforceable to schemavalidator) as frags unsorted
(b) Fragment order doesn't matter to mol identity, Fragment order unconstrained for Schema
- I think this is what the schema is representing now, where fragment order matters only to correlate fragments across fragments, fragment_multiplicities, and fragment_charges fields. I think this'll work for a long time, as most QC programs aren't reporting per-fragment data (e.g., F-SAPT). As soon as programs do report per-frag data, this (b) may become problematic for the same reason that atom reshuffling is already recognized as problematic. Hence, my new question is (b) vs. (c).
- examples
- [[0, 1], [2]] — Schema happy. QC happy. FQC happy. Both: {A: H2, B: He}
- [[2], [0, 1]] — Schema happy. QC can sort fragments into [[0, 1], [2]] {A: H2; B: He}. FQC happy {A: He; B: H2} Conflict!
(c) Fragment order does matter to mol identity
- new rule that like atom order, schema processors must not reorder frags
- visualizers will always show fragment A or bond B0 as same color
- examples
- [[0, 1], [2]] — Schema happy. QC happy. FQC happy. Both: {A: H2, B: He}
- -- diff sys --
- [[2], [0, 1]] — Schema happy. QC stop as atoms out of order. FQC happy {A: He, B: H2}

dgasmith commented 5 years ago

Currently for QCArchive we use c) where order does matter to the most strict molecular identity due to the reasons mentioned above (there are less strict which this does not matter). I do agree that a) is the best scenario.

If we do a) the main question to me is how do we enforce this? At the moment we can say use json-schema draft4 validators or higher and everything is great. Stepping off json-schema validators we will need to say draft4 + extra stuff which starts to get complex. Saying that this is the "recommended" way of organizing fragments will likely cause odd conflicts in the future. Not sure I have a good answer here.

I do recommend trying to separate identity and the schema. QCA has multiple identity tags depending on the required precision and is fairly specific to the QCA framework. It would great to specify a generic identity here, but I believe would require much more input and work than the schema is currently receiving.

loriab commented 5 years ago

I agree that "draft4 + extra stuff" is to be avoided. I guess I consider that not much of the schema beyond structure is enforced now. There's nothing to prevent geometry from being (2 * nat,) or atomic number from being -3 or fragments from 1-indexing atoms or symbols from being all-caps or two atoms from being co-spatial. All those checks are part of the validation step in code (the extension of which sparked this issue), and I figured enforcement of (a) would be similar.

EDIT: That is, there's extra spec in the "description" subfield, and (a)/(b)/(c) would be similar.

I think (c) is best for future-proofing. (a) takes advantage of "fragmentN" being a dummy variable of sorts, but I can imagine it scaring nbody writers. (b) is what I think the Schema intends now, and it looks dangerous.

(Trying to keep identity and schema separated ...) The schema contains info on what the consumer can do to the molecule (e.g., if fix_com=True, then COM mustn't shift). There's also un-written info like consumer must not attach this molecule to results containing per-atom arrays (like gradients) if atoms have been reordered in the calculation thereof (recall that I guessed wrongly on this interpretation). Wherever that un-written rule goes, I just think that the analogous rule about fragments (diff btwn (b) and (c)) should also be clarified.

langner commented 5 years ago

Is there a place that collects these "un-written rules"? Maybe formalizing these conventions that a little would help with topics like these, where this is some expectation of use but it doesn't feel like a schema should enforce it outright.

cryos commented 5 years ago

I think the schema could only ever go so far, and we see similar issues with other formats, i.e. referring to an index that is out of range, or a unique id that doesn't exist. Schema gets you so far, then document expected use. I think order should absolutely matter as other options end up blowing up the need for unique ids, which also may not exist but as a security blanket of sorts.

I would go (c) all the way, and at some point would like to take a stab at a deeper validation of files once I am done with our implementation of the spec for Avogadro. Atom reordering, index reordering is tough. How would you hash even if things are ordered, remove all space, and ensure same numerical accuracy somehow? Hashing of JSON is tough as object order is not guaranteed, especially considering all the parsers, numerical accuracy, etc.

dgasmith commented 5 years ago

@cryos Thanks for the input. Do you have examples where schema document expected use? As @loriab implied we are already doing that with the (2 * nat, ) language. It would be best if we could represent this in a better way.

As per hashing JSON molecule representations, it is certainly doable but not particularly transferable outside of a single piece of code which is why it does need to remain isolated outside the schema. Numerical accuracy is the fun one where you have to pin rounding rules and (my favorite) flip zeros so that they are all negative/positive and also potentially orientating systems to a common frame. Feel free to ping me if you want to spin off that discussion.

cryos commented 5 years ago

I don't have one off the top of my head, I will try to take a better look when I have time to come up for air - so many deadlines these next few days...

loriab commented 5 years ago

It sounds like we don't know where a schema guru would go looking for "use guidelines". But is my impression that we're settling toward (c) correct? Any reason not to plop 'em into molecule.py (both atom-ordering and frag-ordering guidance) until "correct" location identified?

Back to 'connectivity' in https://github.com/MolSSI/QCSchema/blob/master/qcschema/dev/molecule.py#L81-L94), I withdraw my first-post wish to constrain to sorted (equiv. of (a)). Do we agree that the schema's intent for bonds is (b)? (I think anyone using QC data simply expects bond/angle/etc. ordering to differ between programs and geometries.) And am I misreading the connectivity snippet or is the bond between atoms 6 & 7 going to fail schema validation?

MolSSI / QCSchema

ordering of lists in Molecule schema #55

(a) Fragment order doesn't matter to mol identity, fragments sorted for Schema

(b) Fragment order doesn't matter to mol identity, Fragment order unconstrained for Schema

(c) Fragment order does matter to mol identity