jamesrhester commented 3 months ago

As initially noted in #11 by @vaitkus

It is currently unresolved how software should interpret a multi-block data file when one or more of those blocks has been written against a separate dictionary that defines a category in a different way. Two motivating examples:

1. A multi-temperature dataset where high temperatures are single crystal measurements, and low temperatures are powder diffraction measurements due to a destructive phase transition. The complete data set contains some data blocks written against core CIF, and some written against pdCIF, where pdCIF imports core CIF and adds extra key datanames to REFLN and changes how some REFLN data names are calculated. The two REFLN categories can't be directly concatenated as we would normally do.
2. A multi-temperature magnetic powder diffraction experiment. Some data blocks will involve pdCIF, some will involve magCIF, both of which independently import coreCIF and both of which affect how some quantities are calculated. Even if we can concatenate category loops, the derivation of the data names varies.

Our multi-block work has assumed a single dictionary, or multiple non-overlapping dictionaries, orchestrating the merging of separate blocks into a single relational model. Those situations are covered well.

What we need to do is to develop principles for handling these more complex situations.

jamesrhester commented 3 months ago

Some assorted ideas to work towards a solution. The goal is to merge blocks so that we end up with a single set of relational tables (at least notionally) and no ambiguity in the way in which data name values were derived (there should already be no ambiguity in how the values are used downstream).

1. We could dictate that, if categories or any pre-existing data names are changed by a dictionary, then those categories cannot be merged but are instead considered distinct.
   - This complicates the downstream use of values: F_meas is (or should be) used identically regardless of which version of the category it comes from. How software plans to use the information will guide whether or not it merges categories.
2. We cannot simply require that all data blocks must conform to a single set of non-overlapping dictionaries, as this fails in the pdCIF/magCIF example because there are two equally valid choices of REFLN and both dictionaries are required.
3. We can explicitly define data names that signal how a data name was calculated. While we don't normally like this type of data name, we are already doing this implicitly when we derive F_calc in a loop containing items from the powder CIF dictionary differently from the core CIF case. More concretely, if we add a _structure.calculation_formalism data name to the new STRUCTURE, then an imaginary dREL routine for F_calc could switch calculation methods depending on whether or not the value was based on a powder measurement.
4. Maybe we can merge loops from the same category but different dictionaries by first expanding them to have the same key data names, filling in missing values with some default. The approach of (3) would then need to capture everything that was specific to a particular dictionary (not an unworthy goal).
5. In a variation of (2), we could define some priority in our AUDIT_CONFORM dictionary list, so that all data blocks are written according to the master dictionary that results from merging dictionaries in the given priority.
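
As a rough illustration of the priority idea above, a master dictionary could be built by overlaying definitions in priority order. This is only a sketch: representing a dictionary as a plain mapping from data name to definition, the function name `build_master_dictionary`, the highest-first ordering of the priority list, and the placeholder definition strings are all assumptions made for the example, not anything specified by DDLm or AUDIT_CONFORM.

```python
def build_master_dictionary(dictionaries, priority):
    """Overlay dictionaries so that higher-priority definitions win.

    dictionaries: hypothetical mapping of dictionary name to
                  {data_name: definition}
    priority:     dictionary names, highest priority first (assumed order)
    """
    master = {}
    # Apply the lowest priority first so later updates overwrite it.
    for name in reversed(priority):
        master.update(dictionaries[name])
    return master


# Placeholder definitions, for illustration only.
dictionaries = {
    "core": {
        "_refln.F_meas": "core definition of F_meas",
        "_refln.F_calc": "core derivation of F_calc",
    },
    "pdCIF": {
        "_refln.F_calc": "powder derivation of F_calc",
    },
}

master = build_master_dictionary(dictionaries, priority=["pdCIF", "core"])
for data_name, definition in sorted(master.items()):
    print(data_name, "->", definition)
```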

Other ideas welcome, the simpler the better.

jamesrhester commented 1 month ago

Right: here is my starting proposal.

Rule

Whenever a collection of data blocks separately conform to different dictionaries, all data blocks must include their own AUDIT_CONFORM loop.

Technical Interpretation

Conformance

The AUDIT_CONFORM category, when merged, acts as though each row has a pointer to the block it was contained in. Similarly, when determining the dictionary conformance of any given row of a merged category, a child pointer to the block id of the block from which that row came is considered present. Thus, the conformance for each row in the merged category is well-defined.
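
The conformance rule can be pictured with a small sketch. It assumes blocks are held as plain Python dictionaries; the `block_id` column stands in for the notional pointer described above and is not a real data name, and the block names, dictionary labels and reflection values are purely illustrative.

```python
def merge_with_conformance(blocks):
    """Merge AUDIT_CONFORM and REFLN rows, tagging each with its block.

    blocks: hypothetical mapping of block id to
            {"audit_conform": [dictionary names], "refln": [row dicts]}
    """
    audit_conform = []
    refln = []
    for block_id, content in blocks.items():
        for dict_name in content["audit_conform"]:
            audit_conform.append(
                {"_audit_conform.dict_name": dict_name, "block_id": block_id}
            )
        for row in content["refln"]:
            # The notional child pointer back to the originating block.
            refln.append({**row, "block_id": block_id})
    return audit_conform, refln


def conforming_dictionaries(row, audit_conform):
    """Return the dictionaries that a merged row conforms to."""
    return [
        entry["_audit_conform.dict_name"]
        for entry in audit_conform
        if entry["block_id"] == row["block_id"]
    ]


blocks = {
    "high_T": {
        "audit_conform": ["core"],
        "refln": [{"_refln.index_h": "1", "_refln.index_k": "0", "_refln.index_l": "0"}],
    },
    "low_T": {
        "audit_conform": ["pdCIF"],
        "refln": [{"_refln.index_h": "1", "_refln.index_k": "1", "_refln.index_l": "0"}],
    },
}

audit_conform, refln = merge_with_conformance(blocks)
for row in refln:
    print(row["block_id"], conforming_dictionaries(row, audit_conform))
```

The pointer is used only to answer the conformance question; nothing else in the merged tables needs to consult it.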

Missing keys

All merged data blocks have key data names that are the union of all key data names defined by conforming dictionaries describing that category. The value of a missing key data name, in those loops that do not contain it, is `.` (nothing). `.` is preferred to leaving the key data name absent, so that identical key value sets remain identical and automatic dREL lookups using the full set of dREL key values for a given row are successful.
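
A minimal sketch of the key expansion, assuming loops are held as lists of Python dictionaries. The function name `merge_loops` is made up for the example, `_refln.phase_id` is used as a hypothetical stand-in for whatever extra key pdCIF adds to REFLN, and the values are invented.

```python
CIF_NULL = "."  # value given to key data names a loop does not contain


def merge_loops(loops, key_names_by_dictionary):
    """Merge loops of one category, expanding every row to the full key set.

    loops: list of (dictionary name, list of row dicts)
    key_names_by_dictionary: dictionary name -> key data names it defines
    """
    # Union of key data names, keeping first-seen order.
    all_keys = []
    for dict_name, _ in loops:
        for key in key_names_by_dictionary[dict_name]:
            if key not in all_keys:
                all_keys.append(key)

    merged = []
    for _, rows in loops:
        for row in rows:
            full_row = dict(row)
            for key in all_keys:
                # Fill keys absent from this loop with "." rather than omit them.
                full_row.setdefault(key, CIF_NULL)
            merged.append(full_row)
    return all_keys, merged


hkl = ["_refln.index_h", "_refln.index_k", "_refln.index_l"]
key_names_by_dictionary = {
    "core": hkl,
    "pdCIF": hkl + ["_refln.phase_id"],  # hypothetical extra key
}
core_rows = [
    {"_refln.index_h": "1", "_refln.index_k": "0", "_refln.index_l": "0",
     "_refln.F_meas": "12.3"},
]
pd_rows = [
    {"_refln.index_h": "1", "_refln.index_k": "0", "_refln.index_l": "0",
     "_refln.phase_id": "alpha", "_refln.F_meas": "11.9"},
]

keys, merged = merge_loops(
    [("core", core_rows), ("pdCIF", pd_rows)], key_names_by_dictionary
)
# Every row now yields a complete key tuple, so lookups by full key work.
key_tuples = [tuple(row[k] for k in keys) for row in merged]
print(key_tuples)
```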

Comments

We try at all costs to avoid bringing block pointers into the merged view of multiple data blocks, as data blocks have no existence in the merged data set and so are meaningless, except where conformance is concerned.

COMCIFS / MultiBlock_Dictionary

How are multi-block files combined when written against separate dictionaries? #13
