Open jamesrhester opened 3 months ago
Some assorted ideas to work towards a solution. The goal is to merge blocks so that we end up with a single set of relational tables (at least notionally) and no ambiguity in the way in which data name values were derived (there should already be no ambiguity in how the values are used downstream).
1. We could dictate that, if categories or any pre-existing data names are changed by a dictionary, then those categories cannot be merged but are instead considered distinct. `F_meas` is (or should be) used identically regardless of which version of the category it comes from. How software plans to use the information will guide whether or not it merges categories.
2. We cannot simply require that all data blocks conform to a single set of non-overlapping dictionaries, as this fails in the pdCIF/magCIF example: there are two equally valid choices of `REFLN` and both dictionaries are required.
3. We can explicitly define data names that signal how a data name was calculated. While we don't normally like this type of data name, we are already doing this implicitly when we derive `F_calc` in a loop containing items from the powder CIF dictionary differently to the core CIF case. More concretely, if we add a `_structure.calculation_formalism` data name to the new `STRUCTURE`, then an imaginary dREL routine for `F_calc` could switch calculation methods based on whether or not the value was based on a powder measurement.
4. Maybe we can merge loops from the same category but different dictionaries by first expanding them to have the same key data names, filling in missing values with some default. The approach of (3) would then need to capture everything that was specific to a particular dictionary (not an unworthy goal).
5. In a variation of (2), we could define some priority in our `AUDIT_CONFORM` dictionary list, so that all data blocks are written according to the master dictionary that results from merging dictionaries in the given priority.
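
The priority-ordering idea above could look something like the following Python sketch. It is only an illustration under stated assumptions: dictionaries are modelled as plain mappings from data name to a toy definition, the names `core_toy` and `powder_toy`, the `method` strings and the `merge_by_priority` helper are all hypothetical, and it assumes that dictionaries listed earlier take priority.

```python
def merge_by_priority(dictionaries):
    """Build a master dictionary from (name, definitions) pairs.

    Assumption: pairs are listed from highest to lowest priority, so the
    first definition seen for a data name wins.
    """
    master = {}
    for dict_name, definitions in dictionaries:
        for data_name, definition in definitions.items():
            if data_name not in master:
                # Record which dictionary supplied the winning definition.
                master[data_name] = {"defined_by": dict_name, **definition}
    return master


# Toy stand-ins for real dictionaries (hypothetical content).
core_toy = {
    "_refln.f_calc": {"method": "core-style calculation"},
    "_refln.f_meas": {"method": "measured"},
}
powder_toy = {
    "_refln.f_calc": {"method": "powder-style calculation"},
}

# Priority order as it might be read from an AUDIT_CONFORM list.
master = merge_by_priority([("powder_toy", powder_toy), ("core_toy", core_toy)])
print(master["_refln.f_calc"]["defined_by"])
print(master["_refln.f_meas"]["defined_by"])
```

With the powder toy dictionary listed first, `_refln.f_calc` takes the powder definition, while `_refln.f_meas`, which only the core toy dictionary defines, is still present in the master dictionary.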
Other ideas welcome, the simpler the better.
Right: here is my starting proposal.
`AUDIT_CONFORM` loop.
The `AUDIT_CONFORM` category, when merged, acts as though each row has a pointer to the block it was contained in. Similarly, when determining the dictionary conformance of any given row of a merged category, a child pointer of the block id from whence that row came is considered present. Thus, the conformance for each row in the merged category is well-defined.
All merged data blocks have key data names that are the union of all key data names defined by the conforming dictionaries describing that category. For loops that do not contain a given key data name, its value is `.` (nothing). `.` is preferred to a missing value so that identical key value sets remain identical and automatic dREL lookups using the full set of dREL key values for a given row are successful.
We try to avoid at all costs bringing block pointers into the merged view of multiple data blocks, as data blocks have no existence in the merged data set and so are meaningless, except where conformance is concerned.
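
As a rough sketch of the key-union rule in this proposal, the Python below merges one category from two blocks. Everything named here is an assumption for illustration: the `merge_category` helper, the block layout, the `core_toy` and `toy_ext` dictionaries and the extra key `_refln.toy_id` are hypothetical and not taken from any real dictionary. Conformance is kept in a list parallel to the rows rather than as a column, so that no block pointer appears in the merged data itself.

```python
MISSING_KEY = "."  # value given to key data names that a loop does not contain


def merge_category(blocks, keys_by_dictionary):
    """Merge the rows of one category from several blocks."""
    # Union of key data names over every dictionary any block conforms to,
    # kept in first-seen order.
    all_keys = []
    for block in blocks:
        for dictionary in block["conforms_to"]:
            for key in keys_by_dictionary[dictionary]:
                if key not in all_keys:
                    all_keys.append(key)

    merged_rows, conformance = [], []
    for block in blocks:
        for row in block["rows"]:
            # Start with "." for every key, then overlay the actual values.
            full_row = {key: MISSING_KEY for key in all_keys}
            full_row.update(row)
            merged_rows.append(full_row)
            # Conformance of a row is that of the block it came from.
            conformance.append(tuple(block["conforms_to"]))
    return all_keys, merged_rows, conformance


hkl = ["_refln.index_h", "_refln.index_k", "_refln.index_l"]
keys_by_dictionary = {
    "core_toy": hkl,
    "toy_ext": hkl + ["_refln.toy_id"],  # hypothetical extra key
}
blocks = [
    {
        "conforms_to": ["core_toy"],
        "rows": [{"_refln.index_h": "1", "_refln.index_k": "0",
                  "_refln.index_l": "0", "_refln.f_meas": "12.3"}],
    },
    {
        "conforms_to": ["core_toy", "toy_ext"],
        "rows": [{"_refln.index_h": "1", "_refln.index_k": "0",
                  "_refln.index_l": "0", "_refln.toy_id": "a",
                  "_refln.f_meas": "11.9"}],
    },
]

all_keys, merged_rows, conformance = merge_category(blocks, keys_by_dictionary)
for row, conforms in zip(merged_rows, conformance):
    print([row[key] for key in all_keys], conforms)
```

The row from the first block gains `_refln.toy_id` with value `.`, so both rows carry the same full key set and can be told apart by it.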
As initially noted in #11 by @vaitkus
It is currently unresolved how software should interpret a multi-block data file when one or more of those blocks has been written against a separate dictionary that defines a category in a different way. Two motivating examples:
`REFLN` and changes how some `REFLN` data names are calculated. The two `REFLN` categories can't be directly concatenated as we would normally do.
Our multi-block work has assumed a single dictionary, or multiple non-overlapping dictionaries, orchestrating the merging of separate blocks into a single relational model. Those situations are covered well.
What we need to do is to develop principles for handling these more complex situations.