Some legacy categories should be split but how?

jamesrhester commented 4 years ago

The atom_type and atom_type_scat categories contain data names that were originally all in a single atom_type category in DDL1. Some of these data names depend on both atom type and wavelength, and still others depend on the particular structure (_atom_type.number_in_cell). In the context of a single-crystal, single-wavelength, single-phase experiment it is efficient to present this information in a single category in this way, but for other experiments (e.g. multiphase powder diffraction combining neutron and X-ray measurements) there is a considerable amount of duplication even when information is split over multiple data blocks.

In DDLm, atom_type_scat is a child category of atom_type, which essentially means that some scattering information for atoms can be left out. However, the keys of atom_type and atom_type_scat must still be the same, so the situation does not improve.

jamesrhester commented 4 years ago

The root cause of the problem is that the various members of atom_type actually belong in different categories. Categories are fundamentally just the data names that share the same set of key data names. Because both 'compound.id' and 'diffrn.id' are single-valued in the classic CIF core, their child data names were not taken into account (arbitrary single value, dropped from the category).

Adding them back in doesn't help the situation, because atom_type_scat as a child category should have the same set of keys as its parent, so even if the various compound.id and diffrn.id values are split over multiple data blocks, the atom_type data names that do not actually depend on diffrn.id are still forced to take it as a key and therefore be tabulated.

Here's an idea: we extend the meaning of 'parent'/'child' to say that the current relationship only applies if all keys are linked.

vaitkus commented 4 years ago

Could you please provide an example of a situation where the "parent/child" relationship would apply and one where it would not?

jamesrhester commented 4 years ago

The problem

Let me describe the problem in some detail before describing a solution. As the problem only arises in more complex scenarios, imagine a powder diffraction experiment performed using two different radiations (neutron and X-ray) on a sample containing two different compounds (known as 'phases' in powder diffraction). Assume that we have added child key datanames of _diffrn.id (for the different radiations) and _phase.id (for the different compounds) to the atom_type and atom_type_scat categories.

To present this data, we can split every distinct value of _phase.id into a separate data block, and every distinct value of _diffrn.id into a separate data block, giving 2*2 = 4 data blocks if there are any categories that have child key data names of either of these, plus another with the "global" information. The atom_type category does have child key data names of both of these. In each block, the atom_type_scat information is identical for identical _diffrn.id (as it is independent of phase.id), and for each _diffrn.id value things like _atom_type.number_in_cell are identical for identical phase.id.

So the problem is this repetition, which is likely some failure to reach 4th or 5th normal form in relational terms. As the number of phases increases, it gets more annoying. Also, note that some of the entries in the atom_type loop are universal and could be presented once in the "global" block.
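The repetition can be counted with a minimal Python sketch. It assumes one data block per (_phase.id, _diffrn.id) pair, as described above. The identifier values and the `build_blocks` and `repetition` helpers are hypothetical and only illustrate the counting; they are not part of any CIF tooling.

```python
from collections import Counter
from itertools import product

# Hypothetical identifier values for the two-phase, two-radiation example.
phases = ["phase_1", "phase_2"]    # values of _phase.id
radiations = ["neutron", "xray"]   # values of _diffrn.id


def build_blocks(phases, radiations):
    """Create one data block per (phase, radiation) combination."""
    blocks = []
    for phase, radiation in product(phases, radiations):
        blocks.append({
            "phase_id": phase,
            "diffrn_id": radiation,
            # Scattering information depends on the radiation only.
            "scat": ("scat", radiation),
            # _atom_type.number_in_cell depends on the phase only.
            "number_in_cell": ("number_in_cell", phase),
        })
    return blocks


def repetition(blocks):
    """Count how many blocks carry each distinct piece of information."""
    counts = Counter()
    for block in blocks:
        counts[block["scat"]] += 1
        counts[block["number_in_cell"]] += 1
    return counts


blocks = build_blocks(phases, radiations)
counts = repetition(blocks)
for item, n in sorted(counts.items()):
    print(item, "appears in", n, "blocks")
```

In this sketch each scattering table is repeated once per phase and each number_in_cell table once per radiation, so adding phases increases the duplication of the scattering information.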

Possible solution

Currently the parent-child category relation is understood as meaning that separate category loops can be left outer joined on the key data names. This implies that their key data names are children of the same parent data names. One possible way to improve the situation described above is to specify that a child category is only mergeable with the parent when the key data names are related in the above way.
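As a rough illustration of the current understanding, the following Python sketch performs a left outer join of an atom_type loop with an atom_type_scat loop on a shared key. The rows are invented for illustration, and `left_outer_join` is a hypothetical helper, not a function from any CIF library.

```python
def left_outer_join(parent_rows, child_rows, key):
    """Join child rows onto parent rows by key, keeping every parent row."""
    child_by_key = {row[key]: row for row in child_rows}
    child_columns = sorted(
        {col for row in child_rows for col in row if col != key}
    )
    joined = []
    for parent_row in parent_rows:
        merged = dict(parent_row)
        child_row = child_by_key.get(parent_row[key], {})
        for col in child_columns:
            # Missing child values become None; the parent row is never dropped.
            merged[col] = child_row.get(col)
        joined.append(merged)
    return joined


# Invented rows: scattering information is supplied for C but not for H.
atom_type = [
    {"symbol": "C", "number_in_cell": 8},
    {"symbol": "H", "number_in_cell": 16},
]
atom_type_scat = [
    {"symbol": "C", "dispersion_real": 0.003},
]

merged_loop = left_outer_join(atom_type, atom_type_scat, "symbol")
for row in merged_loop:
    print(row)
```

The join only makes sense because both loops are keyed on the same values, which is why the key data names of the two categories need to be children of the same parent data names.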

In this case, an extension dictionary could add a child of _diffrn.id to atom_type_scat and a child of phase.id to atom_type, thereby rendering them incompatible and forcing them to be tabulated in separate loops. This means that a block corresponding to a single value of phase.id would contain only the atom_type loop, and a block corresponding to a single value of diffrn.id would contain only the atom_type_scat loop.

So in summary, the parent-child relation would not apply whenever the keys of the parent and child categories do not share common parents.
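The proposed rule could be checked along these lines. This is only a sketch: the key data names `_atom_type.phase_id` and `_atom_type_scat.diffrn_id` are hypothetical names for the children an extension dictionary might add, and `root` and `mergeable` are illustrative helpers rather than existing dictionary software.

```python
def root(name, parent_of):
    """Follow parent links until reaching a data name with no parent."""
    seen = set()
    while name in parent_of and name not in seen:
        seen.add(name)  # guard against accidental cycles in the links
        name = parent_of[name]
    return name


def mergeable(parent_keys, child_keys, parent_of):
    """A child loop may merge with its parent only if both key sets
    trace back to the same set of ultimate parent data names."""
    parent_roots = {root(key, parent_of) for key in parent_keys}
    child_roots = {root(key, parent_of) for key in child_keys}
    return parent_roots == child_roots


# Links from child key data names to their parents.
# The last two entries are hypothetical extension-dictionary additions.
parent_of = {
    "_atom_type_scat.symbol": "_atom_type.symbol",
    "_atom_type.phase_id": "_phase.id",
    "_atom_type_scat.diffrn_id": "_diffrn.id",
}

# Core dictionary: both categories are keyed on the atom type symbol only.
core_ok = mergeable(
    ["_atom_type.symbol"], ["_atom_type_scat.symbol"], parent_of
)

# Extension: each category gains a key with a different ultimate parent.
extended_ok = mergeable(
    ["_atom_type.symbol", "_atom_type.phase_id"],
    ["_atom_type_scat.symbol", "_atom_type_scat.diffrn_id"],
    parent_of,
)

print("core mergeable:", core_ok)
print("extended mergeable:", extended_ok)
```

Under these assumptions the core categories remain mergeable, while the extended ones do not and would have to be tabulated in separate loops.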

vaitkus commented 4 years ago

Thank you for clarifying. The approach seems sound.

Alternative ways of achieving the same result would be: 1) Adding a child of _diffrn.id to ATOM_TYPE_SCAT category, a child of phase.id to the ATOM_TYPE category and redeclaring the ATOM_TYPE_SCAT category as having a different parent category (i.e. the HEAD category of the dictionary); 2) Reintroducing the previously removed _category.parent_join data item (see 1) that specified if a category list could be merged with the parent list.

To me, option 1) seems slightly clearer than the one you suggested, since it does not require modifying the existing definition of a parent-child relationship. However, there may be some arguments against redefining the parent category that I am not aware of.

jamesrhester commented 4 years ago

I agree that redeclaring the parent category is simpler than modifying the definition of the parent-child relationship. Sounds like a plan. I will put it out for consultation, but as the DDLm group have a few things on their plate at the moment I'll try to get those resolved before dropping this on them!

jamesrhester commented 3 years ago

One thing to remember is that dREL methods allow the data values of a child category to be addressed using just the parent category name and the child category's object name. Any method in atom_site that does this would need to be (trivially) rewritten to explicitly reference the former child category and use the appropriate key data name values to access the category.

COMCIFS / cif_core

Some legacy categories should be split but how? #177