COMCIFS / cif_core

The IUCr CIF core dictionary

Proposal: split out multi-block-related data items to separate dictionary #448

Closed: jamesrhester closed this issue 9 months ago

jamesrhester commented 12 months ago

See #442 and #445 for earlier discussion.

I think it would be a good idea to create a multi-block dictionary containing data names required for description of multi-block data sets, that is, Set category keys and their children. The multi-block dictionary would import the core dictionary and thus be an expansion of it.

Advantages:

  1. Data descriptions requiring multiple blocks (e.g. powder) would import the multi-block dictionary as a way of indicating this.
  2. _audit_conform.dict_name etc. can indicate the multi-block dictionary as a way of signalling the use of multiple blocks to validation software.
  3. The core dictionary is not cluttered by data names that are mostly elided in data files.
  4. The multi-block dictionary provides a focal point for discussing and demonstrating multi-block data representations.
  5. The barrier to entry for dictionary and software authors is lower.
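
As a rough illustration of advantage 2, validation software could look at the dictionaries a data block declares before deciding how to treat it. This is only a sketch: the dictionary file name `multi_block_core.dic`, the function name `signals_multiblock`, and the representation of a data block as a plain Python dict are assumptions made for the example, not names taken from any released dictionary or library.

```python
# Hypothetical file name for the proposed multi-block dictionary (assumption).
MULTIBLOCK_DICT_NAME = "multi_block_core.dic"


def signals_multiblock(block):
    """Return True if the block declares conformance to the multi-block dictionary.

    `block` is assumed to be a dict mapping data names to either a single
    string (unlooped item) or a list of strings (looped item).
    """
    names = block.get("_audit_conform.dict_name", [])
    if isinstance(names, str):
        # A single, unlooped value: treat it as a one-element list.
        names = [names]
    return any(name.lower() == MULTIBLOCK_DICT_NAME for name in names)


example_block = {
    "_audit_conform.dict_name": ["cif_core.dic", "multi_block_core.dic"],
}
print(signals_multiblock(example_block))
```

A block that lists only the core dictionary would return False here, which a validator could take as a hint that the block is meant to be self-contained.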

Disadvantages:

  1. Someone encountering a multi-block data name may look for it in the CIF core dictionary and not find it. This is no different from finding a powder-related name and not seeing it in the core dictionary, so it is not a big problem.

Some simple technical questions:

  1. Should this dictionary have its own repository or be bundled with cif_core.dic, templ_attr/enum.dic and ddl.dic? I think no to bundling, as bundling would require us to pin down the relationships at the same time as core updates.
  2. Should we move the multi-block items that we have already included in the core dictionary into this multi-block dictionary? Yes?

There may be other technical questions, but they can wait until we've decided on a new repository for it or not.

@vaitkus @jcbollinger @rowlesmr Do we agree this is a good idea?

vaitkus commented 12 months ago

> Should this dictionary have its own repository or be bundled with cif_core.dic, templ_attr/enum.dic and ddl.dic? I think no to bundling, as bundling would require us to pin down the relationships at the same time as core updates.

A separate repository or a separate branch would be cleaner, IMHO.

> Should we remove the earlier items that we have included in the core dictionary into this multi-block dictionary? Yes?

Do you have any specific id items in mind?

Preferably, the mb-CIF_CORE dictionary (multi-block CIF_CORE, working title) would be quite small in comparison to the original CIF_CORE dictionary. It would only be responsible for redefining the desired categories from 'Set' to 'Loop' and introducing the additional key data items. Hopefully, this can be done by simply importing CIF_CORE into mb-CIF_CORE under the Ignore mode (a similar thing is seemingly already being done with the REFLN category in the CIF_POW dictionary). If this works as I imagine it to work, then there is no need to explicitly move any items from CIF_CORE to mb-CIF_CORE at this point (but this does not prevent us from doing so once we have a more stable version of mb-CIF_CORE). Does this make sense?
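
To make the import idea concrete, here is a minimal Python model of importing one dictionary into another with duplicates ignored. It assumes that "Ignore" means a definition already present in the importing dictionary wins and the imported definition with the same id is skipped, which is my reading of the DDLm duplicate handling rather than something confirmed here. The function name `import_definitions`, the dict-based representation of definitions, and the example contents are all hypothetical.

```python
def import_definitions(local_defs, imported_defs, dupl="Exit"):
    """Merge imported definitions into a dictionary's own definitions.

    Both arguments map a definition id to a definition (any object).
    `dupl` controls what happens when an id exists on both sides:
      "Ignore"  - keep the local definition, skip the imported one
      "Replace" - overwrite the local definition with the imported one
      "Exit"    - treat the clash as an error
    """
    merged = dict(local_defs)
    for def_id, definition in imported_defs.items():
        if def_id in merged:
            if dupl == "Ignore":
                continue
            if dupl == "Replace":
                merged[def_id] = definition
                continue
            raise ValueError(f"duplicate definition: {def_id}")
        merged[def_id] = definition
    return merged


# Toy stand-ins for the two dictionaries (contents are illustrative only).
core = {
    "diffrn": {"class": "Set", "keys": []},
    "cell": {"class": "Set", "keys": []},
}
mb_core = {
    "diffrn": {"class": "Set", "keys": ["_diffrn.id"]},
}

merged = import_definitions(mb_core, core, dupl="Ignore")
print(merged["diffrn"]["keys"])  # the keyed definition from mb_core survives
print(sorted(merged))            # everything else comes from core
```

Under this model the multi-block dictionary only needs to carry the definitions it changes, and nothing has to be deleted from the core dictionary for the combined result to be consistent.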

rowlesmr commented 12 months ago

Is it necessary to redefine Set categories to Loop, or just to introduce new Set category keys and relationships?

E.g. Structure, something related to the collected data, something about the specimen...

That would maintain maximum compatibility with core, while enabling a single set of database tables to be constructed.

jamesrhester commented 12 months ago

No Set categories ever become Loop categories, as the meaning of a Set category is that, under the default schema, only single values of data names in this category are allowed in a single data block. If a Set category is provided with one or more key data names, then it is possible to combine values appearing in different data blocks. This combining of values is notional, that is, it is possible to distribute them into a relational schema, or to create a single data block, but there is no requirement to actually do so.

Under the non-default schema, all categories with defined key data names are treated as Loop categories, that is, multiple values for data names can be present in a single data block.
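
The notional combining described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: data blocks are modelled as plain dicts, the function name `combine_set_category` is made up for the example, and `_diffrn.id` with `_diffrn.ambient_temperature` is used as the keyed Set category. Each block holds a single value per data name (the default schema), and the key data name is what allows rows from different blocks to sit in one table.

```python
def combine_set_category(blocks, key_name, item_names):
    """Collect single-valued Set category items from many blocks into rows.

    Returns a dict mapping each key value to a row (dict of data name to value).
    Blocks that do not give the key data name are skipped.
    """
    rows = {}
    for block in blocks:
        if key_name not in block:
            continue
        present = {
            name: block[name]
            for name in (key_name, *item_names)
            if name in block
        }
        for name, value in present.items():
            # Default schema: a Set category item has one value per block.
            if isinstance(value, (list, tuple)):
                raise ValueError(f"{name} has multiple values in one block")
        row = rows.setdefault(block[key_name], {})
        for name, value in present.items():
            # The same key seen in two blocks must not give conflicting values.
            if name in row and row[name] != value:
                raise ValueError(f"conflicting values for {name}")
            row[name] = value
    return rows


blocks = [
    {"_diffrn.id": "low_T", "_diffrn.ambient_temperature": "100"},
    {"_diffrn.id": "room_T", "_diffrn.ambient_temperature": "295"},
]
table = combine_set_category(
    blocks, "_diffrn.id", ["_diffrn.ambient_temperature"]
)
for key, row in table.items():
    print(key, row["_diffrn.ambient_temperature"])
```

The resulting table is what a relational schema or a merged single block would contain; nothing obliges software to build it, which is the sense in which the combination is notional.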

jamesrhester commented 12 months ago

> Do you have any specific id items in mind?

Anything where a Set category has been provided with key data names. In 3.2.0 that is DIFFRN and EXPTL_CRYSTAL, allowing multiple conditions and multiple crystals to be captured by using multiple data blocks. So those categories and key data names and child data names would go in the new dictionary.

> Preferably, the mb-CIF_CORE dictionary (multi-block CIF_CORE, working title) would be quite small in comparison to the original CIF_CORE dictionary. It would only be responsible for redefining the desired categories from 'Set' to 'Loop' and introducing the additional key data items. Hopefully, this can be done by simply importing CIF_CORE into mb-CIF_CORE under the Ignore mode (a similar thing is seemingly already being done with the REFLN category in the CIF_POW dictionary). If this works as I imagine it to work, then there is no need to explicitly move any items from CIF_CORE to mb-CIF_CORE at this point (but this does not prevent us from doing so once we have a more stable version of mb-CIF_CORE). Does this make sense?

Yes, it would be much, much smaller and make it very easy to see the multi-block relationships. It should indeed work as you imagine using imports. Starting by moving the above two category redefinitions into the multi-block dictionary might be a good way to begin; we can leave the original definitions in the core version until after the multi-block dictionary has been released if we are concerned about dropping definitions.

rowlesmr commented 12 months ago

Can we still repurpose STRUCTURE in core, just without the key?

rowlesmr commented 12 months ago

We'll need to be careful to differentiate between structure, specimen, and crystal, as they are sometimes used as synonyms in CORE, but mean very different things in powder.

For instance, ATOM_ANALYTICAL was put in to give a place to record elemental information about the specimen, but it is within ATOM, which can be construed as being a structure, esp. as ATOM_SITE has a structure id. CHEMICAL refers to the compound under study, but is described as being consistent with the unit cell etc., so it is really a structure.

jamesrhester commented 11 months ago

The ATOM and STRUCTURE categories are semantically "empty" categories that are purely there for organisational purposes and do not carry any implications for data processing. So we are free to rename them, shuffle them, remove them, change their children and so on. The only caveat is that we don't want the next edition of Volume G referring to them in their current role if we have subsequently changed something, so if we do want to repurpose them we should decide on that soon.

I proposed repurposing the STRUCTURE category as that name seemed like one that was too good to leave as a largely pointless organisational category.

jamesrhester commented 9 months ago

Closing this issue as the new multiblock dictionary has been created.