Repurpose the `STRUCTURE` category

jamesrhester commented 1 year ago

Closes #442

Repurpose STRUCTURE category to make it possible to reference a structure, and to describe the components of a structure.

This pull request makes it possible to reference a "structure" in the following steps:

All categories describing the space group are assigned a "_space_group.id" key data name or pointer. This makes explicit the fact that they all relate to the same space group.
1. The STRUCTURE category now holds a description of structure, which consists of a pointer to the space group and to the diffraction conditions that it relates to.
The ATOM_SITE and CELL categories now include a key data name pointer to _structure.id, identifying which structure an atom or cell relates to.
More controversially, _cell.diffrn_id has been replaced by _structure.diffrn_id. To strictly follow our rules, it should be deprecated, but it is unlikely to have been used since the release.

As re-ordering was necessary, the first commit in the series is the easiest one to read in order to understand the changes.

vaitkus commented 1 year ago

As discussed in #442, I do not think that the introduction of structure id's is the right way to go as it highly complicates the data representation model (i.e. makes it less human-readable).

That being said, if we are going with this model, shouldn't all of the derivative data categories like GEOM_ANGLE, GEOM_BOND, etc. also be re-keyed?

jamesrhester commented 1 year ago

That being said, if we are going with this model, shouldn't all of the derivative data categories like GEOM_ANGLE, GEOM_BOND, etc. also be re-keyed?

Yes, they should also have a pointer to _structure.id. I can add those if/when this PR is accepted. There are also implications for powder: now PD_PHASE can have a pointer to _structure.id to identify the structural information for the phase. For modulated structures and magnetism there will also be implications, as magnetic structures have a child category of ATOM_SITE for magnetic moments.

If readability or apparent complexity is a problem, perhaps all of these "Set category keys" could be moved into a "multi-block ready" dictionary, which would be imported by any dictionary wishing to describe data using more than one data block. Then the core dictionary would be strictly for single-block standalone data.

rowlesmr commented 1 year ago

3. The ATOM_SITE and CELL categories now include a key data name pointer to _structure.id, identifying which structure an atom or cell relates to.

Shouldn't these be linked names?

.

Why is SPACE_GROUP special, inasmuch as it doesn't have a structure_id, and STRUCTURE has a space_group_id? It would be nice just to be able to wang a _structure.id in there and it all work, without having to also add a _space_group.id. This also means that we need to worry about _space_group.id being unique, not just _structure.id.

.

Having said that, after writing out an example, I can see why you'd want to have a _space_group.id; the symops et al do actually belong to the space group, not the structure, and it does give you the ability to not repeat space group info if multiple structures happen to have the same space group.

.

On Set category keys (or single packet Loops), is there a rule saying that if a key is not present in a block, you get an auto-generated one for free?

jamesrhester commented 1 year ago

On Set category keys (or single packet Loops), is there a rule saying that if a key is not present in a block, you get an auto-generated one for free?

I think that would be a harmless rule to have. Maybe not as useful as it looks, because somewhere else that item will be referenced, otherwise there's no point to having it identified in the first place. So then that "other place" will need to know what the ID is. For example, if a parent structure wants to reference a child structure, or a phase wants to identify what structure it has.

Having said that, after https://github.com/COMCIFS/cif_core/issues/442#issuecomment-1630845119, I can see why you'd want to have a _space_group.id; the symops et al do actually belong to the space group, not the structure, and it does give you the ability to not repeat space group info if multiple structures happen to have the same space group.

Yes, that was my thinking. Identical cells and atomic positions are unlikely to arise in a single dataset, whereas identical space groups are sufficiently likely (multi-temperature etc.) that there is a lot of economy in giving a space group its own ID. Of course the H-M symbol or IT number is insufficient for this.

The ATOM_SITE and CELL categories now include a key data name pointer to _structure.id, identifying which structure an atom or cell relates to.

Shouldn't these be linked names?

There is possibly a bit too much flexibility currently in that a structure could be created that had only a cell or only atom sites. But retaining this flexibility could be worthwhile, as you might be able to list the atoms in a unit cell without knowing anything else, maybe in a theoretical calculations. Adding a _cell.id and then pointing to it from atom_site is an alternative.

rowlesmr commented 1 year ago

On Set category keys (or single packet Loops), is there a rule saying that if a key is not present in a block, you get an auto-generated one for free?

I think that would be a harmless rule to have. Maybe not as useful as it looks, because somewhere else that item will be referenced, otherwise there's no point to having it identified in the first place. So then that "other place" will need to know what the ID is. For example, if a parent structure wants to reference a child structure, or a phase wants to identify what structure it has.

I'm thinking more along the lines of a _structure.space_group_id-type use, a la:

data_block
_structure.id A
_space_group.IT_number 62

I don't need to give an explicit _space_group.id. It gets auto-generated and autolinked to _structure.space_group_id. In this exact case you won't get the benefit of eliding exact space group duplicates, but this is just an indicative example.

jamesrhester commented 1 year ago

With key data names the principle is that their values can be assumed if the parent data name has a single value. However, for something like _structure.space_group_id, which is not a key data name, that rule doesn't exist. To see why, consider _cell_measurement.condition_id, which is a child data name of _diffrn.id which exists specifically to allow pointing to a different _diffrn.id to the one in the current block. We have provided a default, that does point to a key data name and will therefore benefit from the "single value" rule.

So I think for non-key data names like _structure.space_group_id, where it is possible that space group information is in a separate data block, the best we can do is to provide a default dREL method, and if ambiguity is present that default will simply return unknown.

_structure.space_group_id = space_group.id

can fail if there is no way to obtain a unique row of space_group using common key data names. Currently, that is any time that there is more than one space group in context.

jamesrhester commented 1 year ago

This pull request should now be recreated relative to the multiblock dictionary.

COMCIFS / cif_core

Repurpose the `STRUCTURE` category #445