COMCIFS / cif_core

The IUCr CIF core dictionary
15 stars 9 forks source link

Create a category to hold structure description items #442

Open jamesrhester opened 1 year ago

jamesrhester commented 1 year ago

A need has been expressed to refer to alternative structural descriptions, e.g. for a child structure to refer to a parent structure, where the child structure may be the result of a particular symmetry mode changing. This means we need a "structure id" from which the space group, cell and atomic sites can be obtained.

Currently we have the STRUCTURE category, which is purely a housekeeping construct (no datanames of its own) and includes things like ATOM_ANALYTICAL and REFINE which are not directly related to describing the structure.

So I think we need to repurpose STRUCTURE (as it is not "used" as such) to be that structure description category, add an "id" data name to it, and reparent those categories that don't belong in the new STRUCTURE to the top-level category until such time as we discover they also need to be grouped.

Originally came up in https://github.com/COMCIFS/Powder_Dictionary/issues/164.

jamesrhester commented 1 year ago

A few further thoughts after experimenting a bit.

The direction of the functional arrows has to be decided. That is, (a) given a _structure.id will that imply a particular cell/space group/set of atomic sites (ie a functional mapping from _structure.id) and/or (b) given a cell/space group/list of atomic sites does that imply a particular _structure.id?

Either are possible. If we want to map from structure to cell etc., then we have to have a way of identifying a cell/space group/list of atomic sites, and then we can economise space if any of these happen to be identical for different structures. If we want to map from cell etc to structure, then we must add an additional key data name to each category pointing to _structure.id, and we can't take advantage of identical space groups/cells/atom site lists to economise space.

Here is how the mapping from structure -> cell etc. works after blocks are merged:

loop_
_structure.id
_structure.cell_id
_structure.space_group_id
_structure.atomic_site_id
S1 C1 SG1 AS1
S2 C2 SG2 AS2

loop_
_atom_site.label
_atom_site.site_list_id
_atom_site.fract_xyz #or whatever, don't make me look it up
O AS1 [0 0.5 0]
C  AS1 [0 0 0]
O AS2 [0.2 0.3 0.4]
La AS2 [0.125 0.5 0]

loop_
_cell.id
_cell.length_a
C1 5.2
C2 4.8

loop_
_space_group.id
_space_group.IT_number
SG1 223
SG2 221

The alternative (cell etc -> structure):

loop_
_structure.id
S1 S2

loop_
_atom_site.label
_atom_site.structure_id
_atom_site.fract_xyz #or whatever, don't make me look it up
O S1 [0 0.5 0]
C  S1 [0 0 0]
O S2 [0.2 0.3 0.4]
La S2 [0.125 0.5 0]

loop_
_cell.structure_id
_cell.length_a
S1 5.2
S2 4.8

loop_
_space_group.structure_id
_space_group.IT_number
S1 223
S2 221

# and space group symop/Wyckoff/generator

Conclusion

The only serious space saving relates to space group, as during a cooling/heating run the space group may remain the same for many temperatures. So I think we create a separate space group identifier, but cell and atom site simply have a pointer to structure. I will prepare a pull request along those lines.

jamesrhester commented 1 year ago

A final issue now relates to CELL. We recently added _cell.diffrn_id as the key for CELL, and following the proposal in the previous comment we additionally now add _cell.structure_id. Either one of these uniquely identifies a cell and could be the key. Ideally, we would have _structure.diffrn_id to identify the conditions and sample that the structure was measured under, and remove _cell.diffrn_id completely, as it is redundant because it can be found via _cell.structure_id.

The reason I don't like _cell.diffrn_id as a data name now is that it implies that the cell is a function of diffraction conditions, rather than the structure overall being a function of the diffraction conditions. Also, if we have calculated structures then a pointer to diffraction conditions may not make any sense.

(Edit): Of course, a cell can be determined without knowing the full structure. This is catered for by simply having "." for the space group in the structure description.

vaitkus commented 1 year ago

This seems quite a major change since it requires to re-key a lot of the main categories to include the structure id. Couldn't the same result be achieved by having the two structure descriptions in separate data blocks and relating them using the AUDIT_LINK category (or some other newly added data items)?

Currently, a single data block can be expected to represent a single structure (at least in the case of simple non-powder experiments), but with this change pretty much all categories would have to become looped (e.g. what if we use slightly different parameters in the refinement, etc.)

rowlesmr commented 1 year ago

I will have to have a think on a (the?) "correct" mapping.

Couldn't the same result be achieved by having the two structure descriptions in separate data blocks and relating them using the AUDIT_LINK category (or some other newly added data items)?

My reading of the example is that this is what a single block would look like after merging all of the individual blocks. My main wish is to be able construct a single set of (database) tables from a grouping of related data blocks.

For single-crystal stuff, this is probably redundant, as you can fit most things into one block, but is essential for powder.

jamesrhester commented 1 year ago

I'm not sure I understand @vaitkus's objection, as the idea is that structures would indeed appear in separate data blocks. The example was purely to show what the blocks would look like after (notional) merging. Before merging, they would look like:

data_block1

_structure.id S1
_cell.length_a 5.2
_space_group.IT_number 223

loop_
_atom_site.label
_atom_site.fract_xyz #or whatever, don't make me look it up
O [0 0.5 0]
C  [0 0 0]

data_block2

_structure.id S2
_cell.length_a 4.8
_space_group.IT_number 221

loop_
_atom_site.label
_atom_site.fract_xyz
O [0.2 0.3 0.4]
La [0.125 0.5 0]

which doesn't appear to be a major change to me.

In any case, there is no concept of "a data block" in the underlying, purely relational, data model - as assumed by the multi-block principles and dREL. So references to data blocks in data names from the AUDIT_LINK category refer simply to values of data block identifiers, and cannot refer to the whole group of values that belong to a particular data block.

All that these changes are doing is making explicit relationships that were, up until now, implicit. Calculations with atomic positions used "the" unit cell parameters and "the" space group, understood as those appearing in the same data block.

While the re-keying may look dramatic, because STRUCTURE is a Set category, there is zero need to actually use these data names in data blocks for the default _audit.schema, as seen in the above example.

rowlesmr commented 1 year ago

I think I prefer the alternative method. It's more human-readable, as I can just look at (for example) an atom site and know which structure it belongs to.

loop_
_structure.id
S1 S2

loop_
_atom_site.label
_atom_site.structure_id
_atom_site.fract_xyz #or whatever, don't make me look it up
O S1 [0 0.5 0]
#...

There are less different identifiers to have, and, to my mind, this is saying "there is a structure, S1, and these cell parameters, space group, atoms... belong to it.". The other method has another level of indirection to it.

I think that the implementation is also a little simpler: just add a _*.structure_id to a category and you're good to go, as opposed to adding an _*.id and a _structure.*_id.

rowlesmr commented 1 year ago

A final issue now relates to CELL. We recently added _cell.diffrn_id as the key for CELL, and following the proposal in the previous comment we additionally now add _cell.structure_id. Either one of these uniquely identifies a cell and could be the key.

Is the diffrnid in the nearly-released version? Could it just be removed?

I would see _*.structure_id as the ideal key dataname for these Set categories.

vaitkus commented 1 year ago

@jamesrhester, ok, I did not initially understand that the proposed representation is only an alternative and not a direct replacement.

However, maybe this re-keying could indeed first be carried out in a separate dictionary as suggested in the discussion of PR #445? (see https://github.com/COMCIFS/cif_core/pull/445#issuecomment-1630113322) which can then be imported as needed? This would give us more freedom to experiment with the relationships of this model without affecting the CIF_CORE dictionary too much. When we see that this model is mature enough, it can be merged with the CIF_CORE if so desired. Maybe I am too paranoid, but I do feel that there are some underlying complexities here that will only become clear once we try to use the model with real world data.

which doesn't appear to be a major change to me.

It is a major change from the point of existing software. Programs that known nothing about the existence of _structure.id and related fields will simply assume that all atom sites belong to the same structure, etc. Of course, this can be eventually fixed by the software maintainers, but changes like these must be very clearly communicated to the community.

vaitkus commented 1 year ago

@rowlesmr, @jamesrhester The only tagged version 3.0.13 did not yet have the _cell.diffrn_id data item while the release candidate version rc-3.2.0 does have it. However, I really doubt that any software has implemented the proper handling of this item, so it should be quite save to remove it if needed.

However, as I recall _cell.diffrn_id was mainly used to describe experiments in which the cell was measured under a different set of conditions than the atomic coordinates (see the cell-measurement-multi-block.cif and cell-measurement-single-block.cif files under examples/). How would this be represented under the _structure.id model?

rowlesmr commented 1 year ago

Just making up a slightly more complicated CIF example:


###################################
#
#  Beginning of the CIF file
#
###################################

data_blockname_one
_structure.id A   #Req'd by the concept of having a "structure". Must be unique.

_cell.length_a            5.4469(5)
_cell.length_b            5.4469(5)
_cell.length_c            5.4469(5)
_cell.angle_alpha        90
_cell.angle_beta         90
_cell.angle_gamma        90
_cell.volume            161.61(4)
_cell.formula_units_Z     4         

_space_group.id 1 #Req'd by James' PR. Must be unique. Can be the same if
                  # representing the same space group.
_space_group.crystal_system   cubic
_space_group.name_H-M_alt     Fm-3m

loop_
  _space_group_symop_id
  _space_group_symop_operation_xyz
    1 'x, y, z '
    2 '-x, -y, z '
  #...
  191 'x, y+1/2, -z+1/2 '
  192 '-x, -y+1/2, -z+1/2 '

loop_
  _atom_site.label
  _atom_site.type_symbol
  _atom_site.fract_xyz
  _atom_site.B_iso_or_equiv
Mn1 Mn+2 [0   0   0]   0.2(5)
Se1 Se   [0.5 0.5 0.5] 0.4(3)

loop_
  _atom_site_aniso.label
  _atom_site_aniso.b_11
  _atom_site_aniso.b_12
  _atom_site_aniso.b_13
  _atom_site_aniso.b_22
  _atom_site_aniso.b_23
  _atom_site_aniso.b_33
Mn1 0.2 0.05 0.08 0.2 0.03 0.2

data_blockname2
_structure.id B   #Req'd by the concept of having a "structure". Must be unique.

_cell.length_a            3.0205(4)
_cell.length_b            3.0205(4)
_cell.length_c            3.0205(4)
_cell.angle_alpha        90
_cell.angle_beta         90
_cell.angle_gamma        90
_cell.volume             27.558(12)
_cell.formula_units_Z     2        

_space_group.id 2 #Req'd by James' PR. Must be unique. Can be the same if
                  # representing the same space group.
_space_group.crystal_system cubic
_space_group.name_H-M_alt Im-3m
loop_
  _space_group.symop_id
  _space_group.symop_operation_xyz
    1 'x, y, z '
    2 '-x, -y, z '
   #...
   95 'x+1/2, y+1/2, -z+1/2 '
   96 '-x+1/2, -y+1/2, -z+1/2 '

loop_
  _atom_site.label
  _atom_site.type_symbol
  _atom_site.fract_xyz
  _atom_site.B_iso_or_equiv
V1 V [0 0 0] 0.5(2)

###################################
#
#  End of the CIF file
#
###################################

As per James' PR (AFAIK)

###################################
#
#  A representation of a merged datablock.
#  It shouldn't actually be used this way to construct
#  a CIF file, but maps out how the relational tables 
#  would be populated.
#
###################################

#this is the merged datablock assuming _structure.space_group_id exists
data_merged_hester
loop_
  _structure.id
  _structure.space_group_id
A 1
B 2

loop_
  _cell.structure_id
  _cell.length_a
  _cell.length_a_su
  _cell.length_b       
  _cell.length_b_su       
  _cell.length_c       
  _cell.length_c_su       
  _cell.angle_alpha    
  _cell.angle_beta     
  _cell.angle_gamma    
  _cell.volume         
  _cell.volume_su         
  _cell.formula_units_Z  
A  5.4469 0.0005 5.4469 0.0005 5.4469 0.0005 90 90 90 161.61  0.04   4        
B  3.0205 0.0004 3.0205 0.0004 3.0205 0.0004 90 90 90  27.558 0.012  2        

# if both structures had the same SG, then you only need to include the one SG
loop_
  _space_group.id
  _space_group.crystal_system     
  _space_group.name_H-M_alt     
1 cubic Fm-3m 
2 cubic Im-3m

loop_
  _space_group_symop.space_group_id
  _space_group_symop.id
  _space_group_symop.operation_xyz
1   1 'x, y, z '
1   2 '-x, -y, z '
#...
1 191 'x, y+1/2, -z+1/2 '
1 192 '-x, -y+1/2, -z+1/2 '
2   1 'x, y, z '
2   2 '-x, -y, z '
#...
2  95 'x+1/2, y+1/2, -z+1/2 '
2  96 '-x+1/2, -y+1/2, -z+1/2 '

loop_
  _atom_site.structure_id
  _atom_site.label
  _atom_site.type_symbol
  _atom_site.fract_xyz
  _atom_site.B_iso_or_equiv
  _atom_site.B_iso_or_equiv_su
A Mn1 Mn+2 [0   0   0]   0.2 0.5
A Se1 Se   [0.5 0.5 0.5] 0.4 0.3
B V1  V    [0   0   0]   0.5 0.2

loop_
  _atom_site_aniso.structure_id
  _atom_site_aniso.label
  _atom_site_aniso.b_11
  _atom_site_aniso.b_12
  _atom_site_aniso.b_13
  _atom_site_aniso.b_22
  _atom_site_aniso.b_23
  _atom_site_aniso.b_33
A Mn1 0.2 0.05 0.08 0.2 0.03 0.2

My initial thoughts on how it would work.

###################################
#
#  Another representation of a merged datablock.
#  It shouldn't actually be used this way to construct
#  a CIF file, but maps out how the relational tables 
#  would be populated.
#
###################################

#this is the merged datablock assuming _space_group_*.structure_id exists
data_merged_rowles
loop_
  _structure.id
A
B

loop_
  _cell.structure_id
  _cell.length_a
  _cell.length_a_su
  _cell.length_b       
  _cell.length_b_su       
  _cell.length_c       
  _cell.length_c_su       
  _cell.angle_alpha    
  _cell.angle_beta     
  _cell.angle_gamma    
  _cell.volume         
  _cell.volume_su         
  _cell.formula_units_Z  
A  5.4469 0.0005 5.4469 0.0005 5.4469 0.0005 90 90 90 161.61  0.04   4        
B  3.0205 0.0004 3.0205 0.0004 3.0205 0.0004 90 90 90  27.558 0.012  2        

# if both structures had the same SG, you still need to include both SGs
loop_
  _space_group.structure_id
  _space_group.crystal_system     
  _space_group.name_H-M_alt     
A cubic Fm-3m
B cubic Im-3m

loop_
  _space_group_symop.structure_id
  _space_group_symop.id
  _space_group_symop.operation_xyz
A   1 'x, y, z '
A   2 '-x, -y, z '
#...
A 191 'x, y+1/2, -z+1/2 '
A 192 '-x, -y+1/2, -z+1/2 '
B   1 'x, y, z '
B   2 '-x, -y, z '
#...
B  95 'x+1/2, y+1/2, -z+1/2 '
B  96 '-x+1/2, -y+1/2, -z+1/2 '

loop_
  _atom_site.structure_id
  _atom_site.label
  _atom_site.type_symbol
  _atom_site.fract_xyz
  _atom_site.B_iso_or_equiv
  _atom_site.B_iso_or_equiv_su
A Mn1 Mn+2 [0   0   0]   0.2 0.5
A Se1 Se   [0.5 0.5 0.5] 0.4 0.3
B V1  V    [0   0   0]   0.5 0.2

loop_
  _atom_site_aniso.structure_id
  _atom_site_aniso.label
  _atom_site_aniso.b_11
  _atom_site_aniso.b_12
  _atom_site_aniso.b_13
  _atom_site_aniso.b_22
  _atom_site_aniso.b_23
  _atom_site_aniso.b_33
A Mn1 0.2 0.05 0.08 0.2 0.03 0.2
rowlesmr commented 1 year ago

Programs that known nothing about the existence of _structure.id and related fields will simply assume that all atom sites belong to the same structure, etc. Of course, this can be eventually fixed by the software maintainers, but changes like these must be very clearly communicated to the community.

Shouldn't the presence of either Set categories being looped or a non-default _audit.schema be some indication that something is awry?

I do agree that it must be communicated clearly, but unless you're working in multi-block CIF files, I don't think anything changes?

vaitkus commented 1 year ago

Shouldn't the presence of either Set categories being looped or a non-default _audit.schema be some indication that something is awry?

I do agree that it must be communicated clearly, but unless you're working in multi-block CIF files, I don't think anything changes?

I am looking at this from the position of traditional CIF_1.1/DDL1 software which is not aware of the _audit.schema and expects one structure per data block. Furthermore, it is quite easy to miss a Set category that is looped when processing files in bulk unless you explicitly check for that (e.g. using a validating parser or a separate validator).

For example, the parser that we use in our COD pipeline puts looped and unlooped values in the same array data structure since it allows to access the data in a more uniform way. Under this approach, single valued items become arrays with a single element. Other programs that use this parser only deal with the first element of such arrays and would thus not notice any additional array elements. I have also encountered other CIF programs that behave in a similar way.

jamesrhester commented 1 year ago

The points raised by @vaitkus were discussed in 2016 when developing this approach: please see here and links therein.

In practice we expect everyone to use the default _audit.schema, in which case data blocks, even if part of a larger collection, will be correctly interpreted within the assumptions of the reading software, that is, one structure among a collection of structures will be correctly understood as a structure (e.g. correct density calculated), but its relationships to other structures in the file will be unknown.

Non-default _audit.schema are there for the "problem children" <cough>powder</cough> who want to have summary blocks collecting information together. CIF authors are also motivated to make sure that their outputs are readable by software, so I anticipate non-default _audit.schema appearing only in very controlled circumstances.

jamesrhester commented 1 year ago

Just to confirm @rowlesmr 's example above demonstrates exactly how it should work.