Open jamesrhester opened 1 year ago
A few further thoughts after experimenting a bit.
The direction of the functional arrows has to be decided. That is, (a) given a _structure.id
will that imply a particular cell/space group/set of atomic sites (ie a functional mapping from _structure.id
) and/or (b) given a cell/space group/list of atomic sites does that imply a particular _structure.id
?
Either are possible. If we want to map from structure to cell etc., then we have to have a way of identifying a cell/space group/list of atomic sites, and then we can economise space if any of these happen to be identical for different structures. If we want to map from cell etc to structure, then we must add an additional key data name to each category pointing to _structure.id
, and we can't take advantage of identical space groups/cells/atom site lists to
economise space.
Here is how the mapping from structure -> cell etc. works after blocks are merged:
loop_
_structure.id
_structure.cell_id
_structure.space_group_id
_structure.atomic_site_id
S1 C1 SG1 AS1
S2 C2 SG2 AS2
loop_
_atom_site.label
_atom_site.site_list_id
_atom_site.fract_xyz #or whatever, don't make me look it up
O AS1 [0 0.5 0]
C AS1 [0 0 0]
O AS2 [0.2 0.3 0.4]
La AS2 [0.125 0.5 0]
loop_
_cell.id
_cell.length_a
C1 5.2
C2 4.8
loop_
_space_group.id
_space_group.IT_number
SG1 223
SG2 221
The alternative (cell etc -> structure):
loop_
_structure.id
S1 S2
loop_
_atom_site.label
_atom_site.structure_id
_atom_site.fract_xyz #or whatever, don't make me look it up
O S1 [0 0.5 0]
C S1 [0 0 0]
O S2 [0.2 0.3 0.4]
La S2 [0.125 0.5 0]
loop_
_cell.structure_id
_cell.length_a
S1 5.2
S2 4.8
loop_
_space_group.structure_id
_space_group.IT_number
S1 223
S2 221
# and space group symop/Wyckoff/generator
The only serious space saving relates to space group, as during a cooling/heating run the space group may remain the same for many temperatures. So I think we create a separate space group identifier, but cell and atom site simply have a pointer to structure. I will prepare a pull request along those lines.
A final issue now relates to CELL
. We recently added _cell.diffrn_id
as the key for CELL
, and following the proposal in the previous comment we additionally now add _cell.structure_id
. Either one of these uniquely identifies a cell and could be the key. Ideally, we would have _structure.diffrn_id
to identify the conditions and sample that the structure was measured under, and remove _cell.diffrn_id
completely, as it is redundant because it can be found via _cell.structure_id
.
The reason I don't like _cell.diffrn_id
as a data name now is that it implies that the cell is a function of diffraction conditions, rather than the structure overall being a function of the diffraction conditions. Also, if we have calculated structures then a pointer to diffraction conditions may not make any sense.
(Edit): Of course, a cell can be determined without knowing the full structure. This is catered for by simply having "." for the space group in the structure description.
This seems quite a major change since it requires to re-key a lot of the main categories to include the structure id. Couldn't the same result be achieved by having the two structure descriptions in separate data blocks and relating them using the AUDIT_LINK
category (or some other newly added data items)?
Currently, a single data block can be expected to represent a single structure (at least in the case of simple non-powder experiments), but with this change pretty much all categories would have to become looped (e.g. what if we use slightly different parameters in the refinement, etc.)
I will have to have a think on a (the?) "correct" mapping.
Couldn't the same result be achieved by having the two structure descriptions in separate data blocks and relating them using the
AUDIT_LINK
category (or some other newly added data items)?
My reading of the example is that this is what a single block would look like after merging all of the individual blocks. My main wish is to be able construct a single set of (database) tables from a grouping of related data blocks.
For single-crystal stuff, this is probably redundant, as you can fit most things into one block, but is essential for powder.
I'm not sure I understand @vaitkus's objection, as the idea is that structures would indeed appear in separate data blocks. The example was purely to show what the blocks would look like after (notional) merging. Before merging, they would look like:
data_block1
_structure.id S1
_cell.length_a 5.2
_space_group.IT_number 223
loop_
_atom_site.label
_atom_site.fract_xyz #or whatever, don't make me look it up
O [0 0.5 0]
C [0 0 0]
data_block2
_structure.id S2
_cell.length_a 4.8
_space_group.IT_number 221
loop_
_atom_site.label
_atom_site.fract_xyz
O [0.2 0.3 0.4]
La [0.125 0.5 0]
which doesn't appear to be a major change to me.
In any case, there is no concept of "a data block" in the underlying, purely relational, data model - as assumed by the multi-block principles and dREL. So references to data blocks in data names from the AUDIT_LINK
category refer simply to values of data block identifiers, and cannot refer to the whole group of values that belong to a particular data block.
All that these changes are doing is making explicit relationships that were, up until now, implicit. Calculations with atomic positions used "the" unit cell parameters and "the" space group, understood as those appearing in the same data block.
While the re-keying may look dramatic, because STRUCTURE
is a Set category, there is zero need to actually use these data names in data blocks for the default _audit.schema
, as seen in the above example.
I think I prefer the alternative method. It's more human-readable, as I can just look at (for example) an atom site and know which structure it belongs to.
loop_
_structure.id
S1 S2
loop_
_atom_site.label
_atom_site.structure_id
_atom_site.fract_xyz #or whatever, don't make me look it up
O S1 [0 0.5 0]
#...
There are less different identifiers to have, and, to my mind, this is saying "there is a structure, S1, and these cell parameters, space group, atoms... belong to it.". The other method has another level of indirection to it.
I think that the implementation is also a little simpler: just add a _*.structure_id
to a category and you're good to go, as opposed to adding an _*.id
and a _structure.*_id
.
A final issue now relates to
CELL
. We recently added_cell.diffrn_id
as the key forCELL
, and following the proposal in the previous comment we additionally now add_cell.structure_id
. Either one of these uniquely identifies a cell and could be the key.
Is the diffrnid in the nearly-released version? Could it just be removed?
I would see _*.structure_id
as the ideal key dataname for these Set
categories.
@jamesrhester, ok, I did not initially understand that the proposed representation is only an alternative and not a direct replacement.
However, maybe this re-keying could indeed first be carried out in a separate dictionary as suggested in the discussion of PR #445? (see https://github.com/COMCIFS/cif_core/pull/445#issuecomment-1630113322) which can then be imported as needed? This would give us more freedom to experiment with the relationships of this model without affecting the CIF_CORE
dictionary too much. When we see that this model is mature enough, it can be merged with the CIF_CORE
if so desired. Maybe I am too paranoid, but I do feel that there are some underlying complexities here that will only become clear once we try to use the model with real world data.
which doesn't appear to be a major change to me.
It is a major change from the point of existing software. Programs that known nothing about the existence of _structure.id
and related fields will simply assume that all atom sites belong to the same structure, etc. Of course, this can be eventually fixed by the software maintainers, but changes like these must be very clearly communicated to the community.
@rowlesmr, @jamesrhester The only tagged version 3.0.13
did not yet have the _cell.diffrn_id
data item while the release candidate version rc-3.2.0
does have it. However, I really doubt that any software has implemented the proper handling of this item, so it should be quite save to remove it if needed.
However, as I recall _cell.diffrn_id
was mainly used to describe experiments in which the cell was measured under a different set of conditions than the atomic coordinates (see the cell-measurement-multi-block.cif
and cell-measurement-single-block.cif
files under examples/
). How would this be represented under the _structure.id
model?
Just making up a slightly more complicated CIF example:
###################################
#
# Beginning of the CIF file
#
###################################
data_blockname_one
_structure.id A #Req'd by the concept of having a "structure". Must be unique.
_cell.length_a 5.4469(5)
_cell.length_b 5.4469(5)
_cell.length_c 5.4469(5)
_cell.angle_alpha 90
_cell.angle_beta 90
_cell.angle_gamma 90
_cell.volume 161.61(4)
_cell.formula_units_Z 4
_space_group.id 1 #Req'd by James' PR. Must be unique. Can be the same if
# representing the same space group.
_space_group.crystal_system cubic
_space_group.name_H-M_alt Fm-3m
loop_
_space_group_symop_id
_space_group_symop_operation_xyz
1 'x, y, z '
2 '-x, -y, z '
#...
191 'x, y+1/2, -z+1/2 '
192 '-x, -y+1/2, -z+1/2 '
loop_
_atom_site.label
_atom_site.type_symbol
_atom_site.fract_xyz
_atom_site.B_iso_or_equiv
Mn1 Mn+2 [0 0 0] 0.2(5)
Se1 Se [0.5 0.5 0.5] 0.4(3)
loop_
_atom_site_aniso.label
_atom_site_aniso.b_11
_atom_site_aniso.b_12
_atom_site_aniso.b_13
_atom_site_aniso.b_22
_atom_site_aniso.b_23
_atom_site_aniso.b_33
Mn1 0.2 0.05 0.08 0.2 0.03 0.2
data_blockname2
_structure.id B #Req'd by the concept of having a "structure". Must be unique.
_cell.length_a 3.0205(4)
_cell.length_b 3.0205(4)
_cell.length_c 3.0205(4)
_cell.angle_alpha 90
_cell.angle_beta 90
_cell.angle_gamma 90
_cell.volume 27.558(12)
_cell.formula_units_Z 2
_space_group.id 2 #Req'd by James' PR. Must be unique. Can be the same if
# representing the same space group.
_space_group.crystal_system cubic
_space_group.name_H-M_alt Im-3m
loop_
_space_group.symop_id
_space_group.symop_operation_xyz
1 'x, y, z '
2 '-x, -y, z '
#...
95 'x+1/2, y+1/2, -z+1/2 '
96 '-x+1/2, -y+1/2, -z+1/2 '
loop_
_atom_site.label
_atom_site.type_symbol
_atom_site.fract_xyz
_atom_site.B_iso_or_equiv
V1 V [0 0 0] 0.5(2)
###################################
#
# End of the CIF file
#
###################################
As per James' PR (AFAIK)
###################################
#
# A representation of a merged datablock.
# It shouldn't actually be used this way to construct
# a CIF file, but maps out how the relational tables
# would be populated.
#
###################################
#this is the merged datablock assuming _structure.space_group_id exists
data_merged_hester
loop_
_structure.id
_structure.space_group_id
A 1
B 2
loop_
_cell.structure_id
_cell.length_a
_cell.length_a_su
_cell.length_b
_cell.length_b_su
_cell.length_c
_cell.length_c_su
_cell.angle_alpha
_cell.angle_beta
_cell.angle_gamma
_cell.volume
_cell.volume_su
_cell.formula_units_Z
A 5.4469 0.0005 5.4469 0.0005 5.4469 0.0005 90 90 90 161.61 0.04 4
B 3.0205 0.0004 3.0205 0.0004 3.0205 0.0004 90 90 90 27.558 0.012 2
# if both structures had the same SG, then you only need to include the one SG
loop_
_space_group.id
_space_group.crystal_system
_space_group.name_H-M_alt
1 cubic Fm-3m
2 cubic Im-3m
loop_
_space_group_symop.space_group_id
_space_group_symop.id
_space_group_symop.operation_xyz
1 1 'x, y, z '
1 2 '-x, -y, z '
#...
1 191 'x, y+1/2, -z+1/2 '
1 192 '-x, -y+1/2, -z+1/2 '
2 1 'x, y, z '
2 2 '-x, -y, z '
#...
2 95 'x+1/2, y+1/2, -z+1/2 '
2 96 '-x+1/2, -y+1/2, -z+1/2 '
loop_
_atom_site.structure_id
_atom_site.label
_atom_site.type_symbol
_atom_site.fract_xyz
_atom_site.B_iso_or_equiv
_atom_site.B_iso_or_equiv_su
A Mn1 Mn+2 [0 0 0] 0.2 0.5
A Se1 Se [0.5 0.5 0.5] 0.4 0.3
B V1 V [0 0 0] 0.5 0.2
loop_
_atom_site_aniso.structure_id
_atom_site_aniso.label
_atom_site_aniso.b_11
_atom_site_aniso.b_12
_atom_site_aniso.b_13
_atom_site_aniso.b_22
_atom_site_aniso.b_23
_atom_site_aniso.b_33
A Mn1 0.2 0.05 0.08 0.2 0.03 0.2
My initial thoughts on how it would work.
###################################
#
# Another representation of a merged datablock.
# It shouldn't actually be used this way to construct
# a CIF file, but maps out how the relational tables
# would be populated.
#
###################################
#this is the merged datablock assuming _space_group_*.structure_id exists
data_merged_rowles
loop_
_structure.id
A
B
loop_
_cell.structure_id
_cell.length_a
_cell.length_a_su
_cell.length_b
_cell.length_b_su
_cell.length_c
_cell.length_c_su
_cell.angle_alpha
_cell.angle_beta
_cell.angle_gamma
_cell.volume
_cell.volume_su
_cell.formula_units_Z
A 5.4469 0.0005 5.4469 0.0005 5.4469 0.0005 90 90 90 161.61 0.04 4
B 3.0205 0.0004 3.0205 0.0004 3.0205 0.0004 90 90 90 27.558 0.012 2
# if both structures had the same SG, you still need to include both SGs
loop_
_space_group.structure_id
_space_group.crystal_system
_space_group.name_H-M_alt
A cubic Fm-3m
B cubic Im-3m
loop_
_space_group_symop.structure_id
_space_group_symop.id
_space_group_symop.operation_xyz
A 1 'x, y, z '
A 2 '-x, -y, z '
#...
A 191 'x, y+1/2, -z+1/2 '
A 192 '-x, -y+1/2, -z+1/2 '
B 1 'x, y, z '
B 2 '-x, -y, z '
#...
B 95 'x+1/2, y+1/2, -z+1/2 '
B 96 '-x+1/2, -y+1/2, -z+1/2 '
loop_
_atom_site.structure_id
_atom_site.label
_atom_site.type_symbol
_atom_site.fract_xyz
_atom_site.B_iso_or_equiv
_atom_site.B_iso_or_equiv_su
A Mn1 Mn+2 [0 0 0] 0.2 0.5
A Se1 Se [0.5 0.5 0.5] 0.4 0.3
B V1 V [0 0 0] 0.5 0.2
loop_
_atom_site_aniso.structure_id
_atom_site_aniso.label
_atom_site_aniso.b_11
_atom_site_aniso.b_12
_atom_site_aniso.b_13
_atom_site_aniso.b_22
_atom_site_aniso.b_23
_atom_site_aniso.b_33
A Mn1 0.2 0.05 0.08 0.2 0.03 0.2
Programs that known nothing about the existence of
_structure.id
and related fields will simply assume that all atom sites belong to the same structure, etc. Of course, this can be eventually fixed by the software maintainers, but changes like these must be very clearly communicated to the community.
Shouldn't the presence of either Set
categories being looped or a non-default _audit.schema
be some indication that something is awry?
I do agree that it must be communicated clearly, but unless you're working in multi-block CIF files, I don't think anything changes?
Shouldn't the presence of either
Set
categories being looped or a non-default_audit.schema
be some indication that something is awry?I do agree that it must be communicated clearly, but unless you're working in multi-block CIF files, I don't think anything changes?
I am looking at this from the position of traditional CIF_1.1/DDL1 software which is not aware of the _audit.schema
and expects one structure per data block. Furthermore, it is quite easy to miss a Set category that is looped when processing files in bulk unless you explicitly check for that (e.g. using a validating parser or a separate validator).
For example, the parser that we use in our COD pipeline puts looped and unlooped values in the same array data structure since it allows to access the data in a more uniform way. Under this approach, single valued items become arrays with a single element. Other programs that use this parser only deal with the first element of such arrays and would thus not notice any additional array elements. I have also encountered other CIF programs that behave in a similar way.
The points raised by @vaitkus were discussed in 2016 when developing this approach: please see here and links therein.
In practice we expect everyone to use the default _audit.schema
, in which case data blocks, even if part of a larger collection, will be correctly interpreted within the assumptions of the reading software, that is, one structure among a collection of structures will be correctly understood as a structure (e.g. correct density calculated), but its relationships to other structures in the file will be unknown.
Non-default _audit.schema
are there for the "problem children" <cough>
powder</cough>
who want to have summary blocks collecting information together. CIF authors are also motivated to make sure that their outputs are readable by software, so I anticipate non-default _audit.schema
appearing only in very controlled circumstances.
Just to confirm @rowlesmr 's example above demonstrates exactly how it should work.
A need has been expressed to refer to alternative structural descriptions, e.g. for a child structure to refer to a parent structure, where the child structure may be the result of a particular symmetry mode changing. This means we need a "structure id" from which the space group, cell and atomic sites can be obtained.
Currently we have the
STRUCTURE
category, which is purely a housekeeping construct (no datanames of its own) and includes things likeATOM_ANALYTICAL
andREFINE
which are not directly related to describing the structure.So I think we need to repurpose
STRUCTURE
(as it is not "used" as such) to be that structure description category, add an "id" data name to it, and reparent those categories that don't belong in the newSTRUCTURE
to the top-level category until such time as we discover they also need to be grouped.Originally came up in https://github.com/COMCIFS/Powder_Dictionary/issues/164.