COMCIFS / MultiBlock_Dictionary

Definitions describing data stored in multiple containers

add structure.id #6

Closed rowlesmr closed 3 months ago

rowlesmr commented 8 months ago

Will close #3

I redid the PR, as I wanted to work through things in my head.

There are now:

_cell.diffrn_id and _cell_measurement.diffrn_id have been removed.

Modulated and magnetic structures need to be looked at.

rowlesmr commented 8 months ago

An example.

I need to make up others with separate cell measurement conditions, and magnetic and modulated structures.

Please note that (AFAIK) you don't need to explicitly give _structure.id or _space_group.id in this example, as all the relevant information is in a single block for each structure. I put the diffraction conditions in a separate block to show how you can link with _structure.diffrn_id.

This links in with my idea (which I can't find the comment for) where single Set keys are autogenerated if they don't exist. (got it: https://github.com/COMCIFS/cif_core/pull/445)

###################################
#
#  Beginning of the CIF file
#
###################################

data_conditions
_diffrn.id DIFCON_1
_diffrn.ambient_temperature  900

data_blockname_one
_structure.id A   #Must be unique.
_structure.diffrn_id  DIFCON_1

_cell.length_a            5.4469
_cell.length_b            5.4469
_cell.length_c            5.4469
_cell.angle_alpha        90
_cell.angle_beta         90
_cell.angle_gamma        90
_cell.volume            161.61
_cell.formula_units_Z     4         

_space_group.id 1 #Must be unique. Can be the same if representing the same space group in the same setting
_space_group.crystal_system   cubic
_space_group.name_H-M_alt     Fm-3m

loop_
  _space_group_symop.id
  _space_group_symop.operation_xyz
    1 'x, y, z '
    2 '-x, -y, z '
  #...
  191 'x, y+1/2, -z+1/2 '
  192 '-x, -y+1/2, -z+1/2 '

loop_
  _atom_site.label
  _atom_site.type_symbol
  _atom_site.fract_xyz
  _atom_site.B_iso_or_equiv
Mn1 Mn+2 [0   0   0]   0.2
Se1 Se   [0.5 0.5 0.5] 0.4

loop_
  _atom_site_aniso.label
  _atom_site_aniso.b_11
  _atom_site_aniso.b_12
  _atom_site_aniso.b_13
  _atom_site_aniso.b_22
  _atom_site_aniso.b_23
  _atom_site_aniso.b_33
Mn1 0.2 0.05 0.08 0.2 0.03 0.2

data_blockname2
_structure.id B   #Must be unique.
_structure.diffrn_id  DIFCON_1

_cell.length_a            3.0205
_cell.length_b            3.0205
_cell.length_c            3.0205
_cell.angle_alpha        90
_cell.angle_beta         90
_cell.angle_gamma        90
_cell.volume             27.558
_cell.formula_units_Z     2        

_space_group.id 2 #Must be unique. Can be the same if representing the same space group in the same setting
_space_group.crystal_system cubic
_space_group.name_H-M_alt Im-3m
loop_
  _space_group_symop.id
  _space_group_symop.operation_xyz
    1 'x, y, z '
    2 '-x, -y, z '
   #...
   95 'x+1/2, y+1/2, -z+1/2 '
   96 '-x+1/2, -y+1/2, -z+1/2 '

loop_
  _atom_site.label
  _atom_site.type_symbol
  _atom_site.fract_xyz
  _atom_site.B_iso_or_equiv
V1 V [0 0 0] 0.5

###################################
#
#  End of the CIF file
#
###################################

As per PR (AFAIK)

###################################
#
#  A representation of a merged datablock.
#  It shouldn't actually be used this way to construct
#  a CIF file, but maps out how the relational tables 
#  would be populated.
#
###################################

#this is the merged datablock assuming _structure.space_group_id exists
data_merged_hester
loop_
  _diffrn.id
  _diffrn.ambient_temperature
DIFCON_1  900  

loop_
  _structure.id
  _structure.space_group_id
  _structure.diffrn_id
A 1 DIFCON_1
B 2 DIFCON_1

loop_
  _cell.structure_id
  _cell.length_a
  _cell.length_b       
  _cell.length_c       
  _cell.angle_alpha    
  _cell.angle_beta     
  _cell.angle_gamma    
  _cell.volume         
  _cell.formula_units_Z  
A  5.4469 5.4469 5.4469 90 90 90 161.61  4        
B  3.0205 3.0205 3.0205 90 90 90  27.558 2        

# if both structures had the same SG, then you only need to include the one SG
loop_
  _space_group.id
  _space_group.crystal_system     
  _space_group.name_H-M_alt     
1 cubic Fm-3m 
2 cubic Im-3m

loop_
  _space_group_symop.space_group_id
  _space_group_symop.id
  _space_group_symop.operation_xyz
1   1 'x, y, z '
1   2 '-x, -y, z '
#...
1 191 'x, y+1/2, -z+1/2 '
1 192 '-x, -y+1/2, -z+1/2 '
2   1 'x, y, z '
2   2 '-x, -y, z '
#...
2  95 'x+1/2, y+1/2, -z+1/2 '
2  96 '-x+1/2, -y+1/2, -z+1/2 '

loop_
  _atom_site.structure_id
  _atom_site.label
  _atom_site.type_symbol
  _atom_site.fract_xyz
  _atom_site.B_iso_or_equiv
A Mn1 Mn+2 [0   0   0]   0.2 
A Se1 Se   [0.5 0.5 0.5] 0.4 
B V1  V    [0   0   0]   0.5 

loop_
  _atom_site_aniso.structure_id
  _atom_site_aniso.label
  _atom_site_aniso.b_11
  _atom_site_aniso.b_12
  _atom_site_aniso.b_13
  _atom_site_aniso.b_22
  _atom_site_aniso.b_23
  _atom_site_aniso.b_33
A Mn1 0.2 0.05 0.08 0.2 0.03 0.2
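
To make the mapping from separate blocks to merged tables concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the `blocks` dictionary is a hypothetical, cut-down stand-in for parsed CIF blocks (no real CIF parser is used), the function names are made up, and `_structure.space_group_id` is assumed to exist, as in the merged block above.

```python
# Hypothetical, simplified stand-in for parsed CIF blocks: each block maps
# data names to single values. Real software would use a CIF parser instead.
blocks = {
    "blockname_one": {
        "_structure.id": "A",
        "_structure.diffrn_id": "DIFCON_1",
        "_space_group.id": "1",
        "_cell.length_a": "5.4469",
        "_cell.volume": "161.61",
    },
    "blockname2": {
        "_structure.id": "B",
        "_structure.diffrn_id": "DIFCON_1",
        "_space_group.id": "2",
        "_cell.length_a": "3.0205",
        "_cell.volume": "27.558",
    },
}


def merge_structure_rows(blocks):
    """Build one STRUCTURE row per block, linking space group and diffrn ids."""
    rows = []
    for block in blocks.values():
        rows.append({
            "_structure.id": block["_structure.id"],
            # Assumed data name, as in the merged block above.
            "_structure.space_group_id": block["_space_group.id"],
            "_structure.diffrn_id": block["_structure.diffrn_id"],
        })
    return rows


def merge_cell_rows(blocks):
    """Build CELL rows keyed by _cell.structure_id from per-block values."""
    rows = []
    for block in blocks.values():
        row = {"_cell.structure_id": block["_structure.id"]}
        row.update({k: v for k, v in block.items() if k.startswith("_cell.")})
        rows.append(row)
    return rows
```

Each single-valued category in a block becomes one row of the merged loop, with the block's `_structure.id` supplying the extra key column.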

jamesrhester commented 8 months ago

That example does demonstrate exactly how I imagine this working.

vaitkus commented 8 months ago

I suggest that we include the example given in (https://github.com/COMCIFS/MultiBlock_Dictionary/pull/6#issuecomment-1764741476) as distinct CIF files in the PR. This will definitely be useful, since people are already asking for usage examples.

Furthermore, I have a comment on the following statement given in the example :

_space_group.id 1 #Must be unique. Can be the same if representing the same space group in the same setting

I think that having the same setting is not sufficient. For space groups to have the same identifier, their symmetry operations in the SPACE_GROUP loop must be listed with the same symop ids, since these ids are later used to specify symmetry operations in data items like _geom_bond.site_symmetry_1.

Consider the following example:

data_merged
# ...
loop_
_space_group.id
_space_group.name_H-M_alt
1 'P 1 21/m 1'
2 'P 1 21/m 1'
# ...
loop_
_space_group_symop.space_group_id
_space_group_symop.id
_space_group_symop.operation_xyz
1 1 x,y,z
1 2 -x,y+1/2,-z
1 3 -x,-y,-z
1 4 x,-y+1/2,z
2 1 x,y,z
2 2 -x,y+1/2,-z
2 3 x,-y+1/2,z
2 4 -x,-y,-z

loop_
_geom_bond.atom_site_label_1
_geom_bond.atom_site_label_2
_geom_bond.space_group_id # not currently defined
_geom_bond.site_symmetry_1
_geom_bond.site_symmetry_2
_geom_bond.distance
C2 C3  1 1_555 3_555 1.44

Semantically, the two space groups are identical (same name, same number, same setting, same symmetry operations), but due to the different ids assigned to the symops, they have to retain distinct ids.

Furthermore, _geom_bond.space_group_id in the GEOM_BOND loop should probably be replaced by _geom_bond.structure_id, but this assumes, that the proper structure-to-space-group relationship is defined in the STRUCTURE loop.
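
A rough Python sketch of the lookup chain this implies (structure, then space group, then symop) is given here. The symop table is copied from the example above; the STRUCTURE rows `S1` and `S2` and the function name are hypothetical, added only for illustration. The translation part of the code is decoded with the usual convention that each digit is offset by 5.

```python
# Hypothetical STRUCTURE rows: _structure.id -> _structure.space_group_id
structure = {"S1": "1", "S2": "2"}

# (_space_group_symop.space_group_id, _space_group_symop.id) -> operation_xyz,
# copied from the example above.
symops = {
    ("1", "1"): "x,y,z",
    ("1", "2"): "-x,y+1/2,-z",
    ("1", "3"): "-x,-y,-z",
    ("1", "4"): "x,-y+1/2,z",
    ("2", "1"): "x,y,z",
    ("2", "2"): "-x,y+1/2,-z",
    ("2", "3"): "x,-y+1/2,z",
    ("2", "4"): "-x,-y,-z",
}


def resolve_site_symmetry(structure_id, code):
    """Return (operation_xyz, lattice translation) for a code such as '3_555'."""
    symop_id, _, translation = code.partition("_")
    space_group_id = structure[structure_id]
    operation = symops[(space_group_id, symop_id)]
    # Each digit of the translation part is offset by 5, so '555' means (0, 0, 0).
    if translation:
        shift = tuple(int(digit) - 5 for digit in translation)
    else:
        shift = (0, 0, 0)
    return operation, shift
```

With these tables, the code `3_555` resolves to `-x,-y,-z` for structure `S1` but to `x,-y+1/2,z` for structure `S2`, which is exactly why the two space groups cannot share an id.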

I do not think that we can achieve a more elegant solution within the constraints of the relational model, since items like _geom_bond.structure_id prevent normalisation, but we need to be sure to properly communicate such gotchas to the users. Maybe it would make sense to describe the criteria required for two space groups to share the same space group id in the definition of the _space_group.id data item?

rowlesmr commented 8 months ago

That sounds doable. A space group is the same iff it has the same name, number, setting, and symops in the same order. I can add this to the category description.

I am also putting together some example structures and multi block cifs.

I agree that _geom_*.structure_id is the correct key to add, as the GEOM category is described as giving model information about the structure.

This also brings up the question of the correct key for MODEL: should it be _model.structure_id? Should GEOM* have a .model_id as a key instead? At this point in time, I don't think so, as (iirc) MODEL is empty, but if we add non-refinement things, it may become non-empty.

rowlesmr commented 8 months ago

As mentioned in a comment to #3, _cell.diffrn_id and _cell_measurement.diffrn_id should stay but are no longer key data names.

They already exist in core. The multiblock just alters the key dataname. I take this to mean that they remain in the dictionary, so no need to redefine them?

jamesrhester commented 8 months ago

> As mentioned in a comment to #3, _cell.diffrn_id and _cell_measurement.diffrn_id should stay but are no longer key data names.
>
> They already exist in core. The multiblock just alters the key dataname. I take this to mean that they remain in the dictionary, so no need to redefine them?

I think that they should be moved to the multiblock dictionary, after which they will be removed from the core dictionary. This is because these data names have no use in the single-data-block paradigm (you can't refer to a _diffrn.id that is not the same as the current data block).

jamesrhester commented 8 months ago

> That sounds doable. A space group is the same iff it has the same name, number, setting, and symops in the same order. I can add this to the category description.

Note that this is (mostly) just a particular case of the general rule that "if you repeat key data name values in different blocks, the rest of the values in the row must be identical". I don't think it warrants special mention in the category definition, but is worth pointing out to programmers as it is a useful consistency check. space_group_symop is a little special, because we have to autogenerate the symop numbers for legacy files. Anybody writing software now (and thus reading the dictionary) will provide symop ids, so I doubt mentioning this in the core dictionary will do anything except confuse readers.

And I'd say that a space group is the same if the items in the space_group category are the same for the same values of the key data name. That is, changing the order of the symops does not change the space group (just like in real life). What it does is change the identity of a symop that belongs to the space group.
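
The general rule quoted above can be turned into a simple consistency check. The following Python sketch is only one possible reading of it: the function name and the input rows are hypothetical, and it compares whole rows, whereas real software might need to tolerate items that are present in one block and absent in another.

```python
def find_key_conflicts(rows, key):
    """Return the key values that appear more than once with differing rows."""
    seen = {}
    conflicts = set()
    for row in rows:
        key_value = row[key]
        if key_value in seen and seen[key_value] != row:
            conflicts.add(key_value)
        # Remember the first row seen for each key value.
        seen.setdefault(key_value, row)
    return sorted(conflicts)


# Hypothetical SPACE_GROUP rows collected from several data blocks.
space_group_rows = [
    {"_space_group.id": "1", "_space_group.name_H-M_alt": "Fm-3m"},
    {"_space_group.id": "1", "_space_group.name_H-M_alt": "Fm-3m"},  # consistent repeat
    {"_space_group.id": "2", "_space_group.name_H-M_alt": "Im-3m"},
    {"_space_group.id": "2", "_space_group.name_H-M_alt": "Fm-3m"},  # conflicting repeat
]

conflicting_ids = find_key_conflicts(space_group_rows, "_space_group.id")
```

Here only id `2` is reported, because id `1` is repeated with identical values.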

rowlesmr commented 8 months ago

> Note that this is (mostly) just a particular case of the general rule that "if you repeat key data name values in different blocks, the rest of the values in the row must be identical". I don't think it warrants special mention in the category definition, but is worth pointing out to programmers as it is a useful consistency check. space_group_symop is a little special, because we have to autogenerate the symop numbers for legacy files. Anybody writing software now (and thus reading the dictionary) will provide symop ids, so I doubt mentioning this in the core dictionary will do anything except confuse readers.

I haven't put it in the category description; it's in the _space_group.id description.

> And I'd say that a space group is the same if the items in the space_group category are the same for the same values of the key data name. That is, changing the order of the symops does not change the space group (just like in real life). What it does is change the identity of a symop that belongs to the space group.

I agree that changing the order of the symops doesn't change the symmetry, but it does change how the symmetry is represented. As @vaitkus pointed out, _geom_*.symmetry_* values point to symop id values, so a given id must always indicate the same symop. The order of the rows in the loop doesn't matter, as long as each row has the correct id. Hence I think it is worth pointing out. It is a gotcha.
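
As a sketch of that criterion, the Python below treats each symop list as a mapping from id to operation, so row order is ignored but relabelling is caught. The function names are made up for illustration, and the comparison is on normalised strings only, so equivalent spellings such as `1/2+y` and `y+1/2` would be treated as different.

```python
def normalise(operation_xyz):
    """Drop whitespace and case so 'x, y, z ' and 'x,y,z' compare equal."""
    return "".join(operation_xyz.split()).lower()


def can_share_space_group_id(symops_a, symops_b):
    """True if both lists assign the same operation to every symop id.

    Each argument is a list of (id, operation_xyz) rows; row order is ignored.
    """
    map_a = {symop_id: normalise(op) for symop_id, op in symops_a}
    map_b = {symop_id: normalise(op) for symop_id, op in symops_b}
    return map_a == map_b


# Symops from the 'P 1 21/m 1' example earlier in this thread.
first = [("1", "x,y,z"), ("2", "-x,y+1/2,-z"), ("3", "-x,-y,-z"), ("4", "x,-y+1/2,z")]
# Same ids and operations, rows listed in a different order.
reordered = list(reversed(first))
# Same operations, but ids 3 and 4 are swapped.
relabelled = [("1", "x,y,z"), ("2", "-x,y+1/2,-z"), ("3", "x,-y+1/2,z"), ("4", "-x,-y,-z")]
```

Under this check `first` and `reordered` could share a `_space_group.id`, while `first` and `relabelled` could not.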

jamesrhester commented 7 months ago

An analysis of the implications of the new STRUCTURE category for other dictionaries. I want to understand what we are creating for them by adding linked data names to categories that they also modify.

  1. Modulated structures. A modulated or composite structure is described using modulation waves and/or a series of "subsystems" that interpenetrate.
     a. Atom_Site: a subsystem id is added, along with flags for the types of modulation wave that additionally modify a particular site, but the modulation information is in a different category.
     b. cell: modulation information is added in this and other categories.
     c. space group: superspace group information is also added; the same applies to space_group_symop etc.

A number of additional categories are defined that provide more modulation information. None of these have been made formal children of the above categories (yet) and so can be ignored - it is up to the ms_dic people to decide if they want to do more.

The cell_subsystem category is interesting, as it adopts the same approach as we have with multi blocks. However, only the atom_site category explicitly contains a pointer to _cell_subsystem.code.

Conclusion: 'Structure' as we have defined it (atom_site + cell + space_group) will partially describe a superspace structure. ms_dic can completely describe a structure (in the same sense as core CIF is complete) by adding the appropriate links to _structure.id in the appropriate categories. A single structure will include all subsystems.

  2. Magnetic structures. A magnetic structure is described by the positions and atomic moments of a subset of atoms combined with a magnetic space group. The modulations and superspace groups of the modulated structures dictionary are also used. The magnetic atom_site_moment category is a formal child of atom_site, meaning that atomic moments are currently incorporated into our concept of structure.

Here we come to an important point: we can imagine a magnetic_structure category that fulfills the same role as the structure category, but for magnetism. Such a category would have a pointer to _structure.id, and then include the magnetic space group and magnetism-specific modulation waves. What concerns me (slightly) is that the moments really belong only to the magnetic structure but are swept up into the structure overall. That is a consequence of wanting to list the moments in a single loop with the atomic positions, so in that sense the decision has already been made. An alternative to a separate magnetic structure category is simply for the magnetic structure dictionary to expand the concept of structure and add magnetic information to the structure category.

The magnetism dictionary also has the idea of a parent space group, which is a non-magnetic structure that the magnetic structure is related to. This is a good use for a pointer to _structure.id.

Conclusion: the structure category is a slight misnomer in the case of magnetism, but does not create practical problems. If we want to be more universal in our naming, we could change the name to something like STRUCTURAL_MODEL.

jamesrhester commented 5 months ago

I think this PR is ready for incorporation into the multi block dictionary following the above suggested changes and perhaps a change of the category from STRUCTURE to something like STRUCTURAL_MODEL to try to convey the more general meaning. Note that version 1.0.0 of the multi-block dictionary is now available from the IUCr website.

jamesrhester commented 3 months ago

Finally merged. The examples may need improvement.