Open rowlesmr opened 8 months ago
Yes, I agree that this is safe to do for single-datablock datasets. Indeed, we have to explicitly assume this as otherwise we're making all existing data blocks invalid due to lack of a key data name for the Set
categories.
What about for a cif file with multiple blocks? I argue that it is still safe, as it emulates current cif behaviour, where each block is a distinct structure.
Yes. If it is not asserted that these blocks belong together, then no problem, they are single-block datasets collected into one file.
If it is asserted that they belong together, then using UUIDs for the missing id values before merging them would reflect the actual semantic disconnectedness of the blocks.
Yes. If it is not asserted that these blocks belong together, then no problem, they are single-block datasets collected into one file.
Random collection of blocks, autogenerate to your heart's content. ✓
If it is asserted that they belong together, then using UUIDs for the missing id values before merging them would reflect the actual semantic disconnectedness of the blocks.
Non-random collection of blocks, autogenerate is good, as the CIF creator would have included keys if they meant to link things?
A pathological question:
# if
data_block1
_structure.id A
_cell.length_a 12.3456
data_block2
_structure.id A
_cell.length_b 12.3456
data_block3
_structure.id A
_cell.length_c 12.3456
data_block4
_structure.id A
_cell.angle_alpha 90
data_block5
_structure.id A
_cell.angle_beta 90
data_block6
_structure.id A
_cell.angle_gamma 90
# becomes
data_merged
loop_
_cell.structure_id
_cell.length_a
_cell.length_b
_cell.length_c
_cell.angle_alpha
_cell.angle_beta
_cell.angle_gamma
A 12.3456 12.3456 12.3456 90 90 90
#what should
data_block11
_audit.block_code GUID_11
loop_
_audit_link.block_code
GUID_22 GUID_33 GUID_44 GUID_55 GUID_66
_cell.length_a 12.3456
data_block22
_audit.block_code GUID_22
loop_
_audit_link.block_code
GUID_11 GUID_33 GUID_44 GUID_55 GUID_66
_cell.length_b 12.3456
data_block33
_audit.block_code GUID_33
loop_
_audit_link.block_code
GUID_11 GUID_22 GUID_44 GUID_55 GUID_66
_cell.length_c 12.3456
data_block44
_audit.block_code GUID_44
loop_
_audit_link.block_code
GUID_11 GUID_22 GUID_33 GUID_55 GUID_66
_cell.angle_alpha 90
data_block55
_audit.block_code GUID_55
loop_
_audit_link.block_code
GUID_11 GUID_22 GUID_33 GUID_44 GUID_66
_cell.angle_beta 90
data_block66
_audit.block_code GUID_66
loop_
_audit_link.block_code
GUID_11 GUID_22 GUID_33 GUID_44 GUID_55
_cell.angle_gamma 90
# become ?
Lacking any structure_id
data names, the merge that you suggest is reasonable, as would be a merge like your first example. This is why the audit_link
scheme (or another block pointer scheme) is deficient: it says that blocks belong together, but it doesn't provide a machine-actionable general specification for how they should be combined.
From discussion within https://github.com/COMCIFS/cif_core/pull/445 and mentioned in https://github.com/COMCIFS/MultiBlock_Dictionary/pull/6#issuecomment-1764741476
I posit that if a
Set
category (or single packet Loop) key datavalue doesn't exist in a given datablock, then it is safe to autogenerate one.I believe that this would make most of the multiblock dictionary transparent to the end-user, as there is no requirement to give (for example)
_structure.id
or_space_group.id
values if you're not availing yourself of their features.James does say
but in this case, you're actually using multiblock features, and so you need to define them.
.
An example of my thinking is:
_structure.id
has been explicitly defined, so I must assume it could be used somewhere else._space_group.id
has not been defined, so I can give it some arbitrary value, as there is no other way to link space group information between datablocks*. This same arbitrary value is then also given to_structure.space_group_id
, linking** together structureA
with space groupSOME_GUID_VALUE
. The key ofCELL
is_cell.structure_id
, so this takes the valueA
, as it is in the same block as an explicitly defined_structure.id
.* Unless there are some
AUDIT_BLOCK
shenanigans you could do...** as
_structure.space_group_id
is literally aLink
to_space_group.id
.