Auto-generate `Set` key values?

rowlesmr commented 8 months ago

From discussion within https://github.com/COMCIFS/cif_core/pull/445 and mentioned in https://github.com/COMCIFS/MultiBlock_Dictionary/pull/6#issuecomment-1764741476

I posit that if a Set category (or single packet Loop) key datavalue doesn't exist in a given datablock, then it is safe to autogenerate one.

I believe that this would make most of the multiblock dictionary transparent to the end-user, as there is no requirement to give (for example) _structure.id or _space_group.id values if you're not availing yourself of their features.

James does say

I think that would be a harmless rule to have. Maybe not as useful as it looks, because somewhere else that item will be referenced, otherwise there's no point to having it identified in the first place. So then that "other place" will need to know what the ID is. For example, if a parent structure wants to reference a child structure, or a phase wants to identify what structure it has.

but in this case, you're actually using multiblock features, and so you need to define them.

.

An example of my thinking is:

data_block
  _structure.id A
  _space_group.IT_number 62
  _cell.length_a 12.3456

_structure.id has been explicitly defined, so I must assume it could be used somewhere else. _space_group.id has not been defined, so I can give it some arbitrary value, as there is no other way to link space group information between datablocks*. This same arbitrary value is then also given to _structure.space_group_id, linking** together structure A with space group SOME_GUID_VALUE. The key of CELL is _cell.structure_id, so this takes the value A, as it is in the same block as an explicitly defined _structure.id.

* Unless there are some AUDIT_BLOCK shenanigans you could do...

** as _structure.space_group_id is literally a Link to _space_group.id.

jamesrhester commented 8 months ago

Yes, I agree that this is safe to do for single-datablock datasets. Indeed, we have to explicitly assume this as otherwise we're making all existing data blocks invalid due to lack of a key data name for the Set categories.

rowlesmr commented 8 months ago

What about for a cif file with multiple blocks? I argue that it is still safe, as it emulates current cif behaviour, where each block is a distinct structure.

jamesrhester commented 8 months ago

Yes. If it is not asserted that these blocks belong together, then no problem, they are single-block datasets collected into one file.

If it is asserted that they belong together, then using UUIDs for the missing id values before merging them would reflect the actual semantic disconnectedness of the blocks.

rowlesmr commented 8 months ago

Yes. If it is not asserted that these blocks belong together, then no problem, they are single-block datasets collected into one file.

Random collection of blocks, autogenerate to your heart's content. ✓

If it is asserted that they belong together, then using UUIDs for the missing id values before merging them would reflect the actual semantic disconnectedness of the blocks.

Non-random collection of blocks, autogenerate is good, as the CIF creator would have included keys if they meant to link things?

rowlesmr commented 8 months ago

A pathological question:

# if

data_block1
  _structure.id A
  _cell.length_a 12.3456
data_block2
  _structure.id A
  _cell.length_b 12.3456
data_block3
  _structure.id A
  _cell.length_c 12.3456
data_block4
  _structure.id A
  _cell.angle_alpha 90
data_block5
  _structure.id A
  _cell.angle_beta 90
data_block6
  _structure.id A
  _cell.angle_gamma 90

# becomes

data_merged
loop_
  _cell.structure_id
  _cell.length_a
  _cell.length_b       
  _cell.length_c       
  _cell.angle_alpha    
  _cell.angle_beta     
  _cell.angle_gamma  
A 12.3456 12.3456 12.3456 90 90 90

#what should

data_block11
  _audit.block_code GUID_11
  loop_
  _audit_link.block_code
  GUID_22 GUID_33 GUID_44 GUID_55 GUID_66
  _cell.length_a 12.3456
data_block22
  _audit.block_code GUID_22
  loop_
  _audit_link.block_code
  GUID_11 GUID_33 GUID_44 GUID_55 GUID_66
  _cell.length_b 12.3456
data_block33
  _audit.block_code GUID_33
  loop_
  _audit_link.block_code
  GUID_11 GUID_22 GUID_44 GUID_55 GUID_66
  _cell.length_c 12.3456
data_block44
  _audit.block_code GUID_44
  loop_
  _audit_link.block_code
  GUID_11 GUID_22 GUID_33 GUID_55 GUID_66
  _cell.angle_alpha 90
data_block55
  _audit.block_code GUID_55
  loop_
  _audit_link.block_code
  GUID_11 GUID_22 GUID_33 GUID_44 GUID_66
  _cell.angle_beta 90
data_block66
  _audit.block_code GUID_66
  loop_
  _audit_link.block_code
  GUID_11 GUID_22 GUID_33 GUID_44 GUID_55
  _cell.angle_gamma 90

# become ?

Maybe this?

data_merged1 loop_ _cell.structure_id _cell.length_a _cell.length_b _cell.length_c _cell.angle_alpha _cell.angle_beta _cell.angle_gamma STRUCTURE_GUID_A 12.3456 . . . . . STRUCTURE_GUID_B . 12.3456 . . . . STRUCTURE_GUID_C . . 12.3456 . . . STRUCTURE_GUID_D . . . 90 . . STRUCTURE_GUID_E . . . . 90 . STRUCTURE_GUID_F . . . . . 90

jamesrhester commented 8 months ago

Lacking any structure_id data names, the merge that you suggest is reasonable, as would be a merge like your first example. This is why the audit_link scheme (or another block pointer scheme) is deficient: it says that blocks belong together, but it doesn't provide a machine-actionable general specification for how they should be combined.

COMCIFS / MultiBlock_Dictionary

Auto-generate `Set` key values? #7