COMCIFS / cif_core

The IUCr CIF core dictionary

Id-like dataname questions? #462

Open rowlesmr opened 8 months ago

rowlesmr commented 8 months ago

I've been looking at category keys for various reasons, and have happened upon some questions:

jamesrhester commented 8 months ago

All of the REFLN-type categories should be keyed by a separate id type, as we cannot guarantee that hkl are unique. This is a real problem for modulated structures (hklmnop...) and raw data (same peak collected more than once). I've been putting this off, but it needs to be discussed and done.
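To make the uniqueness problem concrete, here is a minimal Python sketch. It is not taken from any dictionary or CIF library: the rows, the column names (`h`, `k`, `l`, `intensity`, `id`) and both helper functions are invented for illustration. It assumes raw data in which the same peak was collected twice, and shows that (h, k, l) then fails as a key while a separately assigned id does not.

```python
from collections import Counter

# Hypothetical raw-data rows: the (1, 0, 0) peak was collected twice.
reflections = [
    {"h": 1, "k": 0, "l": 0, "intensity": 1520.0},
    {"h": 1, "k": 1, "l": 0, "intensity": 310.5},
    {"h": 1, "k": 0, "l": 0, "intensity": 1498.2},  # repeat measurement
]


def duplicate_keys(rows, key_names):
    """Return the key tuples that occur in more than one row."""
    counts = Counter(tuple(row[name] for name in key_names) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


def add_synthetic_id(rows, id_name="id"):
    """Return copies of the rows with a sequential, meaning-free identifier."""
    return [{id_name: str(i), **row} for i, row in enumerate(rows, start=1)]


hkl_duplicates = duplicate_keys(reflections, ("h", "k", "l"))
keyed = add_synthetic_id(reflections)
id_duplicates = duplicate_keys(keyed, ("id",))
```

With these rows, `hkl_duplicates` contains `(1, 0, 0)` and `id_duplicates` is empty, so only the synthetic id can serve as the category key.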

As to the more general question of when to create such "synthetic" identifiers, there is no clear-cut answer. The original DDLm vision always had a single id for every Loop category, to make dREL of the form category[keyval] resolve. We've expanded the dREL rules so that multi-key-data-name categories will still resolve economically.
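As a rough analogy only (this is Python, not dREL, and it makes no claim about the actual dREL syntax for multi-key categories), the category[keyval] lookup can be pictured as an index from key values to rows. The `LoopCategory` class and the example data names below are hypothetical.

```python
class LoopCategory:
    """Toy model of a looped category whose rows are indexed by key values."""

    def __init__(self, key_names, rows):
        self.key_names = tuple(key_names)
        self._index = {}
        for row in rows:
            key = tuple(row[name] for name in self.key_names)
            if key in self._index:
                # A key must identify exactly one row.
                raise ValueError(f"duplicate key {key!r}")
            self._index[key] = row

    def __getitem__(self, keyval):
        # A single-key lookup passes a bare value; wrap it in a tuple.
        if not isinstance(keyval, tuple):
            keyval = (keyval,)
        return self._index[keyval]


# Single key data name: category[keyval]
atom_site = LoopCategory(["label"], [{"label": "C1", "type_symbol": "C"}])
symbol = atom_site["C1"]["type_symbol"]

# Several key data names: the lookup needs every key value at once.
refln = LoopCategory(["h", "k", "l"], [{"h": 1, "k": 0, "l": 0, "f_meas": 12.3}])
f_meas = refln[1, 0, 0]["f_meas"]
```

The single-id form keeps every lookup to one value, which is the appeal of the original design; the multi-key form works but each reference has to carry all the key values.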

I think the practical answer is that if rows in a category will be linked to from other categories, then a synthetic identifier is worth creating to avoid data name proliferation. So, for example, the topology dictionary needs to identify nodes that are joined into a net, where a node might need an atomic label, symmetry operation id, and three lattice translations in order to identify it. The loop listing the nodes in a particular net could either refer to a synthetic node_id, or use five child data names of the above items to refer to a node, so creating a node_id is clearly worthwhile.
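A small Python sketch of that trade-off, with invented values: the node ids, atom labels, and the `net_a` identifier are all hypothetical, and the layout is only meant to count columns, not to mirror the topology dictionary's actual data names.

```python
# Hypothetical node table: node_id -> (atom label, symop id, tx, ty, tz)
nodes = {
    "n1": ("Zn1", 1, 0, 0, 0),
    "n2": ("Zn1", 2, 1, 0, 0),
    "n3": ("O1", 1, 0, 0, -1),
}

# Option 1: the net's node list refers to the synthetic node_id.
net_by_id = [("net_a", "n1"), ("net_a", "n3")]

# Option 2: the same list, spelled out with five child data names per node.
net_by_parts = [("net_a", *nodes[node_id]) for _, node_id in net_by_id]

# Number of columns needed to point at one node under each option.
columns_by_id = len(net_by_id[0]) - 1
columns_by_parts = len(net_by_parts[0]) - 1

# Both forms identify the same nodes; the second just needs more columns.
parts_to_id = {parts: node_id for node_id, parts in nodes.items()}
resolved = [parts_to_id[row[1:]] for row in net_by_parts]
```

Here one column does the work of five, and every further category that links to a node would repeat that saving.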

The hkl problem is a little different: the issue here is not data name proliferation, but that items with a physical meaning are used as identifiers, opening us up to possible duplication (i.e. no longer a key) as the science develops. The three lattice translations used to identify a node in the previous paragraph are also bad in this sense, as modulated structures need to specify lattice translations in a different way. Hmm.