Id-like dataname questions?

I've been looking at category keys for various reasons, and have happened upon some questions:

DIFFRN_REFLN: Keyed on _diffrn_refln.hkl, which is a Matrix of hkl values. Other categories (e.g. DIFFRN_ORIENT_REFLN) are keyed on the three indices individually.
- should this be changed?
CHEMICAL_CONN_BOND: Keyed on .atom_1 and .atom_2, but also has .id, described as a "Unique identifier for the bond.". The .id dataname isn't referred to anywhere else in core. The same goes for GEOM_ANGLE, GEOM_BOND, GEOM_CONTACT, GEOM_HBOND, GEOM_TORSION, and MODEL_SITE. Some of these are understandable, as there are many key datanames (looking at you, GEOM_TORSION).
- Should having a single, unique, non-key identifier be a policy where there exists more than one dataname in a category key?

All of the REFLN-type categories should be keyed by a separate id type, as we cannot guarantee that hkl are unique. This is a real problem for modulated structures (hklmnop...) and raw data (same peak collected more than once). I've been putting this off, but it needs to be discussed and done.
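
To make the raw-data case concrete, here is a minimal Python sketch of what goes wrong when rows are keyed on hkl alone. This is not CIF tooling: the reflection values, the `id` field and the `index_by` helper are all made up for illustration.

```python
# Hypothetical raw-data rows: the (1, 0, 0) peak was collected twice.
rows = [
    {"id": "1", "hkl": (1, 0, 0), "intensity": 1520.0},
    {"id": "2", "hkl": (0, 1, 0), "intensity": 310.5},
    {"id": "3", "hkl": (1, 0, 0), "intensity": 1498.2},
]


def index_by(rows, key):
    """Build a lookup table on `key`; raise if its values are not unique."""
    table = {}
    for row in rows:
        value = row[key]
        if value in table:
            raise ValueError(f"duplicate key value {value!r}")
        table[value] = row
    return table


# A separate (synthetic) id is unique by construction.
by_id = index_by(rows, "id")

# hkl is not: the repeated measurement collides with the first one.
try:
    by_hkl = index_by(rows, "hkl")
    hkl_is_key = True
except ValueError:
    hkl_is_key = False
```

With a separate id, both measurements of the same peak stay addressable, and hkl becomes an ordinary (non-key) attribute of the row.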

chemical_conn_bond et al: the references to id are leftovers from an earlier draft in which such identifiers existed. They may be deleted.

As to the more general question of when to create such "synthetic" identifiers, there is no clear-cut answer. The original DDLm vision always had a single id for every Loop category, to make dREL of the form category[keyval] resolve. We've expanded the dREL rules so that multi-key-data-name categories will still resolve economically.

I think the practical answer is that if rows in a category will be linked to from other categories, then to avoid data name proliferation a synthetic identifier is worth creating. So, for example, the topology dictionary needs to identify nodes that are joined into a net, where a node might need an atomic label, symmetry operation id, and three lattice translations in order to identify it. The loop listing the nodes in a particular net could either refer to a synthetic node_id, or use five child data names of the above items to refer to a node - so, clearly creating a node_id is worthwhile.
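
As a rough illustration of the trade-off, the two ways of listing the nodes in a net can be sketched in Python. All names here (`NodeKey`, `t_a`/`t_b`/`t_c`, `net_id`, the `Zn1` label and the row values) are hypothetical and are not taken from the topology dictionary.

```python
from typing import NamedTuple


class NodeKey(NamedTuple):
    """The five items that together identify a node."""
    atom_label: str
    symop_id: int
    t_a: int  # lattice translation along a
    t_b: int  # lattice translation along b
    t_c: int  # lattice translation along c


# Node category keyed on a synthetic node_id.
nodes = {
    "n1": NodeKey("Zn1", 1, 0, 0, 0),
    "n2": NodeKey("Zn1", 2, 1, 0, 0),
}

# Option A: the net-membership loop repeats all five items as child data names.
net_nodes_composite = [
    {"net_id": "A", "atom_label": "Zn1", "symop_id": 1, "t_a": 0, "t_b": 0, "t_c": 0},
    {"net_id": "A", "atom_label": "Zn1", "symop_id": 2, "t_a": 1, "t_b": 0, "t_c": 0},
]

# Option B: the loop refers to the synthetic node_id only.
net_nodes_by_id = [
    {"net_id": "A", "node_id": "n1"},
    {"net_id": "A", "node_id": "n2"},
]


def resolve(row):
    """Follow a node_id reference back to the full node description."""
    return nodes[row["node_id"]]


# Both options carry the same information about the second node.
second = resolve(net_nodes_by_id[1])
same = second == NodeKey(*(net_nodes_composite[1][f] for f in NodeKey._fields))
```

Every additional category that links to a node pays the five-name cost again under option A, which is where the data name proliferation comes from.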

The hkl problem is a little different - the issue here is not data name proliferation, but that items with a physical meaning are used as identifiers, opening us up to possible duplication (i.e. they are no longer a key) as the science develops. The three lattice translations used to identify a node in the previous paragraph are also bad in this sense, as modulated structures need to specify lattice translations in a different way. Hmm.

COMCIFS / cif_core

Id-like dataname questions? #462