COMCIFS / cif_core

The IUCr CIF core dictionary

Clarify the semantics of the `_alias.dictionary_uri` attribute #481

Closed vaitkus closed 2 months ago

vaitkus commented 4 months ago

The current version of the DDLm reference dictionary defines the _alias.dictionary_uri attribute as:

Absolute URI of dictionary to which the aliased definition belongs.

However, the definition is imprecise enough to prevent this attribute from being used effectively in automated processing. For example, the same data name may appear in several versions of the same dictionary, sometimes with slightly different semantics (or at least different human-readable definitions). I suggest we clarify the definition along the lines of:

Absolute URI of the dictionary to which the aliased definition belongs.
The URI should preferably point to the latest version of the dictionary
that is known to contain the most semantically similar definition.

The main benefit of this would be that we could run automated checks from time to time to see if the definitions from different dictionaries (e.g. mmCIF and CIF_CORE) still match up to the specified level (e.g. we might not require the human-readable definitions to match verbatim, but having the same enumeration ranges would be great). I specifically used "should" in the reformulated definition to indicate that the definitions may become out of sync from time to time, but that one should strive to get them in sync when possible.
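As a rough illustration of the kind of automated check meant here, the sketch below compares only the enumeration ranges of two definitions. It assumes ranges are written in the `min:max` style with either bound optional; the function names are hypothetical and not part of any existing tool.

```python
from typing import Optional, Tuple

# (minimum, maximum); None means the bound is open.
Range = Tuple[Optional[float], Optional[float]]


def parse_range(text: str) -> Range:
    """Parse a range string such as '0.0:' or '1:10' (assumed syntax)."""
    low, _, high = text.partition(":")
    return (float(low) if low else None, float(high) if high else None)


def ranges_match(core_range: str, aliased_range: str) -> bool:
    """Return True when both definitions allow the same numeric range."""
    return parse_range(core_range) == parse_range(aliased_range)


if __name__ == "__main__":
    # Same bounds written differently still match.
    print(ranges_match("0.0:", "0:"))
    # A raised minimum is reported as a mismatch.
    print(ranges_match("0:", "1:"))
```

A real check would first have to extract the ranges from both dictionaries, which is not shown here.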

Alternatively, we may anchor the URI to a different dictionary version, e.g. the version that originally defined the data item.

jamesrhester commented 4 months ago

I think our modest ambition for the 'alias' attributes is to automatically recognise different variants of a data name, such that software can successfully use aliases interchangeably. Any software-relevant semantic differences would require a new data name to be defined, where we might be tolerant in practice to changes that have no practical effect (such as changing the minimum value from 0 to 1, if we know that nobody has used 0). So, from the point of view of software, the particular version pointed to by the URI shouldn't matter.
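A minimal sketch of that use case, assuming the software has already read each definition's aliases into a plain mapping (the input structure and function names are illustrative only):

```python
from typing import Dict, Iterable


def build_alias_index(definitions: Dict[str, Iterable[str]]) -> Dict[str, str]:
    """Map every data name and alias (lower-cased) to its canonical data name."""
    index: Dict[str, str] = {}
    for name, aliases in definitions.items():
        canonical = name.lower()
        index[canonical] = canonical
        for alias in aliases:
            index[alias.lower()] = canonical
    return index


def canonical_name(index: Dict[str, str], data_name: str) -> str:
    """Return the canonical name for a data name; unknown names pass through."""
    key = data_name.lower()
    return index.get(key, key)


if __name__ == "__main__":
    index = build_alias_index({"_cell.length_a": ["_cell_length_a"]})
    print(canonical_name(index, "_CELL_LENGTH_A"))
```

Note that the lookup never consults the dictionary URI or version, which is the point being made: for interchangeable aliases the version should not matter to software.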

Also, it becomes difficult for dictionary writers to make the 'semantically similar' judgement.

I would be in favour therefore of simply pointing to the latest known version that defined the alias, so the wording would be:

Absolute URI of the latest version of the dictionary containing the aliased definition.

vaitkus commented 4 months ago

> Absolute URI of the latest version of the dictionary containing the aliased definition.

Seems OK to me. I created PR #485 to address this and will update the draft PR #483 accordingly.

But this also got me thinking that including the full URI for each alias seems like a significant duplication of data. The same two or three URIs will be repeated in almost all definitions (imagine one of the dictionaries moving to a different location). Would it make sense to (eventually) introduce something like _alias.dictionary_id <-> ALIAS_DICTIONARY (id, uri, version, etc.)?
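To make the idea concrete, here is a rough sketch of the normalised layout as plain Python data. The identifiers, the `example.org` URIs and the version string are placeholders invented for illustration; no such category exists in DDLm today.

```python
# Hypothetical ALIAS_DICTIONARY category: one row per external dictionary.
ALIAS_DICTIONARY = {
    "other": {
        "uri": "https://example.org/dictionaries/other.dic",
        "version": "1.0",
    },
}

# Each alias stores only the short identifier instead of the full URI.
ALIASES = [
    {"definition_id": "_example.alias_one", "dictionary_id": "other"},
    {"definition_id": "_example.alias_two", "dictionary_id": "other"},
]


def alias_uri(alias: dict) -> str:
    """Resolve an alias to its dictionary URI via the shared table."""
    return ALIAS_DICTIONARY[alias["dictionary_id"]]["uri"]


if __name__ == "__main__":
    # Relocating the dictionary is a single edit to the shared table.
    ALIAS_DICTIONARY["other"]["uri"] = "https://example.org/new/other.dic"
    print([alias_uri(alias) for alias in ALIASES])
```

The cost is the extra lookup in `alias_uri`, which every reader of the dictionary would have to perform.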

jamesrhester commented 3 months ago

> Would it make sense to (eventually) introduce something like _alias.dictionary_id <-> ALIAS_DICTIONARY (id, uri, version, etc.)?

In theory this would be nice (more normalised as the DB people like to say) but in practice it just means programmers have to write in an extra step of indirection and readers have to scroll around. Global search and replace makes editing multiple entries trivial.

However, even doing it that way, we have created a real workload for ourselves if we are trying to keep up with wwPDB latest version, so we might want to finesse our definition to state that it is "Absolute URI of a version of the dictionary containing the latest version of the aliased definition." so that if the text of the definition doesn't change (which is true for 99% of the PDB definitions that we alias) then we don't have to update our definition either.

vaitkus commented 3 months ago

> However, even doing it that way, we have created a real workload for ourselves if we are trying to keep up with wwPDB latest version, so we might want to finesse our definition to state that it is "Absolute URI of a version of the dictionary containing the latest version of the aliased definition." so that if the text of the definition doesn't change (which is true for 99% of the PDB definitions that we alias) then we don't have to update our definition either.

I see your point. I would further update the proposed phrasing to: "Absolute URI of a version of the dictionary containing the latest compatible version of the aliased definition."

But this latest round of discussions also made me realise that we might not always have URIs for specific dictionary versions (e.g. PDB does not seem to provide them for the mmCIF/PDBx dictionaries). I therefore propose the following approach:

  1. Add a new _alias.dictionary_version attribute with the following definition:
    A version identifier of the dictionary that includes the latest compatible version of the aliased definition.
  2. Change the _alias.dictionary_uri attribute definition to:
    An absolute URI of the dictionary that includes the aliased definition. The URI should preferably point
    to the specific version of the dictionary identified by the _alias.dictionary_version attribute; however,
    when this is not possible, a more general URI, e.g. one that always points to the latest version of the
    dictionary, can be provided instead.
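A minimal sketch of how software might hold the two attributes together and decide when an alias is due for another comparison. The `Alias` record and `needs_recheck` are hypothetical names, and the rule they implement is only one possible reading of the proposal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alias:
    """One alias entry; both dictionary attributes are treated as optional."""
    definition_id: str
    dictionary_uri: Optional[str] = None      # may be version-specific or general
    dictionary_version: Optional[str] = None  # latest compatible version


def needs_recheck(alias: Alias, latest_version: str) -> bool:
    """Flag aliases whose recorded version is missing or differs from the latest one."""
    if alias.dictionary_version is None:
        return True
    return alias.dictionary_version != latest_version


if __name__ == "__main__":
    alias = Alias(
        definition_id="_example.alias_one",           # placeholder name
        dictionary_uri="https://example.org/latest",  # placeholder general URI
        dictionary_version="1.0",
    )
    print(needs_recheck(alias, latest_version="1.1"))
```

Because the version is recorded separately, the check still works when the URI always points to the latest release.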

If you are OK with this approach, I can update PR #482 to reflect these changes. What do you think?

jamesrhester commented 3 months ago

Yes, I agree with these suggestions.