COMCIFS / cif_core

The IUCr CIF core dictionary
14 stars 9 forks source link

Add deprecation dates to DDL1 aliases #479

Closed vaitkus closed 4 months ago

vaitkus commented 4 months ago

This PR closes issue #477.

The deprecation dates were assigned to match the release dates of specific dictionary versions. For example, data name _atom_site_thermal_displace_type was marked as deprecated in 1999-03-24 since it was deprecated with the release of CIF_CORE dictionary version 2.1 released on 1999-03-24. In the case of this specific data item the internal human-readable changelog of the DDL1 dictionary does provide a more specific deprecation date (1997-10-30) which falls between the release of version 2.1 and the previous version 2.0.1 (released on 1997-01-20), however, such dates are not available to all data items so it seems reasonable to go with the dictionary release dates.

There are several open question that should be answered before merging this PR:

  1. Some of the deprecated aliases have both undotted and dotted variants (e.g. _refine_ls_R_factor_obs and _refine.ls_R_factor_obs). It was agreed that the undotted ones should be marked as deprecated, but what about the dotted ones? For example, the _refine.ls_R_factor_obs data name was never assigned in the DDL1 CIF_CORE dictionary and thus was never deprecated. Furthermore, this data name was assigned in the mmCIF dictionary [1] where it is seemingly still not deprecated. However, since the _refine_ls_R_factor_obs was deprecated mainly to replace the term obs ("observed") with gt ("greater than a specific threshold"), logically, the use of _refine.ls_R_factor_obs should also be discouraged. For now, I have marked the deprecation date for these items with an unquoted question mark.
  2. Some aliases with an unclear origin contain misleading terms. For example, data names _reflns_number_obs and _diffrn_reflns_av_sigmaI_over_netI contain terms obs and sigma, but they do not seem to have ever been defined neither in the DDL1 CIF_CORE dictionary nor in the mmCIF dictionary. I assume that they were added in DDLm. Should these names be deprecated with the current date or not deprecated at all?
  3. The DDLm reference dictionary also allows to specify the URI of the dictionary which originally defined the deprecated data name (see _alias.dictionary_uri). Should we use it here? If yes, should the URIs refer to specific dictionary versions (e.g. https://www.iucr.org/__data/iucr/cif/dictionaries/cif_core_2.4.4.dic, https://www.iucr.org/__data/iucr/cif/dictionaries/cif_core_2.2.dic ) or should they all link to the latest version of the dictionary (e.g. https://www.iucr.org/__data/iucr/cif/dictionaries/cif_core_2.4.5.dic)?

[1] https://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Items/_refine.ls_R_I_factor_obs.html

vaitkus commented 4 months ago

The detected layout issues seem to be false-positives (see https://github.com/jamesrhester/julia_cif_tools/issues/9).

jamesrhester commented 4 months ago

Thoughts on the issues raised:

  1. If a data name is not deprecated in mmCIF, we should not deprecate it in core CIF, in order to maintain compatibility
  2. Any aliased names that were never defined can be simply removed from the alias list. We could double-check COD to make sure they've never been used.
  3. I think the URIs should refer to a specific version, but I don't think this information is essential. If we are doing our job correctly, the current definition should match the deprecated definition and therefore the deprecated definition doesn't contain any new information.
jamesrhester commented 4 months ago

Layout checker fixed.

vaitkus commented 4 months ago

@jamesrhester, thank you for fixing the linter.

  • If a data name is not deprecated in mmCIF, we should not deprecate it in core CIF, in order to maintain compatibility

Ok, these are now marked with the '.' to indicate they are not deprecated.

  • Any aliased names that were never defined can be simply removed from the alias list. We could double-check COD to make sure they've never been used.

Ok, I removed the _diffrn_reflns_av_sigmaI_over_netI and _reflns_number_obs aliases since they contained deprecated terminology. The _reflns_number_obs was used in the COD, but only in 2 or 3 entries, so this can be treated as a typo.

  • I think the URIs should refer to a specific version, but I don't think this information is essential. If we are doing our job correctly, the current definition should match the deprecated definition and therefore the deprecated definition doesn't contain any new information.

I suggested adding URIs mainly for data provenance purposes as it seems quite useful to known the origin of a specific alias (e.g. one may automatically cross-check if the definitions are still the same, etc.). Note, that the mmCIF dictionary already records data name alias provenance in a similar way:

   _item_aliases.alias_name  "_reflns_number_observed"
   _item_aliases.dictionary  cif_core.dic
   _item_aliases.version     2.0.1

I am happy to merge this PR without the URIs, but please let me know if you would like me to add them for the deprecated aliases since I currently have this information at hand. I would mark the remaining URIs with the '?' special value since tracking down the origin of all names seems like a separate (big) task.

vaitkus commented 4 months ago

Thank you for the review. I think I will add the URIs in a separate issue/PR.