COMCIFS / Powder_Dictionary

CIF definitions for powder diffraction
4 stars 4 forks source link

dREL name `refln.wavelength_id` is not unique #4

Closed vaitkus closed 2 years ago

vaitkus commented 2 years ago

The combination of _name.category_id and _name.object_id attributes is used to uniquely identify data items in dREL code snippets. However, the combination of refln and wavelength_id is not unique as it appears both in the save__pd_refln.wavelength_id save frame from the CIF_POW dictionary and in the save_refln.wavelength_id save frame from the CIF_CORE dictionary (which is imported by CIF_POW).

The dREL name does not seem to be used in any of the dREL codes snippets that are present in CIF_CORE and CIF_POW dictionaries so it should be safe to rename any of the previously mentioned items.

Also, note that _pd_refln.wavelength_id item belongs to the REFLN category which appears both in CIF_CORE and in CIF_POW and is the only category in CIF_POW without the PD prefix. The CIF_CORE dictionary is imported with the 'Ignore' mode which forces all duplicate data items to be silently ignored, however, in this specific case save__pd_refln.wavelength_id and save_refln.wavelength_id are not considered duplicates due to different _definition.id attribute values (_pd_refln.wavelength_id vs. _refln.wavelength_id).

jamesrhester commented 2 years ago

I think this can be fixed by simply calling _pd_refln.wavelength_id an alias of _refln.wavelength_id in cif_core.

vaitkus commented 2 years ago

Would the proposed solution: a) add the alias to CIF_CORE and remove the save__pd_refln.wavelength_id save frame from the CIF_POW altogether. b) add the alias to CIF_CORE, use it to detect duplicate definitions and silently ignore the import?

Approach a) is simpler, but it also results in data name from the CIF_POW dictionary being fragmented over several dictionaries. Approach b) is more flexible, however, I would probably inverse the relationship by adding _refln.wavelength_id as an alias of _pd_refln.wavelength_id in CIF_POW and this way retain all of the _pd items strictly in CIF_POW.

However, I think I made a slight mistake in my last message by claiming that the duplicates are not detected due to the differing _definition.id attribute values since duplicate save frames detection is based on their save frame codes and not on the _definition.id attribute. So the simplest solution might be to simply rename save__pd_refln.wavelength_id to save_refln.wavelength_id in the CIF_POW dictionary and potentially add _refln.wavelength_id data name as an alias. What do you think?

jamesrhester commented 2 years ago

I see the issue with my suggestion - a user will naturally check the powder dictionary for this data name, even if automatic tools would see it in CIF core and be happy.

I'm not comfortable with using identical save frame names as it feels like a technical 'trick' and may be fragile as it would be a single exception that future tools may inadvertently overwrite. So I think it is best to simply add _pd_refln.wavelength_id as an alias in CIF_CORE, and _refln.wavelength_id as an alias in cif_pow, leaving the current save frame names. Essentially we are duplicating the definition in the same way as creating an alias 'duplicates' the definition, and in this case _name.category_id and _name.object_id should be the same. We can note in the cif_pow definition that this is purely a duplicate of the core definition and recommend using the core equivalent.

This of course creates the possibility of a data block containing both data names, but that is no different to data blocks containing aliases of data names that are already present.

vaitkus commented 2 years ago

I'm not comfortable with using identical save frame names as it feels like a technical 'trick' and may be fragile as it would be a single exception that future tools may inadvertently overwrite.

Is the previously proposed approach really that out of the ordinary? My proposal was based on the following observations:

  1. Several save frames in CIF_POW already have frame names that are also used in CIF_CORE (_refln.F_squared_meas, _refln.F_complex) and are thus redefined due to the 'Ignore' import mode. However, it is important to note that in those cases the data names did not change nor get new aliases.
  2. Save frame names play an important role in the dictionary import mechanism and thus should not be changed without a good reason. Save frame names may be referenced in import statements of other dictionaries, including those not maintained by the IUCr, therefore any risk of accidentally overwriting save frame names must be minimised anyway.

So I think it is best to simply add _pd_refln.wavelength_id as an alias in CIF_CORE, and _refln.wavelength_id as an alias in cif_pow, leaving the current save frame names. Essentially we are duplicating the definition in the same way as creating an alias 'duplicates' the definition, and in this case _name.category_id and _name.object_id should be the same. We can note in the cif_pow definition that this is purely a duplicate of the core definition and recommend using the core equivalent.

I see two drawbacks to this approach:

  1. Adding _pd_refln.wavelength_id as an alias in CIF_CORE would imply that it is OK to use this name instead of _refln.wavelength_id even if the rest of the file does not describe a powder experiment. This is not a big problem, but it does look quite strange and slightly deviates from the usage in DDL1.
  2. After the import, the CIF_POW dictionary would ends up with two save frames that define items with identical data names. Since, to my understanding, duplicate imports are detected based strictly on save frame names, both save__pd_refln.wavelength_id and save_refln.wavelength_id save frames would simultaneously define _pd_refln.wavelength_id and _refln.wavelength_id data names. The proposed human-readable note about one of the save frames being an intentional duplicate would be useful to the users, but would be of little help in automatically resolving this in dictionary handling software. It would be much more robust to handle this without having to rely on any ad-hoc built in rules.

This of course creates the possibility of a data block containing both data names, but that is no different to data blocks containing aliases of data names that are already present.

I would view the situation where an alias in one save frame matches a data name (or an alias) in another save frame as ambiguous and thus a mistake. Also, depending on the software implementation this may lead to both data name definitions being simultaneously defined or one of the definitions being randomly picked up over the other which may even differ between software runs (e.g. implementation relying on hash tables).

jamesrhester commented 2 years ago

Adding _pd_refln.wavelength_id as an alias in CIF_CORE would imply that it is OK to use this name instead of _refln.wavelength_id even if the rest of the file does not describe a powder experiment. This is not a big problem, but it does look quite strange and slightly deviates from the usage in DDL1.

This is true. Non-powder software is unlikely to recognise this alias (unless it reads in the dictionary).

So the simplest solution might be to simply rename save__pd_refln.wavelength_id to save_refln.wavelength_id in the CIF_POW dictionary and potentially add _refln.wavelength_id data name as an alias. What do you think?

I think this is the best solution, due to the points made above. I will ask the pd list about deprecating the _pd_refln.wavelength_id data name, which may bring up the reasons for its existence. Or not.

jamesrhester commented 2 years ago

PR #6 fixes this.

vaitkus commented 2 years ago

The issue was resolved by merging PR #6.