COMCIFS / cif_core

The IUCr CIF core dictionary
15 stars 9 forks source link

units for dataitems with mixed units? #409

Closed rowlesmr closed 1 year ago

rowlesmr commented 1 year ago

Is there a unit code to be used for dataitems that have mixtures of units? or should it just not be included?

cf _atom_type_scat.cromer_mann_coeffs, where the units are a mixture of none and angstrom_squared

(also all the other scattering factor data items)

.

Could there be a units.code for mixed-type dataitems, where there are several different unit types? Not completely machine readable, but also shows that they aren't unitless, or arbitrary.

rowlesmr commented 1 year ago

Other dataitems with potentially problematic mixtures of units:

These ones might just be incorrect units:

There are probably others.

vaitkus commented 1 year ago

No, currently there is no attribute that would allow to specify multiple units of measurement.

I guess this could somewhat addressed by changing the _units.code to inherit its container from the defining item (_type.container = Implied). This would not change anything for single-valued data items and would allow to properly describe items with fixed-length data structures, e.g.:

save_atom_type_scat.cromer_mann_coeffs
...
    _description.text
;
    The set of Cromer-Mann coefficients for generating X-ray scattering
    factors. [ c, a1, b1, a2, b2, a3, b3, a4, b4 ]
    Ref: International Tables for Crystallography, Vol. C
            (1991) Table 6.1.1.4
;
...
_units.code [ none,
              none, angstrom_squared,
              none, angstrom_squared,
              none, angstrom_squared,
              none, angstrom_squared ]

This still leaves the problem open for tables and lists of indeterminate lengths. One option could be to introduce a new special _units.code value calculated which would clearly indicate that the value is indented to be calculated using a dREL method (I guess it should almost always be possible to programmatically construct a data structure of measurement units that mirrors the data structure of the main values). This value could also be used in cases where the units for a singular value is already calculated (e.g. _exptl_crystal.f_000). For software implementations that properly handle dREL the calculated values would always be replaced with other values during the processing while software that lacks this functionality would at least be able to identify such situations.

The _atom_type_scat.exponential_polynomial_coefs is a beast of its own. Even if we use the previously suggested dREL based-approach of programmatically assigning units, we would quickly run out of enumerated units (angstrom_quadrupled is already not there). The proper approach would be to split the unit and the exponent into separate items (e.g. "angstrom_squared" would be split into "angstrom" and "2"), but this seems like a very big change.

These ones might just be incorrect units:

This seems like a separate topic which should be split up into a different issue if more in-depth discussions are required:

  • _chemical_conn_atom.charge: maybe electrons?
  • _valence_param.atom_1|2_valence: maybe electrons?
  • _atom_type.oxidation_number: maybe electrons?

The assignment of electrons as units to some of these items was previously discussed and then refused for now (see https://github.com/COMCIFS/cif_core/issues/62). I think it is safe to leave them without units until requested otherwise.

  • _atom_type.mass_number: maybe 'daltons'? I'm not too sure on my chemistry units.

Mass number is the count of neutrons and protons in the nucleus so units are probably not needed.

rowlesmr commented 1 year ago

So it looks like the "incorrect" ones are correct. .

It looks like there isn't any way currently* to arbitrarily enumerate units for some data items. Could a _units.code value of mixed (or some other word) be used to indicate that there are more than one unit type in the associated value? (It's probably better than none).

* and any changes to make it so would be difficult, and break many other things.

jamesrhester commented 1 year ago

Here's another suggestion: we could define . for _units.code to mean "not supplied" (ie figure it out yourself), rather than trying to capture units for complex lists. If a dREL method is provided to construct a value for the defined item, then that could instead be analysed by an application in order to understand the units. In the case of _atom_type_scat.exponential_polynomial_coefs, where the entries in the list have no single-value data name that they are composed from, a tedious Definition method for _units.code could be devised if anyone cared to. Or dREL explaining how the data name is used could be analysed to derive the units.

What I'm getting at is that I don't think we should spend our scarce labour on providing the units for compound data values unless there is some obvious benefit.

I guess this could somewhat addressed by changing the _units.code to inherit its container from the defining item (_type.container = Implied). This would not change anything for single-valued data items

I think this is a fine idea, because it is indeed true that this is the case.

rowlesmr commented 1 year ago

. for figure it out yourself works for me. Especially as these containers are (normally) filled with values that have defined units.

vaitkus commented 1 year ago

@jamesrhester I am all for not spending more time on this than we have too, but suggested used of . does not seem right to me, since . ends up having two different interpretations -- "not applicable" (in case when used with non-numeric values) and "not supplied" (in case of the given scenarios).

I agree with the general idea, but maybe we could use a different enumeration value for this purpose? Something like unspecified with a huge warning that the use of this enumeration value should be avoided if possible.

jamesrhester commented 1 year ago

If we consider that it makes no sense for compound data types containing values with differing units to have a single unit applied to them, then "not applicable" is reasonable. unspecified is also acceptable to me. Possible text: Units not specified, but may be derived. The body of the _units.code definition could outline when use is possible, e.g. "unspecified is used only for compound data types where values have differing units"

rowlesmr commented 1 year ago

The body of the _units.code definition could outline when use is possible, e.g. "unspecified is used only for compound data types where values have differing units"

If that is the type of definition you're after, then I would vote for mixed as the name of the value to be used. It says "there are units in there, but we can't (easily) tell you what they are. Go figure it out.".

vaitkus commented 1 year ago

@rowlesmr By mixed do you refer to the fact that the same structured value may have different units (e.g. as in case of _atom_type_scat.cromer_mann_coeffs) or to the fact that different values of the same data item may have different units (e.g. depending on the value of some other data item)?

I am not strongly against mixed, but to me it comes with certain additional semantics (e.g. mixed in what way?) whereas unspecified is more of a catch-all term for all situations we are not capable of properly addressing under the current approach. However, I am fine either way.

Between this and data typing limitations mentioned in this discussion (https://github.com/COMCIFS/cif_core/pull/399#issuecomment-1577484771), maybe we should start a wishlist of all the features we would like to eventually fix/address in DDLm 5/6/7?

rowlesmr commented 1 year ago

I meant in the first sense.

Looks like unspecified is coming out on top.

jamesrhester commented 1 year ago

Let's go with unspecified then.

vaitkus commented 1 year ago

I will reopen this issue until we change the units of some items to unspecified. The preliminary list of such items can be found in this comment (https://github.com/COMCIFS/cif_core/issues/409#issuecomment-1574719776) by @rowlesmr.