Closed rowlesmr closed 1 year ago
Other dataitems with potentially problematic mixtures of units:
_refln.form_factor_table
: It's a table of s vs scattering factor values_atom_type_scat.cromer_mann_coeffs
: it's a List
of none
and angstrom_squared
_atom_type_scat.exponential_polynomial_coefs
: arbitrary non-negative integer powers of angstrom_atom_type_scat.gaussian_coefs
: it's a List
of none
and angstrom_squared
_atom_type_scat.hi_ang_fox_coeffs
: it's a List
of none
, angstrom
, angstrom_squared
, and angstom_cubed
_atom_type_scat.inv_mott_bethe_coefs
: it's a List
of none
and angstrom_squared
_atom_type_scat.versus_stol_list
: It's an array of s vs scattering factor valuesThese ones might just be incorrect units:
_chemical_conn_atom.charge
: maybe electrons
?_valence_param.atom_1|2_valence
: maybe electrons
?_atom_type.mass_number
: maybe 'daltons'? I'm not too sure on my chemistry units._atom_type.oxidation_number
: maybe electrons
?There are probably others.
No, currently there is no attribute that would allow to specify multiple units of measurement.
I guess this could somewhat addressed by changing the _units.code
to inherit its container from the defining item (_type.container
= Implied
). This would not change anything for single-valued data items and would allow to properly describe items with fixed-length data structures, e.g.:
save_atom_type_scat.cromer_mann_coeffs
...
_description.text
;
The set of Cromer-Mann coefficients for generating X-ray scattering
factors. [ c, a1, b1, a2, b2, a3, b3, a4, b4 ]
Ref: International Tables for Crystallography, Vol. C
(1991) Table 6.1.1.4
;
...
_units.code [ none,
none, angstrom_squared,
none, angstrom_squared,
none, angstrom_squared,
none, angstrom_squared ]
This still leaves the problem open for tables and lists of indeterminate lengths. One option could be to introduce a new special _units.code
value calculated
which would clearly indicate that the value is indented to be calculated using a dREL method (I guess it should almost always be possible to programmatically construct a data structure of measurement units that mirrors the data structure of the main values). This value could also be used in cases where the units for a singular value is already calculated (e.g. _exptl_crystal.f_000
). For software implementations that properly handle dREL the calculated
values would always be replaced with other values during the processing while software that lacks this functionality would at least be able to identify such situations.
The _atom_type_scat.exponential_polynomial_coefs
is a beast of its own. Even if we use the previously suggested dREL based-approach of programmatically assigning units, we would quickly run out of enumerated units (angstrom_quadrupled
is already not there). The proper approach would be to split the unit and the exponent into separate items (e.g. "angstrom_squared" would be split into "angstrom" and "2"), but this seems like a very big change.
These ones might just be incorrect units:
This seems like a separate topic which should be split up into a different issue if more in-depth discussions are required:
_chemical_conn_atom.charge
: maybeelectrons
?_valence_param.atom_1|2_valence
: maybeelectrons
?_atom_type.oxidation_number
: maybeelectrons
?
The assignment of electrons
as units to some of these items was previously discussed and then refused for now (see https://github.com/COMCIFS/cif_core/issues/62). I think it is safe to leave them without units until requested otherwise.
_atom_type.mass_number
: maybe 'daltons'? I'm not too sure on my chemistry units.
Mass number is the count of neutrons and protons in the nucleus so units are probably not needed.
So it looks like the "incorrect" ones are correct. .
It looks like there isn't any way currently* to arbitrarily enumerate units for some data items. Could a _units.code
value of mixed
(or some other word) be used to indicate that there are more than one unit type in the associated value? (It's probably better than none
).
* and any changes to make it so would be difficult, and break many other things.
Here's another suggestion: we could define .
for _units.code
to mean "not supplied" (ie figure it out yourself), rather than trying to capture units for complex lists. If a dREL method is provided to construct a value for the defined item, then that could instead be analysed by an application in order to understand the units. In the case of _atom_type_scat.exponential_polynomial_coefs
, where the entries in the list have no single-value data name that they are composed from, a tedious Definition
method for _units.code
could be devised if anyone cared to. Or dREL explaining how the data name is used could be analysed to derive the units.
What I'm getting at is that I don't think we should spend our scarce labour on providing the units for compound data values unless there is some obvious benefit.
I guess this could somewhat addressed by changing the _units.code to inherit its container from the defining item (_type.container = Implied). This would not change anything for single-valued data items
I think this is a fine idea, because it is indeed true that this is the case.
.
for figure it out yourself works for me. Especially as these containers are (normally) filled with values that have defined units.
@jamesrhester I am all for not spending more time on this than we have too, but suggested used of .
does not seem right to me, since .
ends up having two different interpretations -- "not applicable" (in case when used with non-numeric values) and "not supplied" (in case of the given scenarios).
I agree with the general idea, but maybe we could use a different enumeration value for this purpose? Something like unspecified
with a huge warning that the use of this enumeration value should be avoided if possible.
If we consider that it makes no sense for compound data types containing values with differing units to have a single unit applied to them, then "not applicable" is reasonable. unspecified
is also acceptable to me. Possible text: Units not specified, but may be derived.
The body of the _units.code
definition could outline when use is possible, e.g. "unspecified
is used only for compound data types where values have differing units"
The body of the
_units.code
definition could outline when use is possible, e.g. "unspecified
is used only for compound data types where values have differing units"
If that is the type of definition you're after, then I would vote for mixed
as the name of the value to be used. It says "there are units in there, but we can't (easily) tell you what they are. Go figure it out.".
@rowlesmr By mixed
do you refer to the fact that the same structured value may have different units (e.g. as in case of _atom_type_scat.cromer_mann_coeffs
) or to the fact that different values of the same data item may have different units (e.g. depending on the value of some other data item)?
I am not strongly against mixed
, but to me it comes with certain additional semantics (e.g. mixed in what way?) whereas unspecified
is more of a catch-all term for all situations we are not capable of properly addressing under the current approach. However, I am fine either way.
Between this and data typing limitations mentioned in this discussion (https://github.com/COMCIFS/cif_core/pull/399#issuecomment-1577484771), maybe we should start a wishlist of all the features we would like to eventually fix/address in DDLm 5/6/7?
I meant in the first sense.
Looks like unspecified is coming out on top.
Let's go with unspecified then.
I will reopen this issue until we change the units of some items to unspecified
. The preliminary list of such items can be found in this comment (https://github.com/COMCIFS/cif_core/issues/409#issuecomment-1574719776) by @rowlesmr.
Is there a unit code to be used for dataitems that have mixtures of units? or should it just not be included?
cf
_atom_type_scat.cromer_mann_coeffs
, where the units are a mixture ofnone
andangstrom_squared
(also all the other scattering factor data items)
.
Could there be a
units.code
formixed
-type dataitems, where there are several different unit types? Not completely machine readable, but also shows that they aren't unitless, or arbitrary.