The provided format of `_atom_type.symbol` is ambiguous

vaitkus commented 1 year ago

The issue is potentially related to issue #266.

The current definition of _atom_type.symbol reads:

The identity of the atom specie(s) representing this atom type.
Normally this code is the element symbol followed by the charge
if there is one. The symbol may be composed of any character except
an underline or a blank, with the proviso that digits designate an
oxidation state and must be followed by a + or - character.

This seems a bit ambiguous, since the code is described as an element symbol followed by the charge, while the later sentence states that digits designate an oxidation state. The original DDL1 definition of this item omits the charge part and does not forbid spaces, but is otherwise virtually the same:

The code used to identify the atom species (singular or plural)
representing this atom type. Normally this code is the element
symbol. The code may be composed of any character except an
underscore with the additional proviso that digits designate an
oxidation state and must be followed by a + or - character.

The updated DDLm definition was already present in the first GitHub commit [1] so the exact origin of this change does not seem to be documented.

[1] https://github.com/COMCIFS/cif_core/blob/3b2464579f769f53a974b202ecc731a2d89af565/core_struc.dic#L2400

jamesrhester commented 1 year ago

I don't understand how this is ambiguous. The intention is that the type symbol is of the form Xxn(+/-), where Xx is the atom type, and n is the oxidation state, with the n(+/-) optional. If the oxidation state is absent, then the Xx may be an arbitrary alphanumeric string, as given in the examples.
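
The Xxn(+/-) form described above can be sketched as a small parser. This is only an illustration of one possible reading of the definition, not an official grammar: the regular expression and the name `parse_type_symbol` are assumptions made here, and a bare trailing sign without digits is simply left as part of the base string.

```python
import re
from typing import Optional, Tuple

# Assumed reading of the definition: a base containing no digit, blank or
# underscore, optionally followed by digits that MUST be followed by + or -.
_SYMBOL_RE = re.compile(r"(?P<base>[^\s_\d]+)(?:(?P<n>\d+)(?P<sign>[+-]))?")


def parse_type_symbol(symbol: str) -> Optional[Tuple[str, Optional[int]]]:
    """Return (base, signed charge or None), or None if the symbol does not fit."""
    m = _SYMBOL_RE.fullmatch(symbol)
    if m is None:
        return None
    if m.group("n") is None:
        # No n(+/-) suffix: the whole symbol is an arbitrary label.
        return m.group("base"), None
    magnitude = int(m.group("n"))
    charge = magnitude if m.group("sign") == "+" else -magnitude
    return m.group("base"), charge


print(parse_type_symbol("Cu2+"))
print(parse_type_symbol("FeNi"))
print(parse_type_symbol("O2"))
```

Under this reading, a symbol with digits that are not followed by a sign, or with anything trailing the sign, is rejected rather than guessed at.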

The DDLm variant of the definition occurs in the earliest copies of the DDLm work that I have, from 2007, so I think it was an attempt to improve on the DDL1 definition, given that _atom_type.symbol is used to index into data tables.

I assume that the point of dummy or FeNi as atom type symbols is to allow incompletely understood structures to be described. Of course ? and . are available, but FeNi conveys information that those do not.

If you can suggest a more precise definition that would be great.

vaitkus commented 1 year ago

But doesn't the updated DDLm definition conflate the ideas of an atomic charge and oxidation state? How would the described "normal" code with an element symbol and a trailing charge look and how would it differ from the one with the oxidation state?

jamesrhester commented 1 year ago

Yes, it does conflate them. So Cu2+ is copper with two electrons missing. As I understand it the significance of the trailing charge is to determine the appropriate form factor to use. What terminology is best for this?

vaitkus commented 1 year ago

> What terminology is best for this?

This is the thing I would also like to find out. Consider the two examples:

a) OH- ion. Formal charge of O is 1-, oxidation state is -2.
b) H2O2 molecule. Formal charge of O is 0, oxidation state of O is -1.

The templ_enum.cif file contains entries in several tables with the symbol O1-. Which of the two examples would it best describe? I guess the a) one?

However, note that the ATOM_TYPE loop also contains a separate _atom_type.oxidation_number data item. If we assume that the number in the symbol refers to the charge, then it becomes impossible to specify multiple entries for the same element with the same charge but different oxidation states (e.g. as in H2O2 (0/-1) and H2O (0/-2)).

rowlesmr commented 1 year ago

(Preface: I'm a physicist, so I think (or, at least, I thought I did) that oxidation state and charge are the same thing.)

Just by way of example, when I'm modelling alpha-alumina, I use Al3+ and O2- scattering factors from Waasmaier and Kirfel, but when I model silicon nitride, I use Si and N.

What is the function of an _atom_type.symbol? In TOPAS* (at least), its only function is to choose the correct set of scattering factors. So, to that end, the value is a charge. This is backed up by the definition of _atom_type.element_symbol (see below), which talks about an ion-to-element enumeration.

In order to have different charge/oxidation combinations**, you would need to loop _atom_type.key*** (this is explicitly needed), _atom_type.symbol, _atom_type.oxidation_number, and (the new) _atom_type.site_label. That allows you to assign different oxidation states to different atoms of the same charge based on which site they appear in in the structure.
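
As a rough illustration of why a separate key would be needed, the sketch below groups oxidation numbers first by symbol and then by key. The rows, the key values `t1`/`t2`, the site labels, and the helper `oxidation_numbers_by` are all hypothetical and chosen only to mirror the H2O / H2O2 example from earlier in the thread.

```python
from collections import defaultdict

# Hypothetical ATOM_TYPE rows: (key, symbol, oxidation_number, site_label).
rows = [
    ("t1", "O", -2, "O1"),  # oxygen of H2O: charge 0, oxidation state -2
    ("t2", "O", -1, "O2"),  # oxygen of H2O2: charge 0, oxidation state -1
]


def oxidation_numbers_by(rows, column):
    """Collect the oxidation numbers seen for each value of the given column."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row[column]].add(row[2])
    return {value: sorted(numbers) for value, numbers in grouped.items()}


by_symbol = oxidation_numbers_by(rows, 1)  # symbol used as the identifier
by_key = oxidation_numbers_by(rows, 0)     # separate key used as the identifier

print(by_symbol)  # {'O': [-2, -1]}
print(by_key)     # {'t1': [-2], 't2': [-1]}
```

With the symbol as the only identifier, both oxidation numbers land under the same entry; with a separate key, each row stays distinct.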

Other, related, data names are:

_atom_type.element_symbol

    Element symbol for of this atom type. The default value is extracted
    from the ion-to-element enumeration_default list using the index
    value of _atom_type.symbol.

_atom_site.type_symbol

    A code to identify the atom specie(s) occupying this site.
    This code must match a corresponding _atom_type.symbol. The
    specification of this code is optional if component_0 of the
    _atom_site.label is used for this purpose. See _atom_type.symbol.

_atom_site.label

     This label is a unique identifier for a particular site in the
     asymmetric unit of the crystal unit cell. It is made up of
     components, _atom_site.label_component_0 to *_6, which may be
     specified as separate data items. Component 0 usually matches one
     of the specified _atom_type.symbol codes. This is not mandatory
     if an _atom_site.type_symbol item is included in the atom site
     list. The _atom_site.type_symbol always takes precedence over
     an _atom_site.label in the identification of the atom type. The
     label components 1 to 6 are optional, and normally only
     components 0 and 1 are used. Note that components 0 and 1 are
     concatenated, while all other components, if specified, are
     separated by an underline character. Underline separators are
     only used if higher-order components exist. If an intermediate
     component is not used it may be omitted provided the underline
     separators are inserted. For example the label 'C233__ggg' is
     acceptable and represents the components C, 233, '', and ggg.
     Each label may have a different number of components.
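
The component rules quoted above can be sketched as a small splitter. The function name `split_label` and the heuristic for separating component 0 from component 1 (leading letters, optionally followed by digits and a sign) are assumptions made for illustration; a more reliable implementation would match component 0 against the declared _atom_type.symbol codes instead.

```python
import re
from typing import List

# Assumed heuristic: component 0 is a run of letters, optionally followed by
# digits and a sign (e.g. 'Cu2+'); whatever remains is component 1.
_HEAD_RE = re.compile(r"([A-Za-z]+(?:\d+[+-])?)(.*)")


def split_label(label: str) -> List[str]:
    """Split an atom site label into components 0, 1, 2, ... as strings."""
    # Components 2 and higher are separated by underscores; an empty string
    # marks an omitted intermediate component.
    head, *rest = label.split("_")
    m = _HEAD_RE.fullmatch(head)
    if m is None:
        raise ValueError(f"cannot find component 0 in label {label!r}")
    return [m.group(1), m.group(2)] + rest


print(split_label("C233__ggg"))  # ['C', '233', '', 'ggg']
```

The printed result matches the components C, 233, '', and ggg given in the definition.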

* I've just realised that I use _atom_site_type_symbol in pdcifplotter

** Hooray! You've crystallised hydrogen peroxide hydrate!

*** Only two data names use .key (_space_group_generator.key and _atom_type.key); the remainder use .id.

vaitkus commented 1 year ago

Thank you for your insights!

From a chemical standpoint, (formal) charge and oxidation state are indeed not the same. However, in monoatomic ions they are. So would it be true to say that most crystallographic applications operate under the assumption that the charge/oxidation state is only specified for monoatomic ions?

> In order to have different charge/oxidation combinations**, you would need to loop _atom_type.key*** (this is explicitly needed), _atom_type.symbol, _atom_type.oxidation_number, and (the new) _atom_type.site_label. That allows you to assign different oxidation states to different atoms of the same charge based on which site they appear in in the structure.

After thinking some more on this, I guess the different charge-oxidation state combinations could be recorded under the current setup using atom site codes like "O2-/a" and "O2-/b", e.g.:

loop_
_atom_site.type_symbol
_atom_site.oxidation_state
O2-/a 2
O2-/b 1

jamesrhester commented 1 year ago

It appears to me that O2-/a is not a valid form for _atom_site.type_symbol.

I think it is reasonable to assume that atomic charges refer to monoatomic ions. I say this because the standard independent atom model assumed by current core CIF uses independent atoms, and the form factors chosen for those atoms are based on the charge of the independent atom. Perhaps we should emphasise this point, and advise that _atom_site.oxidation_state can be used to indicate oxidation state, which is otherwise assumed identical to the charge.

In general it would be good to clarify how the _atom_site.oxidation_state is either (a) determined or (b) affects the calculation of structure factors.

COMCIFS / cif_core

The provided format of `_atom_type.symbol` is ambiguous #378