Additional requirements on formulas

merkys commented 3 years ago

In https://github.com/Materials-Consortia/optimade-python-tools/pull/986 I have proposed two additional requirements for formulas (chemical_formula_reduced, chemical_formula_hill and chemical_formula_anonymous) dictated by my personal "common sense":

Formulas MUST NOT be empty strings. If a formula is unknown, a null value MUST be used. An empty formula looks as if the structure in question does not have any atoms, and I do not think this is something that should be allowed.
Formulas MUST NOT contain element proportions equal to 0. Trivially, if no such element exists in a structure, it MUST be excluded. If, however, a minute proportion of a certain element is observed, it MUST NOT be rounded to 0 (the specification requires rounding to integers for chemical_formula_reduced and chemical_formula_anonymous). I think a better approach than rounding would be to multiply all proportions by some number to eliminate fractional proportions.
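To make the "multiply all proportions by some number" idea concrete, here is a minimal Python sketch. The function names `integer_proportions` and `format_reduced` and the `max_denominator` cut-off are assumptions made for illustration; they are not part of the specification or of optimade-python-tools.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def _lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def integer_proportions(proportions, max_denominator=1000):
    """Scale fractional element proportions to the smallest integers.

    `max_denominator` is an assumed tolerance: each proportion is first
    approximated by a fraction whose denominator does not exceed it.
    """
    fracs = {
        el: Fraction(x).limit_denominator(max_denominator)
        for el, x in proportions.items()
    }
    if any(f <= 0 for f in fracs.values()):
        raise ValueError("every listed element needs a proportion > 0")
    # Multiply by the common denominator so that every proportion is an integer.
    scale = reduce(_lcm, (f.denominator for f in fracs.values()), 1)
    ints = {el: int(f * scale) for el, f in fracs.items()}
    # Divide out any common factor that remains.
    common = reduce(gcd, ints.values(), 0)
    return {el: n // common for el, n in ints.items()}


def format_reduced(ints):
    """Join symbols alphabetically, omitting proportions equal to 1."""
    return "".join(f"{el}{n if n > 1 else ''}" for el, n in sorted(ints.items()))


doped = integer_proportions({"Ca": 0.99, "Na": 0.01, "O": 0.995})
print(format_reduced(doped))  # Ca198Na2O199
```

No element is rounded away, at the cost of large proportion numbers.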

This seems related to #361.

JPBergsma commented 3 years ago

I agree that formulas MUST NOT be empty strings. I am however wondering what counts as a minute amount. I would find it rather ugly to have very large element proportions. If an element is only present as a dopant, I would allow this element to not be in the formula.

merkys commented 3 years ago

I am however wondering what counts as a minute amount.

In my proposal I meant > 0. So if an element exists in the structure, it has to be mentioned in the formula no matter the actual amount.

I would find it rather ugly to have very large element proportions.

I agree that large proportions are ugly, but they retain the information.

If an element is only present as a dopant, I would allow this element to not be in the formula.

If we all can agree on a formal definition for a dopant, then I guess such elements could in principle be excluded. However, element exclusion negatively affects queries which might be interested in dopants.

JPBergsma commented 2 years ago

On second thought, dopants are probably not such a problem, as the composition is made on purpose and in that sense is intended as a material in itself. It is also likely that there is data on the pure material as well.

For experimental systems it could be more problematic, as a material can have impurities and defects. In that case a material may have a composition like Ca198Na2O199. Do we want users to be able to find this material if they look for CaO? I think it should be found. So I would place CaO in the chemical_formula_reduced field and Ca198Na2O199 in the chemical_formula_descriptive field to show that the material was impure.

There is also the elements_ratios field, which keeps the exact ratios between the elements and can therefore also be used to find doped materials. So we would not lose information if we rounded the amount of an element in the chemical_formula_reduced field.

merkys commented 2 years ago

For experimental systems it could be more problematic, as a material can have impurities and defects. In that case a material may have a composition like Ca198Na2O199. Do we want users to be able to find this material if they look for CaO? I think it should be found.

Fair enough. This can be achieved now by querying elements HAS ONLY [ "Ca", "O" ] no matter what conventions for formulas are used.

So I would place CaO in the chemical_formula_reduced field and Ca198Na2O199 in the chemical_formula_descriptive field to show that the material was impure.

Could you propose a programmatic way to arrive at CaO from Ca198Na2O199?

rartino commented 2 years ago

Formulas MUST NOT be empty strings. If a formula is unknown, a null value MUST be used. An empty formula looks as if the structure in question does not have any atoms, and I do not think this is something that should be allowed.

Nothing right now forbids a structure without any atoms (i.e., unit cell but no coordinates), and I think the reasonable chemical formula for that is the empty string. Does it not make sense to allow that? I suppose I can try to come up with some scenarios where it can be useful. You could possibly modify your proposed requirement to be more of a clarification that empty formulas MUST only appear for structures without any atoms. (However - eh - how was it with those non-specifically placed hydrogens? Since I don't use this, I don't quite remember everything they make possible - is it perhaps possible to come up with a pathological example of only non-specifically placed hydrogens which should have an empty chemical_formula_descriptive?)

I think I agree with the limitation that, e.g. Na0 (0 = zero, not oxygen) should not be used to indicate disordered systems with very small concentrations, but I suppose the question is what one should do for those cases.

ml-evs commented 2 years ago

Just to join up this conversation with https://github.com/Materials-Consortia/OPTIMADE/issues/361, which approaches a similar problem from the elements and elements_ratios side.

One suggestion for impurities with vanishing/unknown concentration is to add them to the elements and elements_ratios lists but exclude them from chemical_formula_reduced. I am more comfortable with elements_ratios being zero for one species (as all the normal filtering semantics on floats would work) compared to adding it to formulae where it would break string matching.

In this case, it might even be sensible to add a new structure_feature tag to such an entry so that it can be filtered easily (impurity?). We would just need to define the rules for when this tag should be added (e.g. a minimum elements_ratios). The only problem I can see is that we would be treating adatom and substitutional impurities differently from vacancies and stoichiometry-preserving defects which might be misleading for users. The alternative would be for queries that want to return only pristine structures to add something like elements_ratios HAS ALL > 0.01 (which I do not think is well-supported).

Could you propose a programmatic way to arrive at CaO from Ca198Na2O199?

A query that could return Ca198Na2O199 and related structures around CaO could be:

elements:elements_ratios HAS ALL "Ca":>0.49, "O":>0.49.

Although this is an optional filter feature, a database serving defected structures should probably implement it... With the additional suggestion above of adding the Na defect with elements_ratios = [0.5, 0, 0.5], the query above would become elements:elements_ratios HAS ALL "Ca":=0.5, "O":=0.5 (where more awkward ratios would cause problems).

JPBergsma commented 2 years ago

Fair enough. This can be achieved now by querying elements HAS ONLY [ "Ca", "O" ] no matter what conventions for formulas are used.

This would also return calcium peroxide CaO2. (and "O2" and "Ca" but you can prevent that with 'HAS ALL ["Ca","O"] AND nelements=2')

@rartino I guess you could consider a vacuum a material in some respects. If a database contains data on the polarizability of different materials, they may include vacuum, as it also has a measurable polarizability.

Perhaps a small PR can already be created about the things about which we agree:

If the reduced chemical formula is unknown, it should be null.
Element proportions MUST NOT be 0.

We can then turn the discussion about how to handle impurities into a separate topic. Do databases have information about the purity of their structures? If not, it is not really useful to have such a discussion here.

merkys commented 2 years ago

Fair enough. This can be achieved now by querying elements HAS ONLY [ "Ca", "O" ] no matter what conventions for formulas are used.

This would also return calcium peroxide CaO2. (and "O2" and "Ca" but you can prevent that with 'HAS ALL ["Ca","O"] AND nelements=2')

Right, elements HAS ONLY [ "Ca", "O" ] will return CaO2, but so would elements HAS ALL ["Ca","O"] AND nelements=2. To filter out CaO2 one would need to query for elements:elements_ratios HAS ALL "Ca":>0.49, "O":>0.49, like @ml-evs pointed out.
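To compare the three filters discussed here, the following Python sketch emulates them client-side on a handful of compositions. It is only an illustration of the intended set and ratio semantics, not an OPTIMADE implementation; the helper names are hypothetical, and it assumes the doped entry lists Na in its elements field.

```python
# Element counts for a few hypothetical entries.
ENTRIES = {
    "CaO": {"Ca": 1, "O": 1},
    "CaO2": {"Ca": 1, "O": 2},
    "O2": {"O": 2},
    "Ca": {"Ca": 1},
    "Ca198Na2O199": {"Ca": 198, "Na": 2, "O": 199},
}


def has_only(counts, allowed):
    """elements HAS ONLY [...]: every element of the entry is in the list."""
    return set(counts) <= set(allowed)


def has_all_and_nelements(counts, required):
    """elements HAS ALL [...] AND nelements=len(required)."""
    return set(required) <= set(counts) and len(counts) == len(required)


def ratios_above(counts, thresholds):
    """elements:elements_ratios HAS ALL "X":>t for every (X, t) pair."""
    total = sum(counts.values())
    return all(counts.get(el, 0) / total > t for el, t in thresholds.items())


def matching(predicate, *args):
    """Return the sorted labels of all entries accepted by the predicate."""
    return sorted(name for name, c in ENTRIES.items() if predicate(c, *args))


print(matching(has_only, ["Ca", "O"]))
# ['Ca', 'CaO', 'CaO2', 'O2']
print(matching(has_all_and_nelements, ["Ca", "O"]))
# ['CaO', 'CaO2']
print(matching(ratios_above, {"Ca": 0.49, "O": 0.49}))
# ['Ca198Na2O199', 'CaO']
```

Under these assumptions only the ratio-based filter both excludes CaO2 and still finds the lightly doped entry; the two element-set filters drop the doped entry because Na counts as a third element.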

@rartino I guess you could consider a vacuum a material in some respects. If a database contains data on the polarizability of different materials, they may include vacuum, as it also has a measurable polarizability.

Not sure if the current specification is ready to describe such structures, but why not introduce them in the future.

Perhaps a small PR can already be created about the things about which we agree:

If the reduced chemical formula is unknown, it should be null.

Element proportions MUST NOT be 0.

Agree. Shall we add that formulas MUST be empty strings only for vacuum? Structures with non-specifically placed hydrogen atoms would still have hydrogen atoms in formulas, I guess. Not sure, though, what to do about structures built from subatomic particles, for example, sole electrons, if only such exist.

We can then turn the discussion about how to handle impurities into a separate topic. Do databases have information about the purity of their structures? If not, it is not really useful to have such a discussion here.

At least experimental structural databases have information about impurities of sites.

JPBergsma commented 2 years ago

Agree. Shall we add that formulas MUST be empty strings only for vacuum? Structures with non-specifically placed hydrogen atoms would still have hydrogen atoms in formulas, I guess. Not sure, though, what to do about structures built from subatomic particles, for example, sole electrons, if only such exist.

For me it is ok to specify that formulas MUST be empty strings only for vacuum. In my opinion, non-specifically placed hydrogen atoms MUST appear in the chemical formula.

Good point about the solvated electrons / electrides. At low temperature, they can be quite stable. ~~The most logical would be to use the small letter "e".~~ I just realized this may be confusing when distinguishing C + e from Ce in the chemical formula fields. Perhaps an "E" would be better, as it fits the rules for the other elements.

I guess we could use the centre of the electron density as the position. Although, I am not sure how to define the positions for a metallic electride. It would definitely be a good idea to add this to the standard, although this would affect more than just the elements field, so I think it would be better to create a separate issue about this.

merkys commented 2 years ago

Electrons and neutrons could be marked as e and n respectively, if only we mandate that such symbols appear at the beginning of the formula. This way no capital letter will appear before e and n if these elements appear in the formula.

However, I am not sure this is not overkill. Will there be structures made up entirely of electrons or neutrons?

JPBergsma commented 2 years ago

I do not see how a structure could be made of just electrons and neutrons. Neutrons that are not bound in atomic nuclei will decay quickly, and they cannot form bound states. Electrons repel each other. So I do not see how you can form a chemical structure with just electrons and neutrons.

The only scenario I can think of with both free electrons and neutrons is when you would study the effect of (neutron/beta) radiation on a material. In that case, when an unbound neutron is "fired" at a material, an atom could get ionized. In that case, the trajectory would also contain a separate electron. Other than that, I do not see how an unbound neutron could appear in a trajectory or structure.

rartino commented 2 years ago

One of the most important papers for DFT, the Ceperley-Alder Monte Carlo simulations that more or less all LDA correlation functionals are based on [ http://dx.doi.org/10.1103/PhysRevLett.45.566 ; 11k citations] deals with "structures" of only electrons. When you get down to low densities, you get something called a Wigner crystal, and the high density limit is the famous uniform electron gas.

Whether these can be called "materials" can be discussed, but I suppose it could be relevant to be able to represent them...

merkys commented 2 years ago

Thanks for the link, @rartino. So having the electron (I suggest e) as a possible chemical symbol makes sense. I just want to make sure e would not break anything in formulas:

In chemical_formula_reduced, elements are ordered alphabetically. e < A, thus there should be no problems.
In chemical_formula_hill, using Hill formula notation, e could be written on the leftmost position in the formula. This way it will appear before any capital letter.
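One caveat worth checking when implementing this: under plain code-point (ASCII) comparison, lowercase letters sort after uppercase ones, so e would end up last rather than first unless the ordering rule is spelled out. A small Python sketch, where the `electron_first` sort key is a hypothetical helper and not something defined by the specification:

```python
symbols = ["O", "e", "Ca"]

# Default string comparison uses code points: "C" (67) < "O" (79) < "e" (101).
print(sorted(symbols))  # ['Ca', 'O', 'e']


def electron_first(symbol):
    """Hypothetical sort key that places the electron symbol before all elements."""
    return (symbol != "e", symbol)


print(sorted(symbols, key=electron_first))  # ['e', 'Ca', 'O']
```

With such an explicit rule the electron symbol is always leftmost, so it can never be read as the second letter of a two-letter element symbol such as Ce.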

I highly doubt IUPAC will ever standardize a chemical symbol starting with a lowercase letter. Only a small fraction of the 26^2 possible double-letter symbols is already taken.

rartino commented 2 years ago

I'm not sure why it is a good idea to indicate extra subatomic particles in the chemical formulas at all? Are there any examples of anyone doing that? My vote is to skip the e+n extension until someone shows up with a relevant use case for that.

And then, rather than to relate an empty chemical formula to specifically vacuum, just say that an empty chemical formula MUST only occur for a structure with no atoms. Does that work?

merkys commented 2 years ago

I agree with @rartino, there is no need to over-complicate right now.

JPBergsma commented 2 years ago

I'm not sure why it is a good idea to indicate extra subatomic particles in the chemical formulas at all? Are there any examples of anyone doing that? My vote is to skip the e+n extension until someone shows up with a relevant use case for that.

These are some examples of chemical formulas with electrons: [Ca₂₄Al₂₈O₆₈]⁴⁺4e^- (https://pubs.acs.org/doi/10.1021/ol701885p), [Na(NH₃)₆]⁺e⁻ (https://en.wikipedia.org/wiki/Electride), [La₈Sr₂(SiO₄)₆]⁴⁺:4e^– (https://www.nature.com/articles/s41535-017-0053-4)

Leaving the electrons out will give you a different material with different properties. It would also make it more difficult to find electrides in the databases. It would probably be good to also have a field for the charge distribution on the atoms, as different charge distributions will give different materials.

rartino commented 2 years ago

@JPBergsma Fair enough, but your examples only make sense because they charge balance the ^{N+} in those formulas - which is a notation we also do not support, not even in chemical_formula_descriptive. Should we then support that as well? And, does the separate e:s add anything to those formulas that isn't already communicated with the ^{N+} notation?

merkys commented 2 years ago

These are some examples of chemical formulas with electrons: [Ca24Al28O68]4+4e- (https://pubs.acs.org/doi/10.1021/ol701885p), [Na(NH3)6]+e− (https://en.wikipedia.org/wiki/Electride), [La8Sr2(SiO4)6]4+:4e– (https://www.nature.com/articles/s41535-017-0053-4)

Leaving the electrons out will give you a different material with different properties. It would also make it more difficult to find electrides in the databases. It would probably be good to also have a field for the charge distribution on the atoms, as different charge distributions will give different materials.

None of the formulas considered in the initial post on this issue supports charges. While ionic composition formulas would be nice to have, I think this is out of scope for this particular issue.

merkys commented 2 years ago

I would like to revive the thread. There have been some nice future-proof suggestions, but how about introducing just the constraints expressed in my original post, for the time being? Non-empty formula and non-zero element proportion constraints have already been implemented in optimade-python-tools (see https://github.com/Materials-Consortia/optimade-python-tools/pull/986) and are suggested to be included into OpenAPI schemas (see https://github.com/Materials-Consortia/schemas/pull/8).

I understand that some structures will become inexpressible (vacuum structures; structures with very tiny proportions of some element), but at present the specification does not say how such formulas should be interpreted.

rartino commented 2 years ago

Echoing what I said previously, I want to be allowed to have a zero-length cartesian_site_positions and then set the chemical formulas to the empty string. It may sound silly, but it can be the outcome of certain automatic processes that generate structures, and I see no reason to disallow them from being represented.

I am in favor of stating that the proportion constant must be strictly a positive number.

ml-evs commented 2 years ago

I'm perhaps slightly more reluctant to allow empty strings than others, as I think it goes against the spirit of what we laid out in the description of the chemical_formula_x fields (Is a periodic box of vacuum a chemical? Is jellium?). However, I would not be against explicitly loosening the spec in this regard. I think my standpoint is similar to my interpretation, at least, of Andrius' (@merkys) position, i.e., the constraints on non-empty strings and non-zero proportions are implied by the current spec, and thus could be included in our v1.1 OpenAPI schemas, but we could consider loosening this for 1.2 (i.e., simply changing the regex we would be introducing for the OPTIMADE v1.1 schema in https://github.com/Materials-Consortia/schemas/pull/8 and adding example values for these edge cases). Of course, if we really think the specification as it stands allows such empty formulae then we can just manually alter the optimade-python-tools-produced schema for 1.1 and drop the regex altogether.
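As an illustration of how a regex could encode both constraints, and how loosening it later would amount to a one-line change, here is a Python sketch. The pattern below is an assumption written for this example; it is not claimed to be the expression used in optimade-python-tools or in the schemas PR, and it does not check alphabetical ordering or whether the symbols are real elements.

```python
import re

# One or more "symbol + optional proportion" groups. The proportion is either a
# single digit 2-9 or a multi-digit number without a leading zero, so both
# "0" and an explicit "1" are rejected.
REDUCED_PATTERN = re.compile(r"([A-Z][a-z]?([2-9]|[1-9][0-9]+)?)+")


def is_valid_reduced(formula, allow_empty=False):
    """Check a formula string; `allow_empty` models the proposed loosening."""
    if formula == "":
        return allow_empty
    return REDUCED_PATTERN.fullmatch(formula) is not None


print(is_valid_reduced("Ca198Na2O199"))       # True
print(is_valid_reduced("CaNa0O"))             # False
print(is_valid_reduced(""))                   # False
print(is_valid_reduced("", allow_empty=True)) # True
```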

rartino commented 2 years ago

@ml-evs If you say these constraints are implied in the current specification, what passage of text do you base this on? Personally I would say our requirements/conventions are simply unclear on this (which is why this issue is good - it should be clarified either way). However, I would also generally argue when it comes to schemas that unclear = allowed.

If we are moving this to a semantic discussion ("what is a chemical?") it should probably start with whether an OPTIMADE structure generalizes to systems of zero atoms. I'd argue that I want them to, both based on semantics and utility. Then it follows:

If structures of zero atoms are not allowed, I think we need quite a bit of clarification also for other fields (e.g., nelements > 0, nsites > 0, length of cartesian_site_positions > 0, etc.)
If structures of zero atoms are allowed, then what should I set the chemical formulas to? We specifically disallow null, so the empty string seems to be the only viable choice for these systems?
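If zero-atom structures are allowed, the rule "an empty formula MUST only occur for a structure with no atoms" could be checked roughly as in the Python sketch below. The field names are those of the structures entry type; the function `check_empty_consistency` and the choice of checks are assumptions for illustration only.

```python
FORMULA_FIELDS = ("chemical_formula_reduced", "chemical_formula_anonymous")


def check_empty_consistency(structure):
    """Return a list of inconsistencies between site count and formulas."""
    problems = []
    n_positions = len(structure["cartesian_site_positions"])
    if structure["nsites"] != n_positions:
        problems.append("nsites does not match cartesian_site_positions")
    if n_positions == 0:
        # A structure without atoms: nothing may claim otherwise.
        if structure["nelements"] != 0 or structure["elements"]:
            problems.append("zero-atom structure lists elements")
        for field in FORMULA_FIELDS:
            if structure[field] != "":
                problems.append(f"{field} should be empty for zero atoms")
    else:
        for field in FORMULA_FIELDS:
            if structure[field] == "":
                problems.append(f"{field} is empty but atoms are present")
    return problems


empty_cell = {
    "lattice_vectors": [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]],
    "cartesian_site_positions": [],
    "nsites": 0,
    "nelements": 0,
    "elements": [],
    "chemical_formula_reduced": "",
    "chemical_formula_anonymous": "",
}
print(check_empty_consistency(empty_cell))  # []
```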

merkys commented 2 years ago

I agree that the "nonempty formula" part of this discussion boils down to whether OPTIMADE allows structures with 0 atoms or not. To me it seems that for structures with 0 atoms, structural properties like lattice_vectors and space_group_* make no sense.

rartino commented 1 year ago

for structures with 0 atoms, structural properties like lattice_vectors and space_group_* make no sense.

I don't see why one cannot define a perfectly fine (non-primitive) unit cell out of three lattice vectors and have it contain zero atoms; to me these are two almost completely separate things. space_group_* is a bit more tricky, but it also isn't required to be provided for these. (If I had to set it, I suppose I would have to indicate symmetry under all symmetry operations).

For a standard representation of "structures" in materials science, is it not better to err on the side of allowing too much, rather than too little? I already gave one use case above, I think I can come up with more if pressed. What problem is it that you are trying to solve by forbidding people to transmit information about empty unit cells via OPTIMADE?

ml-evs commented 1 year ago

It sounds like empty formulae are desirable after all (pending an excoriating rebuttal from @merkys), so I have just relaxed the constraint on non-empty formulae in optimade-python-tools (pending a review). (Re-reading my original comment I don't see anywhere in the spec that implies they cannot be empty, beyond the semantics of the term 'chemical' which I agree is not a relevant discussion to have here!)

The constraint on "0-proportion" elements is sensible though, so I have left that in.

Do we need a PR to tighten the wording in the spec on this, or can this be closed?

Materials-Consortia / OPTIMADE

Additional requirements on formulas #388