Materials-Consortia / OPTIMADE

Specification of a common REST API for access to materials databases
https://optimade.org/specification
Creative Commons Attribution 4.0 International

Extending `structures` (bonds, atom charges, etc) #426

Open merkys opened 2 years ago

merkys commented 2 years ago

The OPTIMADE specification v1.0.1 defines a structure as a set of sites, occupied by mixtures of atoms, with each atom described by its chemical type, mass and occupancy (proportion in the mixture). Means for expressing disorder are also in place, defined quite similarly to the CIF standard.

I wonder whether there would be interest in adding more chemical attributes to OPTIMADE structures, such as:

Some of these attributes can be derived algorithmically (connectivity, lone pairs), but derivation algorithms are often based on heuristics and sometimes fail to arrive at the "correct" result. Thus, if these details are available on the provider's side, it would be nice to have them communicated in OPTIMADE attributes.

JPBergsma commented 2 years ago

I think charges would be a useful property to add. Some atomistic simulations use the charge of an atom to calculate the interatomic potentials. In rare cases, the atomistic charges may also be the only way to distinguish chemical structures from one another. (e.g. Ions in cages can be stabilized in unusual oxidation states.)

Some formats like PDB also allow you to specify the connections between the atoms. I (ab)used this feature in the past for visualizing some of my coarse-grained data, so I think this could be a useful feature as well.

A little over a year ago, I talked with some people from Materials Cloud about which properties they would like to see standardized. These properties are mostly the results of calculations on the structures. They mentioned properties like:

For some of these, I do not really know what they are, so I can't really tell how useful these would be. They also mentioned space groups but these have already been added in PR#405.

Some other properties that I thought could be useful to add (mostly for use within trajectories, but some are useful for structures too) are:

| Field | Description |
| --- | --- |
| Temperature_set | The temperature to which the thermostat was set. |
| Temperature_measured | The measured temperature. |
| Velocities | The velocities of the atoms/particles. |
| Forces | The force that is exerted on a particle. |
| B factors | Also known as the Debye–Waller factor. |
| Constraint Force | The force required to maintain a reaction coordinate. |
| Time | In case of a trajectory, the time belonging to a particular frame. |
| Remarks | A field where some extra information can be given for this specific entry that does not fit in any of the other fields. |
| Various Energies | We could have fields for the components of the energy, such as kinetic energy, potential energy, total energy and electronic kinetic energy. |
| Enthalpy of formation | The enthalpy of formation for the compound in the structure. |

merkys commented 1 year ago

I am mostly interested in chemical connectivity. However, I would expect the definition of a chemical bond and its types to be quite involved. Could we adopt some already existing convention? CML, for instance, defines integer-numbered bond types for orders 1 to 3 (no 4), aromatic, unknown and other. To this list I would add order 4 and zero-order bonds. Anything else?

I saw @eimrek's addition to the OPTIMADE paper manuscript about a database of covalent organic networks, so it would be interesting to hear their opinion. Also pinging @BobHanson and @vaitkus for comments.

JPBergsma commented 1 year ago

I think it would be more informative to allow non-integer bond orders than to just have a value of 0.
The fact that the number is not an integer indicates that it is a non-classical bond. In the article you link to, they suggest using 0 for these cases, but this may be for backward compatibility. An option would be to allow extra bond properties that store more information about a bond. In that case, we could make a dictionary for each bond and store extra information about the bond there if needed. There are also more complex cases like three-centre two-electron bonds; perhaps we should also consider how to handle those.

Some d-block metal dimers can have a bond order as high as 6, so I think we should allow the bond order to reach that value.
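To make the per-bond dictionary idea concrete, here is a minimal sketch. The key names `sites` and `order` and the helper `check_bond` are hypothetical and not part of any OPTIMADE release; the upper limit of 6 is simply the value suggested in this comment.

```python
from numbers import Real

MAX_BOND_ORDER = 6  # upper limit suggested in this comment, not a specification value


def check_bond(bond):
    """Return a list of problems found in one bond dictionary (empty if none)."""
    problems = []
    sites = bond.get("sites")
    if not isinstance(sites, list) or len(sites) < 2:
        problems.append("'sites' must list at least two site indices")
    order = bond.get("order")  # optional; non-integer values are allowed
    if order is not None:
        if isinstance(order, bool) or not isinstance(order, Real):
            problems.append("'order' must be a number")
        elif not 0 <= order <= MAX_BOND_ORDER:
            problems.append("'order' must lie between 0 and 6")
    return problems


bonds = [
    {"sites": [1, 2], "order": 1.5},  # non-integer order for a delocalised bond
    {"sites": [3, 4, 5]},             # multi-centre bond, order left unspecified
    {"sites": [6, 7], "order": 7},    # rejected: above the suggested maximum
]
for bond in bonds:
    print(bond, check_bond(bond))
```

Leaving `order` optional means a provider can publish plain connectivity first and add further keys to the same dictionary later.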

BobHanson commented 1 year ago

I agree. V3000 allows for dative and coordinate bonds. Whether you call these "zero order" or not, is up to you.

[ https://depth-first.com/articles/2021/11/17/ten-reasons-to-adopt-the-v3000-molfile-format/ ]

But bonding in general adds significant complexity to a model. Beware!

vaitkus commented 1 year ago

I think it might be quite difficult to agree on a single bonding model that covers every situation so we could start with something simple and then extend it in the future as needed. Some general thoughts on the model:

BobHanson commented 1 year ago

I agree with Antanas. My thought on aromaticity is that -- particularly with associated 3D structures, as we have in this case -- standard Kekulé bonding is preferable. Aromaticity is not needed, since the 3D structure is there, and planarity, bond distances, aromaticity, and such can be easily derived from that.

This also relates to SMILES (clearly also a bonding model). We should be recommending non-aromatic SMILES. Explicit double bonds. That preference comes primarily from the fact that generally these SMILES will be targets -- that is, SMILES that actually represent structures. For searching structures, one may want an aromatic bonding model for the search pattern (Cc1ccccc1), but for a target one always needs the Kekulé bonding. Because Cc1ccccc1 will match CC1=CC=CC=C1, but CC1=CC=CC=C1 will not (is not supposed to) match Cc1ccccc1. From https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html:

SMILES is interpreted as a molecule, and it is the resultant molecule (not the SMILES string) which is subject to searching. Similarly, SMARTS is interpreted as a pattern; it is this pattern (not the SMARTS string) which is matched against molecules. For instance, the SMILES "C1=CC=CC=C1" (cyclohexatriene) is interpreted as the benzene molecule. This molecule will be matched by the SMARTS c1ccccc1, which is interpreted as the pattern "6 aromatic carbons in a ring". The SMARTS "C1=CC=CC=C1" makes a pattern ("six aliphatic carbons in a ring with alternating single and double bonds") which will not match benzene.

My point is not about SMILES, though. It's about bonding. I like this statement, that SMILES doesn't need any aromatic description to represent benzene. Same goes for what we are talking about using V3000 or whatever format.

Bob

Personally, I think they made a fundamental mistake in SMILES to allow aromatic descriptions there. Really they are much more useful and relevant in SMARTS, and because of this asymmetry of matching, are just a pain in SMILES.

Bob

On Mon, Feb 13, 2023 at 7:08 AM Antanas Vaitkus @.***> wrote:

I think it might be quite difficult to agree on a single bonding model that covers every situation so we could start with something simple and then extend it in the future as needed. Some general thoughts on the model:

  • It would be nice to be able to specify the bonding without explicitly assigning the bond type/order (e.g. only provide the connectivity graph). I guess this could be achieved by using the CML unknown bond type or something similar.
  • Maybe aromaticity should be a separate property of a bond rather than a bond type? This might be used to convey that certain bonds are aromatic, but described using the Kekulé notation. Furthermore, the OpenChemLib https://github.com/Actelion/openchemlib library actually differentiates between aromatic bonds that can resonate (e.g. in benzene) and the ones that have a more or less fixed bond order (e.g. in thiophene). Thus it is quite reasonable under some circumstances to define a bond as both being aromatic and having a specific bond order.

-- Robert M. Hanson Professor of Chemistry St. Olaf College Northfield, MN http://www.stolaf.edu/people/hansonr

If nature does not answer first what we want, it is better to take what answer we get.

-- Josiah Willard Gibbs, Lecture XXX, Monday, February 5, 1900

We stand on the homelands of the Wahpekute Band of the Dakota Nation. We honor with gratitude the people who have stewarded the land throughout the generations and their ongoing contributions to this region. We acknowledge the ongoing injustices that we have committed against the Dakota Nation, and we wish to interrupt this legacy, beginning with acts of healing and honest storytelling about this place.

eimrek commented 1 year ago

hi all.

@merkys, our covalent organic framework databases don't contain bond orders and there is currently no intention to add them, as far as I'm aware. @ltalirz @yakutovicha correct me if I'm wrong.

Regarding atomic charges: there are multiple methods to calculate them, e.g. Mulliken, Hirshfeld, Bader, ESP-derived, ... (https://mattermodeling.stackexchange.com/questions/1439/what-are-the-types-of-charge-analysis). Would the database provider simply decide which charges they provide? Still, it would be good to have information about the method of calculation.

Regarding bond orders, there's a similar argument: there are multiple ways to calculate bond orders that can give different results. Additionally, one thing to keep in mind is how to represent non-Kekulé molecules, e.g. triangulene, and unpaired electrons and radical sites in general.
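One way a provider could attach the calculation method to the charges is sketched below. The layout, the key names (`charges`, `method`, `values`) and the helper `charges_by_method` are all hypothetical, and the numbers are made up for illustration rather than taken from any calculation.

```python
# Hypothetical layout: one list of per-site charges per analysis method.
structure = {
    "nsites": 3,
    "charges": [
        {"method": "Mulliken", "values": [-0.8, 0.4, 0.4]},  # made-up numbers
        {"method": "Bader", "values": [-1.2, 0.6, 0.6]},     # made-up numbers
    ],
}


def charges_by_method(structure, method):
    """Return the per-site charges for `method`, or None if not provided."""
    for entry in structure.get("charges", []):
        if entry["method"].lower() == method.lower():
            if len(entry["values"]) != structure["nsites"]:
                raise ValueError("one charge per site is expected")
            return entry["values"]
    return None


print(charges_by_method(structure, "bader"))
print(charges_by_method(structure, "Hirshfeld"))
```

A client that needs one particular scheme can then ask for it by name and learn from a `None` result that the provider does not supply it.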

merkys commented 1 year ago

Thanks all for interesting responses. I agree that choosing the right representation for bond type/order will require a lot of thought. Thus I find @vaitkus's suggestion really appealing:

  • It would be nice to be able to specify the bonding without explicitly assigning the bond type/order (e.g. only provide the connectivity graph). I guess this could be achieved by using the CML unknown bond type or something similar.

Separating aromaticity from bond type/order is also a good suggestion.

How about starting from this:

"bonds": [ { "sites": [ 1, 2 ] } ]

I believe @eimrek's suggestion about specifying calculation methods should be promoted to a more general level, as other properties could benefit from such metadata as well.

BobHanson commented 1 year ago

I would prefer a more succinct format. Why duplicate "site" a zillion times? Maybe just array of arrays.

Suggest array of

[index1, index2, type]

Where type is reserved for future use and could be 0 for placeholder.

Mostly just reacting to needless byte bloat

On Fri, Feb 17, 2023, 6:51 AM Andrius Merkys @.***> wrote:

Thanks all for interesting responses. I agree that choosing the right representation for bond type/order will require a lot of thought. Thus I find @vaitkus's suggestion really appealing:

  • It would be nice to be able to specify the bonding without explicitly assigning the bond type/order (e.g. only provide the connectivity graph). I guess this could be achieved by using the CML unknown bond type or something similar.

Separating aromaticity from bond type/order is also a good suggestion.

How about starting from this:

"bonds": [ { "sites": [ 1, 2 ] } ]

  • sites would be the single REQUIRED property giving a list of sites participating in a bond. As @JPBergsma noted, the sites list could contain more than two sites.
  • JSON object describing a single bond could then later be expanded by introducing properties giving type/order, aromaticity and so on.

I believe @eimrek's suggestion about specifying calculation methods should be promoted to a more general level, as other properties could benefit from such metadata as well.

merkys commented 1 year ago

I would prefer a more succinct format. Why duplicate "site" a zillion times? Maybe just array of arrays.

I understand the pros of a more succinct representation, but I tried to retain consistency with the other OPTIMADE properties, which use explicit keys. Moreover, the suggested plain list representation would not allow for bonds involving more than two atoms. A placeholder value of 0 might be perceived as a zero-order bond by some. It is better to avoid placeholders altogether: if no "type" (or something like it) property is given in a bond object, nothing else but some sort of connectivity should be assumed.
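The trade-off between the two representations can be sketched as follows. This is only an illustration: the function `triples_to_objects` and the `type` key are hypothetical, and the conversion assumes 0 is the placeholder from the compact proposal.

```python
import json


def triples_to_objects(triples, placeholder=0):
    """Convert compact [index1, index2, type] triples to bond objects.

    The placeholder type is dropped rather than copied, so that it cannot be
    misread as a zero-order bond.
    """
    bonds = []
    for index1, index2, bond_type in triples:
        bond = {"sites": [index1, index2]}
        if bond_type != placeholder:
            bond["type"] = bond_type
        bonds.append(bond)
    return bonds


compact = [[1, 2, 0], [2, 3, 0]]
verbose = triples_to_objects(compact)
# A bond over three sites fits the object form but not a fixed-width triple.
verbose.append({"sites": [3, 4, 5]})
print(json.dumps({"bonds": verbose}))
```

The object form costs a repeated key per bond, but an absent `type` unambiguously means "connectivity only".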

ml-evs commented 1 year ago

It might be nice if this design could also capture generic "connectivity", and serve e.g. a list of sites within some cutoff of another site in PBCs. Having pre-computed neighbour lists can really help accelerate some applications and could allow for some kind of local environment/oxidation state searching expressed via correlated list queries (though this might require species data to be added to each bond, which may not be favourable), e.g., "give me all structures that contain SiO4 tetrahedra".

It would then still be up to the database to decide on this "calculation method", e.g., what distance cutoff to use (constant, sum of ionic/vdW radii, etc.).
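A minimal sketch of such a pre-computed neighbour list, emitted in the `sites` form proposed earlier in the thread. The function name `neighbour_bonds` is hypothetical, a constant cutoff is assumed, and only the 27 nearest periodic images are searched, which is only adequate when the cutoff is small compared with the cell. Each pair is reported once, so bonds between a site and its own image are not covered.

```python
import itertools
import math


def neighbour_bonds(lattice, frac_coords, cutoff):
    """Return bond objects for all site pairs closer than `cutoff` under PBC.

    lattice: three lattice vectors as rows; frac_coords: fractional positions.
    """
    def length(frac_delta):
        # Convert a fractional displacement to Cartesian and take its norm.
        cart = [sum(frac_delta[k] * lattice[k][axis] for k in range(3))
                for axis in range(3)]
        return math.sqrt(sum(c * c for c in cart))

    bonds = []
    n = len(frac_coords)
    for i in range(n):
        for j in range(i + 1, n):
            shortest = min(
                length([frac_coords[j][k] + shift[k] - frac_coords[i][k]
                        for k in range(3)])
                for shift in itertools.product((-1, 0, 1), repeat=3)
            )
            if shortest <= cutoff:
                # 1-based indices, following the [1, 2] example in the thread.
                bonds.append({"sites": [i + 1, j + 1]})
    return bonds


cubic = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
positions = [[0.0, 0.0, 0.0], [0.9, 0.0, 0.0], [0.5, 0.5, 0.5]]
print(neighbour_bonds(cubic, positions, cutoff=1.0))
```

In this example the first two sites are neighbours only across the cell boundary, which is the case a non-periodic distance check would miss.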

BobHanson commented 1 year ago

Yes, sorry, I was on my phone and, ah, still in bed... Meant to follow that with:

"That said, the more use of associative arrays, the more easily extended this will be."

Q: What else do we have that references sites like this?

merkys commented 1 year ago

@BobHanson

Yes, sorry, I was on my phone and, ah, still in bed... Meant to follow that with: "That said, the more use of associative arrays, the more easily extended this will be." Q: What else do we have that references sites like this?

OPTIMADE has assemblies to describe disorder, and that uses a similar level of verbosity.

merkys commented 1 year ago

This might be slightly off-topic, but how does one get atom bonding out of QM calculations? Can existence of bonds/their types be objectively detected via QM, or would one need some heuristic (i.e., distance-based criterion) to derive them? Pinging @gmrigna.

eimrek commented 1 year ago

This might be slightly off-topic, but how does one get atom bonding out of QM calculations? Can existence of bonds/their types be objectively detected via QM, or would one need some heuristic (i.e., distance-based criterion) to derive them? Pinging @gmrigna.

Here's a small overview of QM bond order methods: https://mattermodeling.stackexchange.com/questions/901/what-are-the-types-of-bond-orders/1508

Most of these (or at least the popular ones, Wiberg, Mayer and Laplacian, which I also have some experience with) are fully determined based on the electronic structure (so, the density/density matrix/occupied molecular orbitals/or derived orbitals) and the atom-atom distance is not "explicitly" used.
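As an illustration of a bond order that depends only on the electronic structure, here is a sketch of the Wiberg index as it is commonly defined for a closed-shell density matrix P in an orthonormal atomic-orbital basis: the sum of P[mu][nu] squared over basis functions mu on atom A and nu on atom B. The function name and the toy two-orbital H2 model are assumptions made for this example.

```python
def wiberg_bond_index(density, basis_on_a, basis_on_b):
    """Sum of squared density-matrix elements between two atoms' basis functions."""
    return sum(density[mu][nu] ** 2 for mu in basis_on_a for nu in basis_on_b)


# Toy model of H2: one orthonormal basis function per atom and a doubly
# occupied bonding orbital with coefficients (1/sqrt(2), 1/sqrt(2)),
# so that P = 2 * c * c^T.
coefficients = [2 ** -0.5, 2 ** -0.5]
density = [[2 * ci * cj for cj in coefficients] for ci in coefficients]

# Function 0 sits on the first atom and function 1 on the second.
print(wiberg_bond_index(density, [0], [1]))
```

For this toy model the index comes out as approximately 1, a single bond, and no interatomic distance enters the formula.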

merkys commented 1 year ago

Suggestion for a queryable property: