avcopan / Reaction-Model-Electronic-Structure-Schema

A repository for drafting a schema for reaction model electronic structure data
MIT License

Standardizing orientation and atom ordering for stationary points #2

Open avcopan opened 2 hours ago

avcopan commented 2 hours ago

I wanted to create a separate thread for discussing this issue raised by @flrnslbch:

We may even be able to find a canonical form of a point based on its connectivity, stereocenters, etc. by using InChIs and co (or our own identifiers). This would greatly simplify automation across different datasets using this canonical representation.

Originally posted by @flrnslbch in https://github.com/avcopan/Reaction-Model-Electronic-Structure-Schema/issues/1#issuecomment-2512162635

avcopan commented 2 hours ago

I agree that this would be very useful, and I have been planning to do this for our schema. My inclination, however, would be to avoid tying this to a string identifier like InChI. Instead, I think we can take a simpler and more flexible approach that is purely based on the geometry and requires no additional information about connectivity, stereochemistry, etc.

Probably the simplest approach would be something like the following:

  1. Use the principal axes of rotation as a standard reference frame (sorted from smallest to largest moment of inertia)
  2. Order the atoms by type (e.g. following the Hill ordering) and by xyz coordinates relative to the standard reference frame. For example:
    C  -2.0  0.0  0.0
    C  -1.0  1.0  0.0
    ... (other C atoms sorted by x,y,z)
    H  -3.0  1.0  0.0
    H  -3.0  1.0  1.0
    ... (other H atoms sorted by x,y,z)
    ... (other atoms alphabetically, sorted by x,y,z)
  3. Possibly, it would also make sense to round the coordinate values to some number of decimal places.
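
To make the three steps above concrete, here is a minimal sketch in Python with NumPy. It is only an illustration under stated assumptions, not an implementation from any existing package: the `MASSES` table and the function names `inertia_tensor`, `hill_key` and `standardize` are hypothetical. It also does not resolve two known ambiguities: the sign of each principal axis, and the choice of axes when two or more moments of inertia are degenerate.

```python
import numpy as np

# Approximate atomic masses in amu (assumed lookup table; extend as needed).
MASSES = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def inertia_tensor(masses, xyz):
    """Inertia tensor of point masses at positions xyz (origin at centre of mass)."""
    r2 = (xyz**2).sum(axis=1)
    outer = xyz[:, :, None] * xyz[:, None, :]
    return np.einsum("i,ijk->jk", masses, r2[:, None, None] * np.eye(3) - outer)


def hill_key(symbol, has_carbon):
    """Hill ordering: C, then H, then alphabetical; purely alphabetical without C."""
    if has_carbon and symbol == "C":
        return (0, symbol)
    if has_carbon and symbol == "H":
        return (1, symbol)
    return (2, symbol)


def standardize(symbols, coords, decimals=3):
    """Return (symbols, xyz) in the principal-axis frame with a standard atom order."""
    coords = np.asarray(coords, dtype=float)
    masses = np.array([MASSES[s] for s in symbols])

    # Step 1: shift to the centre of mass and rotate into the principal axes.
    # eigh returns eigenvalues in ascending order, so the axes come out sorted
    # from smallest to largest moment of inertia. Axis signs are NOT fixed here.
    xyz = coords - masses @ coords / masses.sum()
    _, axes = np.linalg.eigh(inertia_tensor(masses, xyz))
    xyz = xyz @ axes

    # Step 3: round the coordinates; adding 0.0 turns -0.0 into 0.0.
    xyz = np.round(xyz, decimals) + 0.0

    # Step 2: sort by element (Hill order), then by x, y, z.
    has_carbon = "C" in symbols
    order = sorted(
        range(len(symbols)),
        key=lambda i: (hill_key(symbols[i], has_carbon), tuple(xyz[i])),
    )
    return [symbols[i] for i in order], xyz[order]
```

Calling `standardize(["H", "O", "C", "H"], coords)` returns the atoms in the order C, H, H, O, with the two H atoms sorted by their x, y, z values in the standard frame. A real implementation would additionally need a sign convention for the axes so that the same geometry always maps to the same coordinates.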

From my understanding, this is essentially what they do in MolSSI's QCArchive project.

The advantage of this approach is that it is very simple and allows you to easily check for duplicate structures. The disadvantage is that the atom ordering will generally not be consistent between different conformers of the same species.
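
As a rough illustration of the duplicate check, one could hash the rounded, standardized coordinates and compare the hashes. This is a sketch only: the function name `fingerprint` and the text layout it hashes are assumptions, and it presumes the symbols and coordinates have already been put into the standard frame and order. Rounding is also not robust for values that sit close to a rounding boundary.

```python
import hashlib


def fingerprint(symbols, xyz, decimals=3):
    """Hash of an already-standardized structure, for duplicate detection."""
    lines = []
    for sym, row in zip(symbols, xyz):
        # Round each coordinate; adding 0.0 turns -0.0 into 0.0 so that
        # tiny negative and tiny positive values format identically.
        vals = [round(float(v), decimals) + 0.0 for v in row]
        lines.append(sym + " " + " ".join(f"{v:.{decimals}f}" for v in vals))
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()
```

Two structures whose standardized coordinates agree to the chosen number of decimal places then get the same fingerprint, so duplicates can be found with a dictionary or set lookup instead of pairwise geometry comparisons.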

avcopan commented 1 hour ago

The other simple approach is to use InChI to define a standard atom ordering, but there are several problems with this:

  1. Considering stereochemistry opens the same can of worms that we have already referenced in our species discussion.
  2. InChI doesn't work for saddle points (transition states).

In AutoMech, we have addressed both of these points in a domain-specific way with AMChI, but generalizing it to cover all cases is impossible. For example: how would you define a unique identifier for the transition state between two van der Waals complexes?

avcopan commented 1 hour ago

What might make sense, though, is a sort of hybrid approach where the atom ordering is partially based on a connectivity graph defined by distance thresholds, and we only use the standardized coordinate values to break ties due to topological symmetry. This could allow for more flexibility, with the atom ordering preserved even for fairly significant conformational changes. I'm not sure it could ever be perfect, but it could be worth doing.

Note: The use of distance thresholds still entails a lot of complexities for things like van der Waals complexes and transition states. For example, the breaking/forming bond lengths of a specific transition state conformer could vary significantly between levels of theory, so that you still end up with different atom orderings for the same TS conformer. We could adjust the distance thresholds to reduce the likelihood of this, but I don't think the possibility can be eliminated.
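
A minimal sketch of what the hybrid approach might look like, assuming bonds are detected as distances below a scaled sum of covalent radii and topological classes are assigned by simple colour refinement (a Morgan-style iteration). Everything named here is hypothetical: the `COVALENT_RADII` table, the `scale` threshold factor, and the functions `connectivity`, `refine_classes` and `hybrid_order`. The coordinates are assumed to already be in the standard reference frame, and elements are ranked alphabetically for brevity rather than by Hill order.

```python
import numpy as np

# Approximate covalent radii in angstrom (assumed values; extend as needed).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}


def connectivity(symbols, coords, scale=1.3):
    """Neighbour lists from a distance threshold: scale * (r_i + r_j)."""
    coords = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    nbrs = [[] for _ in symbols]
    for i in range(len(symbols)):
        for j in range(i + 1, len(symbols)):
            cutoff = scale * (COVALENT_RADII[symbols[i]] + COVALENT_RADII[symbols[j]])
            if dist[i, j] <= cutoff:
                nbrs[i].append(j)
                nbrs[j].append(i)
    return nbrs


def refine_classes(symbols, nbrs):
    """Colour refinement: atoms get the same class only if their element and
    the classes of their neighbours agree, iterated until nothing changes."""

    def rank(signatures):
        lookup = {s: r for r, s in enumerate(sorted(set(signatures)))}
        return [lookup[s] for s in signatures]

    classes = rank(list(symbols))
    for _ in range(len(symbols)):
        signatures = [
            (classes[i], tuple(sorted(classes[j] for j in nbrs[i])))
            for i in range(len(symbols))
        ]
        refined = rank(signatures)
        if refined == classes:
            break
        classes = refined
    return classes


def hybrid_order(symbols, coords, scale=1.3, decimals=3):
    """Atom indices sorted by topological class; rounded coordinates break ties."""
    coords = np.asarray(coords, dtype=float)
    classes = refine_classes(symbols, connectivity(symbols, coords, scale))
    return sorted(
        range(len(symbols)),
        key=lambda i: (classes[i], tuple(np.round(coords[i], decimals))),
    )
```

For a methanol-like geometry, the three methyl hydrogens fall into one class and the hydroxyl hydrogen into another, so rotating the OH group does not move the hydroxyl hydrogen relative to the other atom classes; coordinates only decide the order within the methyl group. The `scale` factor is the adjustable threshold discussed in the note above, and the caveat there still applies: a breaking or forming bond near the cutoff can flip the graph, and with it the ordering, between levels of theory.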