Open avcopan opened 2 hours ago
I agree that this would be very useful, and I have been planning to do this for our schema. My inclination, however, would be to avoid tying this to a string identifier like InChI. Instead, I think we can take a simpler and more flexible approach that is purely based on the geometry and requires no additional information about connectivity, stereochemistry, etc.
Probably the simplest approach would be something like the following:
C -2.0 0.0 0.0
C -1.0 1.0 0.0
... (other C atoms sorted by x,y,z)
H -3.0 1.0 0.0
H -3.0 1.0 1.0
... (other H atoms sorted by x,y,z)
... (other atoms alphabetically, sorted by x,y,z)
From my understanding, this is essentially what they do in MolSSI's QCArchive project.
The advantage of this approach is that it is very simple and allows you to easily check for duplicate structures. The disadvantage is that the atom ordering will generally not be consistent between different conformers of the same species.
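A minimal Python sketch of this coordinate-based ordering, for illustration only. It is not taken from QCArchive or AutoMech, and the names `canonical_order` and `is_duplicate` are hypothetical. It assumes both geometries are already expressed in the same standardized frame; the rounding precision and the tolerance are arbitrary choices.

```python
from typing import List, Sequence, Tuple

Atom = Tuple[str, float, float, float]  # (element symbol, x, y, z)


def element_rank(symbol: str) -> Tuple[int, str]:
    """Rank C first, H second, then all other elements alphabetically."""
    if symbol == "C":
        return (0, symbol)
    if symbol == "H":
        return (1, symbol)
    return (2, symbol)


def canonical_order(atoms: Sequence[Atom], ndigits: int = 3) -> List[Atom]:
    """Sort atoms by element rank, then by rounded x, y, z coordinates."""

    def key(atom: Atom):
        symbol, x, y, z = atom
        # Rounding is an assumption: it keeps tiny numerical noise from
        # changing the order, but values near a rounding boundary can still flip.
        return (element_rank(symbol),
                round(x, ndigits), round(y, ndigits), round(z, ndigits))

    return sorted(atoms, key=key)


def is_duplicate(atoms_a: Sequence[Atom], atoms_b: Sequence[Atom],
                 tol: float = 1e-3) -> bool:
    """Compare two geometries atom by atom after putting both in canonical order."""
    a, b = canonical_order(atoms_a), canonical_order(atoms_b)
    if len(a) != len(b):
        return False
    for (sym_a, *xyz_a), (sym_b, *xyz_b) in zip(a, b):
        if sym_a != sym_b:
            return False
        if any(abs(p - q) > tol for p, q in zip(xyz_a, xyz_b)):
            return False
    return True
```

Because the sort key depends only on element and position, two input files that list the same atoms in a different order compare as duplicates, while two conformers of one species will usually end up with unrelated orderings.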
The other simple approach is to use InChI to define a standard atom ordering, but there are several problems with this:
In AutoMech, we have addressed both of these points in a domain-specific way with AMChI, but generalizing it to cover all cases is impossible. For example: how would you define a unique identifier for the transition state between two van der Waals complexes?
What might make sense, though, is a sort of hybrid approach where the atom ordering is partially based on a connectivity graph defined by distance thresholds, and we only use the standardized coordinate values to break ties due to topological symmetry. This could allow for more flexibility, where the atom ordering is preserved even for fairly significant conformational changes. I'm not sure it could ever be perfect, but it could be worth doing.
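One possible reading of this hybrid idea, as a rough Python sketch rather than a worked-out proposal. Everything in it is an assumption: the function names are hypothetical, the covalent radii are approximate illustrative values, the 1.3 scale factor is arbitrary, and the Morgan-style refinement used to classify atoms is a simple stand-in that does not separate all inequivalent atoms in every graph. Coordinates are assumed to be in a standardized frame already.

```python
import math
from typing import List, Sequence, Tuple

Atom = Tuple[str, float, float, float]  # (element symbol, x, y, z)

# Approximate covalent radii in angstroms, for illustration only.
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}


def bonded_neighbors(atoms: Sequence[Atom], scale: float = 1.3) -> List[List[int]]:
    """Connect two atoms when their distance is within scale * (r_i + r_j)."""
    nbrs: List[List[int]] = [[] for _ in atoms]
    for i, (sym_i, *xyz_i) in enumerate(atoms):
        for j in range(i + 1, len(atoms)):
            sym_j, *xyz_j = atoms[j]
            cutoff = scale * (COVALENT_RADII[sym_i] + COVALENT_RADII[sym_j])
            if math.dist(xyz_i, xyz_j) <= cutoff:
                nbrs[i].append(j)
                nbrs[j].append(i)
    return nbrs


def rank(labels: Sequence) -> List[int]:
    """Replace each label by its position among the sorted distinct labels."""
    position = {label: k for k, label in enumerate(sorted(set(labels)))}
    return [position[label] for label in labels]


def topological_classes(atoms: Sequence[Atom], nbrs: List[List[int]]) -> List[int]:
    """Morgan-style refinement: split classes by neighbor classes until stable."""
    n = len(atoms)
    classes = rank([(atoms[i][0], len(nbrs[i])) for i in range(n)])
    while True:
        refined = rank([
            (classes[i], tuple(sorted(classes[j] for j in nbrs[i])))
            for i in range(n)
        ])
        # Each pass can only split classes, so an unchanged count means stable.
        if len(set(refined)) == len(set(classes)):
            return refined
        classes = refined


def hybrid_order(atoms: Sequence[Atom], scale: float = 1.3,
                 ndigits: int = 3) -> List[int]:
    """Return input indices sorted by element, topological class, then x, y, z."""
    classes = topological_classes(atoms, bonded_neighbors(atoms, scale))

    def key(i: int):
        sym, x, y, z = atoms[i]
        element_rank = (0 if sym == "C" else 1 if sym == "H" else 2, sym)
        # Coordinates only matter for atoms the graph cannot tell apart.
        return (element_rank, classes[i],
                round(x, ndigits), round(y, ndigits), round(z, ndigits))

    return sorted(range(len(atoms)), key=key)
```

In a C-C-O-H chain, for example, the terminal carbon always sorts before the central one regardless of where the atoms sit in space, whereas the two hydrogens of water fall into the same topological class and are ordered by their coordinates alone.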
Note: The use of distance thresholds still entails a lot of complexities for things like van der Waals complexes and transition states. For example, the breaking/forming bond lengths of a specific transition state conformer could vary significantly between levels of theory, so that you still end up with different atom orderings for the same TS conformer. We could adjust the distance thresholds to reduce the likelihood of this, but I don't think the possibility can be eliminated.
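To make the threshold sensitivity concrete, here is a small Python sketch. The bond lengths are hypothetical numbers chosen for illustration, not results from any calculation, and the radii and scale factors are assumed values.

```python
def is_bonded(distance: float, radius_a: float, radius_b: float,
              scale: float = 1.3) -> bool:
    """Distance-threshold bond test: bonded if within scale * (r_a + r_b)."""
    return distance <= scale * (radius_a + radius_b)


# Approximate covalent radii in angstroms (illustrative values).
R_C, R_H = 0.76, 0.31

# Hypothetical forming C-H bond length of one TS conformer, as it might
# come out at two different levels of theory.
length_level_1 = 1.35
length_level_2 = 1.45

# With scale=1.3 the cutoff is about 1.39 angstroms, so the two geometries
# get different connectivity graphs and hence possibly different orderings.
graph_differs = is_bonded(length_level_1, R_C, R_H) != is_bonded(length_level_2, R_C, R_H)

# A looser threshold (scale=1.5, cutoff about 1.61 angstroms) treats both as
# bonded, but another TS can always straddle whatever cutoff is chosen.
graph_differs_loose = (is_bonded(length_level_1, R_C, R_H, scale=1.5)
                       != is_bonded(length_level_2, R_C, R_H, scale=1.5))
```

Here `graph_differs` is true and `graph_differs_loose` is false, which illustrates how adjusting the threshold moves the problem rather than removing it.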
I wanted to create a separate thread for discussing this issue raised by @flrnslbch:
Originally posted by @flrnslbch in https://github.com/avcopan/Reaction-Model-Electronic-Structure-Schema/issues/1#issuecomment-2512162635