Isomorphism checking when the new adjacency list format is used

nickvandewiele commented 9 years ago

With the migration to the new adjacency list format, and new atom attributes like number of unpaired electrons, number of electron pairs, and charge, and molecule attributes like spin multiplicity, it's quite possible that the isomorphism methods that compare groups to molecules and molecules to molecules will result in false positive, and false negative matches.

I will:

look into how the new adjacency list format impacts isomorphism matching
see if the current unit tests are adequate in testing this, and if not, add some new tests
report and modify incorrect behavior of isomorphism matching methods in RMG-Py

nickvandewiele commented 9 years ago

The graph isomorphism and subgraph isomorphism algorithms are not that easily to follow especially given that they can be applied in different circumstances, so I'm posting parts of the architecture of these algorithms.

First, some definitions: graph isomorphism is an algorithm that determines whether graph A and graph B are the same graphs.

subgraph isomorphism is an algorithm that determines whether graph A is part of graph B.

Usage in RMG: Graph isomorphism is used to check whether:

a newly created molecule is present in the core/edge of the RMG simulation,
we can use the thermochemistry of a specific molecule in a thermolibrary for a new molecule
we can use the kinetics of a specific reaction in a reaction library/seed mechanism for a new reaction

Subgraph isomorphism is used to check whether:

a benson group is present in a molecule
a kinetics group is present in a reactant of a reaction

Available methods: RMG-Py has two kinds of methods that access graph and subgraph isomorphism:

isIsomorphic, isSubGraphIsomorphic: returns a yes/no answer
findIsomorphism, findSubgraphIsomorphisms: returns a mapping between target and query structure.

The 4 methods above can be found in either of the two main data structures of RMG: Molecule and Group, both found in the rmgpy/molecule subpackage. Note that findIsomorphism emphasizes that there is one single unique isomorphism between 2 graphs under the condition that one does make identical atoms distinguishable. On the other hands, a fragment (subgraph) may have multiple distinct and unique matches in a target graph, therefore, the method names uses the plural: findSubgraphIsomorphisms.

Although two methods exist for each operation, one in molecule.Molecule, one in group.Group, both call the same underlying methods in graph.Graph. Graph is a very generic datastructure in which vertices and edges are the main components. Both Molecule and Group inherit from Graph, and decorate the basic data structure with some other attributes like electrons.

Graph, on its own, does not contain the actual algorithm, but redirects it to the module vf2, that contains the actual algorithm, aka VF2 (Vento-Foggia). In addition, although there are 4 methods per class, they all refer to the same underlying method in VF2:

cdef isomorphism(self, Graph graph1, Graph graph2, dict initialMapping, bint subgraph, bint findAll):

in which the flags subgraph and findAll are used to add some flavor (eg graph vs subgraph isomorphism). Besides the fact that the algorithm is recursive in nature, one method call in feasible(vertex1, vertex2) is really important:

        if self.subgraph:
            if not vertex1.isSpecificCaseOf(vertex2): return False
        else:
            if not vertex1.equivalent(vertex2): return False

The two methods isSpecificCaseOf and equivalent are overridden in the two implementations: Atom and GroupAtom in Molecule and Group respectively.

Keep in that while the object that calls the method is always part of the class it belongs to, the parameter object of the method can either belong to Atom or GroupAtom. Therefore, each implementation of the method has a conditional in which it is checked to which class the parameter object belong.

The difference between Atom and GroupAtom is that for the latter each atom attribute (atomtype, radicalElectrons, charge, lonePairs) is a list of possible values, rather than one specific value. Therefore, the equivalent and isSpecificCaseOf methods need to iterate over the list rather than checking a single value.

The only remaining complexity is the comparison of atomtypes. atomtype.AtomType also has equivalent and isSpecificCaseOf method.

AtomType.equivalent(atomtype) roughly checks whether two atomtypes are connected by a path that goes straight up or straight down the atom type tree.

E.g.: 'R!H'.equivalent('C')returnsTrue 'C'.equivalent('R!H') returnsTrue 'C'.equivalent('Os') returnsFalse

In other words, it is a symmetric function.

AtomType.isSpecificCaseOf(atomtype) is not symmetric, as the name suggests: 'R!H'.isSpecificCaseOf('C') returns False 'C'.isSpecificCaseOf('R!H') returnsTrue

To put everything together, the following example shows the comparison of two groups 'R!H' and 'C' and the two isomorphism methods:

'R!H' isIsomorphic.( 'C' )returnsTruesince it calls'R!H'.equivalent('C') 'R!H' isSubGraphIsomorphic.( 'C' )returnsFalse since it calls 'R!H'.isSpecificCaseOf('C')

rwest commented 9 years ago

This type of analysis and write-up is so helpful (thanks @nickvandewiele!) that it must end up in our developer documentation, and not just as comments on an issue (though this is also a good place). Perhaps even in docstrings somewhere in the code that end up in the documentation via autodoc?

nickvandewiele commented 9 years ago

@connie and I discussed some additional implications of the new adjacency list format on (sub)graph isomorphism.

As explained above, (sub)graph isomorphism boils down to Atom and Bond comparisons, and each implementation of the equivalent and isSpecificCaseOf methods of vertices/edges defines when 2 atoms or bonds are equal, and hence when two graphs are isomorph.

Since we transferred the information on spin multiplicity from an atom attribute (remember: 1 C 2T ...) to a Molecule or Species attribute, the graph isomorphism methods have no way of telling two Molecule's or Groups apart anymore when they have different multiplicities:

E.g.:

mol = Molecule().fromAdjacencyList("""
multiplicity 1
1 C u2 p0 c0
""", saturateH=True)

mol2 = Molecule().fromAdjacencyList("""
multiplicity 3
1 C u2 p0 c0
""", saturateH=True)

print mol.isIsomorphic(mol2) #True

As a result, unless we add additional multiplicity comparisons on the graph isomorphism class levels of Molecule or Species, this will lead to false positives.

Moreover, the above example not only also shows a false positive of graph isomorphism, it also shows that people can define molecules that do not comply with Hundt's rule:

Hundt's rule
every orbital in a subshell is singly occupied with one electron before any one orbital is doubly occupied, and all electrons in singly occupied orbitals have the same spin

The above molecule with 2 unpaired electrons on a single atom, should always have multiplicity 3. One way to enforce this is to add a consistency check in the Molecule().fromAdjacencyList(...), similar to the consistency checking of the combination of u, p, c, and of the multiplicity and u, and to thrown an error when the input adjacency list does not comply with Hundt's rule.

rwest commented 9 years ago

I think the graphs are isomorphic, although the molecules are unique if they have a different spin. So you're right, we would need equivalency checks in the Molecule or Species class to check the multiplicity before saying the molecules/species are identical (should we use the word isomorphic in this case?).

Does Hundt's rule only describe ground states? Are you saying it should not be possible to represent singlet CH2 in RMG? Almost all kinetic models out there have both CH2 and CH2(S) in them....

nickvandewiele commented 9 years ago

The next issue that came up in the discussion is the treatment of unspecified lone pair electrons in Groups.

Consider the following example of a generic R group in the old format:

1 R 0 {...}

or in the new format, this would translate into:

1 R u0 p??? ...

where the ??? can be filled in with what seems to be most appropriate value.

Right now, if we can't find a 'lone pair' identifier in the new format, adjlist.fromAdjacencyList(...) will add None when Groups are concerned: https://github.com/GreenGroup/RMG-Py/blob/master/rmgpy/molecule/adjlist.py#L474

else:
                if group:
                    lonePairs.append(None)

The None translates into an empty list and leads to non-matching between generic R groups and whatever Atom we match it against, because the atom comparison methods iterate over an empty list: https://github.com/nickvandewiele/RMG-Py/blob/isomorphism_check_new_format/rmgpy/molecule/molecule.py#L207

for lp in ap.lonepairs:
                if self.lonePairs == lp: break
else:
                return False

This is probably not the behavior that we want, since an R group should match at least some atoms.

On the other hand, we don't want to create list of all possible lone pairs e.g. [1,2,3,4,5] since that would match singlet carbene atoms 1 C u0 p1 c0 to the generic R node. Ideally, we would fill in the lone pair possibilities of the generic R group with the default lone pair count of the element it is matched against.

E.g. if we would match an R group against an Oxygen atom, we would fill in the lone pair count of R with the number 2.

1 R u0 p2 ...

If it were matched against a Carbon atom, we would fill in the lone pair count of R with the number 0.

1 R u0 p0 ...

rwest commented 9 years ago

I would have thought None should mean unspecified and therefore accept anything, wheras [] would be an empty list and accept nothing. These are different things in Python and it shouldn't be a problem to tell them apart (sounds like we messed up somewhere in the current implementation. I was a little concerned when I read 'None translates into an empty list'...).

But I'm confused - why shouldn't carbene 1 C u0 p1 c0 match against an R group? whether it's old-fashioned 1 R 0 or new 1 R u[0] p[0,1,2,3,4,5] (or its shorthand 1 R u[0] when implemented correctly) ?

edit: Is it because we used to represent carbene as 1 C 2 and therefore it didn't used to match our R nodes that were spelt 1 R 0 ? If this is the case, then yes, what we meant by 1 R 0 in this context was in fact "anything with no radicals and no carbenes either" which, in the new syntax, is 1 R u[0] p[0] (except that forbids Oxygen? I think I see the problem...?)

nickvandewiele commented 9 years ago

I agree that singlet carbenes should be able to exist, and should be treated distinctly from triplet carbenes.

@connie and I thought that we could use Hundt's rule to determine the spin multiplicity of species with electrons in different orbitals, e.g. 1 X u2 p0 type species, but ignore Hundt's rule for electrons in the same orbital, e.g. 1 X u0 p1.

rwest commented 9 years ago

Ah, so singlet CH2(S) would be 1 C u0 p1 and triplet CH2(T) (usually called CH2) would be 1 C u2 p0. So if you tried to specify the first as a singlet it would throw an exception? That makes more sense. It's going to take me a while to learn to think in terms of these pairs!

nickvandewiele commented 9 years ago

0ad3ba9fbc2b2a7ef7c5c4a2d5a7fa01e4d1d502 adds multiplicity checking in Molecule.findIsomorphism(...). Multiplicity checking was already present in Molecule.isIsomorph(...)

2775356341911d97d2a3b43533bab2f3848b240f are some tests showing the correct behavior.

nickvandewiele commented 9 years ago

cb810dc6b3c248d4897e7575e3b65e38e4ed16f4 is an implementation of Hundt's rule for unpaired electrons in distinct orbitals on the same atom, including some tests.

github-actions[bot] commented 1 year ago

This issue is being automatically marked as stale because it has not received any interaction in the last 90 days. Please leave a comment if this is still a relevant issue, otherwise it will automatically be closed in 30 days.

ReactionMechanismGenerator / RMG-Py

Isomorphism checking when the new adjacency list format is used #279