Closed nickvandewiele closed 12 months ago
The graph isomorphism and subgraph isomorphism algorithms are not that easily to follow especially given that they can be applied in different circumstances, so I'm posting parts of the architecture of these algorithms.
First, some definitions: graph isomorphism is an algorithm that determines whether graph A and graph B are the same graphs.
subgraph isomorphism is an algorithm that determines whether graph A is part of graph B.
Usage in RMG: Graph isomorphism is used to check whether:
Subgraph isomorphism is used to check whether:
Available methods: RMG-Py has two kinds of methods that access graph and subgraph isomorphism:
isIsomorphic, isSubGraphIsomorphic
: returns a yes/no answerfindIsomorphism, findSubgraphIsomorphisms
: returns a mapping between target and query structure.The 4 methods above can be found in either of the two main data structures of RMG: Molecule
and Group
, both found in the rmgpy/molecule
subpackage. Note that findIsomorphism
emphasizes that there is one single unique isomorphism between 2 graphs under the condition that one does make identical atoms distinguishable. On the other hands, a fragment (subgraph) may have multiple distinct and unique matches in a target graph, therefore, the method names uses the plural: findSubgraphIsomorphisms
.
Although two methods exist for each operation, one in molecule.Molecule
, one in group.Group
, both call the same underlying methods in graph.Graph
. Graph
is a very generic datastructure in which vertices and edges are the main components. Both Molecule
and Group
inherit from Graph
, and decorate the basic data structure with some other attributes like electrons.
Graph, on its own, does not contain the actual algorithm, but redirects it to the module vf2
, that contains the actual algorithm, aka VF2 (Vento-Foggia). In addition, although there are 4 methods per class, they all refer to the same underlying method in VF2:
cdef isomorphism(self, Graph graph1, Graph graph2, dict initialMapping, bint subgraph, bint findAll):
in which the flags subgraph
and findAll
are used to add some flavor (eg graph vs subgraph isomorphism). Besides the fact that the algorithm is recursive in nature, one method call in feasible(vertex1, vertex2)
is really important:
if self.subgraph:
if not vertex1.isSpecificCaseOf(vertex2): return False
else:
if not vertex1.equivalent(vertex2): return False
The two methods isSpecificCaseOf
and equivalent
are overridden in the two implementations: Atom
and GroupAtom
in Molecule
and Group
respectively.
Keep in that while the object that calls the method is always part of the class it belongs to, the parameter object of the method can either belong to Atom
or GroupAtom
. Therefore, each implementation of the method has a conditional in which it is checked to which class the parameter object belong.
The difference between Atom
and GroupAtom
is that for the latter each atom attribute (atomtype, radicalElectrons, charge, lonePairs) is a list of possible values, rather than one specific value. Therefore, the equivalent
and isSpecificCaseOf
methods need to iterate over the list rather than checking a single value.
The only remaining complexity is the comparison of atomtypes. atomtype.AtomType
also has equivalent
and isSpecificCaseOf
method.
AtomType.equivalent(atomtype)
roughly checks whether two atomtypes are connected by a path that goes straight up or straight down the atom type tree.
E.g.:
'R!H'.equivalent('C')
returnsTrue
'C'.equivalent('R!H')
returnsTrue
'C'.equivalent('Os')
returnsFalse
In other words, it is a symmetric function.
AtomType.isSpecificCaseOf(atomtype)
is not symmetric, as the name suggests:
'R!H'.isSpecificCaseOf('C')
returns False
'C'.isSpecificCaseOf('R!H')
returnsTrue
To put everything together, the following example shows the comparison of two groups 'R!H' and 'C' and the two isomorphism methods:
'R!H' isIsomorphic.( 'C' )
returnsTrue
since it calls'R!H'.equivalent('C')
'R!H' isSubGraphIsomorphic.( 'C' )
returnsFalse
since it calls 'R!H'.isSpecificCaseOf('C')
This type of analysis and write-up is so helpful (thanks @nickvandewiele!) that it must end up in our developer documentation, and not just as comments on an issue (though this is also a good place). Perhaps even in docstrings somewhere in the code that end up in the documentation via autodoc?
@connie and I discussed some additional implications of the new adjacency list format on (sub)graph isomorphism.
As explained above, (sub)graph isomorphism boils down to Atom and Bond comparisons, and each implementation of the equivalent
and isSpecificCaseOf
methods of vertices/edges defines when 2 atoms or bonds are equal, and hence when two graphs are isomorph.
Since we transferred the information on spin multiplicity from an atom attribute (remember: 1 C 2T ...
) to a Molecule
or Species
attribute, the graph isomorphism methods have no way of telling two Molecule's or Groups apart anymore when they have different multiplicities:
E.g.:
mol = Molecule().fromAdjacencyList("""
multiplicity 1
1 C u2 p0 c0
""", saturateH=True)
mol2 = Molecule().fromAdjacencyList("""
multiplicity 3
1 C u2 p0 c0
""", saturateH=True)
print mol.isIsomorphic(mol2) #True
As a result, unless we add additional multiplicity comparisons on the graph isomorphism class levels of Molecule
or Species
, this will lead to false positives.
Moreover, the above example not only also shows a false positive of graph isomorphism, it also shows that people can define molecules that do not comply with Hundt's rule:
Hundt's rule
every orbital in a subshell is singly occupied with one electron before any one orbital is doubly occupied, and all electrons in singly occupied orbitals have the same spin
The above molecule with 2 unpaired electrons on a single atom, should always have multiplicity 3.
One way to enforce this is to add a consistency check in the Molecule().fromAdjacencyList(...)
, similar to the consistency checking of the combination of u, p, c, and of the multiplicity and u, and to thrown an error when the input adjacency list does not comply with Hundt's rule.
I think the graphs are isomorphic, although the molecules are unique if they have a different spin. So you're right, we would need equivalency checks in the Molecule or Species class to check the multiplicity before saying the molecules/species are identical (should we use the word isomorphic in this case?).
Does Hundt's rule only describe ground states? Are you saying it should not be possible to represent singlet CH2 in RMG? Almost all kinetic models out there have both CH2 and CH2(S) in them....
The next issue that came up in the discussion is the treatment of unspecified lone pair electrons in Groups.
Consider the following example of a generic R group in the old format:
1 R 0 {...}
or in the new format, this would translate into:
1 R u0 p??? ...
where the ??? can be filled in with what seems to be most appropriate value.
Right now, if we can't find a 'lone pair' identifier in the new format, adjlist.fromAdjacencyList(...)
will add None when Groups are concerned:
https://github.com/GreenGroup/RMG-Py/blob/master/rmgpy/molecule/adjlist.py#L474
else:
if group:
lonePairs.append(None)
The None translates into an empty list and leads to non-matching between generic R groups and whatever Atom we match it against, because the atom comparison methods iterate over an empty list: https://github.com/nickvandewiele/RMG-Py/blob/isomorphism_check_new_format/rmgpy/molecule/molecule.py#L207
for lp in ap.lonepairs:
if self.lonePairs == lp: break
else:
return False
This is probably not the behavior that we want, since an R group should match at least some atoms.
On the other hand, we don't want to create list of all possible lone pairs e.g. [1,2,3,4,5] since that would match singlet carbene atoms 1 C u0 p1 c0
to the generic R node. Ideally, we would fill in the lone pair possibilities of the generic R group with the default lone pair count of the element it is matched against.
E.g. if we would match an R group against an Oxygen atom, we would fill in the lone pair count of R with the number 2.
1 R u0 p2 ...
If it were matched against a Carbon atom, we would fill in the lone pair count of R with the number 0.
1 R u0 p0 ...
I would have thought None
should mean unspecified and therefore accept anything, wheras []
would be an empty list and accept nothing. These are different things in Python and it shouldn't be a problem to tell them apart (sounds like we messed up somewhere in the current implementation. I was a little concerned when I read 'None translates into an empty list'...).
But I'm confused - why shouldn't carbene 1 C u0 p1 c0
match against an R group? whether it's old-fashioned 1 R 0
or new 1 R u[0] p[0,1,2,3,4,5]
(or its shorthand 1 R u[0]
when implemented correctly) ?
edit:
Is it because we used to represent carbene as 1 C 2
and therefore it didn't used to match our R nodes that were spelt 1 R 0
? If this is the case, then yes, what we meant by 1 R 0
in this context was in fact "anything with no radicals and no carbenes either" which, in the new syntax, is 1 R u[0] p[0]
(except that forbids Oxygen? I think I see the problem...?)
I agree that singlet carbenes should be able to exist, and should be treated distinctly from triplet carbenes.
@connie and I thought that we could use Hundt's rule to determine the spin multiplicity of species with electrons in different orbitals, e.g. 1 X u2 p0
type species, but ignore Hundt's rule for electrons in the same orbital, e.g. 1 X u0 p1
.
Ah, so singlet CH2(S) would be 1 C u0 p1
and triplet CH2(T) (usually called CH2) would be 1 C u2 p0
. So if you tried to specify the first as a singlet it would throw an exception? That makes more sense. It's going to take me a while to learn to think in terms of these pairs!
0ad3ba9fbc2b2a7ef7c5c4a2d5a7fa01e4d1d502 adds multiplicity checking in Molecule.findIsomorphism(...). Multiplicity checking was already present in Molecule.isIsomorph(...)
2775356341911d97d2a3b43533bab2f3848b240f are some tests showing the correct behavior.
cb810dc6b3c248d4897e7575e3b65e38e4ed16f4 is an implementation of Hundt's rule for unpaired electrons in distinct orbitals on the same atom, including some tests.
This issue is being automatically marked as stale because it has not received any interaction in the last 90 days. Please leave a comment if this is still a relevant issue, otherwise it will automatically be closed in 30 days.
With the migration to the new adjacency list format, and new atom attributes like
number of unpaired electrons
,number of electron pairs
, andcharge
, and molecule attributes likespin multiplicity
, it's quite possible that the isomorphism methods that compare groups to molecules and molecules to molecules will result in false positive, and false negative matches.I will: