TUCAN-nest / TUCAN

A molecular identifier and descriptor for all domains of chemistry.
https://tucan-nest.github.io
GNU General Public License v3.0
22 stars 5 forks source link

Using non-unique cycle finding as invariant #52

Closed johnmay closed 2 years ago

johnmay commented 2 years ago

I had not heard of the CycleCut method before and it appears to be a non-unique cycle basis. Finding cycles using a breath-first search you essentially end up with multiple shortest paths and you get one dependant on the input order. Here are some investigations and explanation - at the very least I think it needed a clearer explanation in the manuscript.

A simple to follow case is the following:

C1(CC2)CCC(CCC3CCC2CC3)CC1

image

I get 3 rings here, 2x size 6 and 1x size 12. The size 12 one will depend on the order the BFS traversed the nodes/edges.

image

>>> _find_atomic_cycles(graph_from_smiles("C1(CC2)CCC(CCC3CCC2CC3)CC1"))
[{0, 3, 4, 5, 14, 15}, {8, 9, 10, 11, 12, 13}, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}]
>>> _cycle_memberships(m)
{0: {0, 1, 3}, 1: {0, 3}, 2: {0, 3}, 3: {0, 1, 3}, 4: {0, 1, 3}, 5: {0, 1, 3}, 6: {0, 3}, 7: {0, 3}, 8: {0, 2, 3}, 9: {0, 2, 3}, 10: {0, 2, 3}, 11: {0, 2, 3}, 12: {0, 2}, 13: {0, 2}, 14: {0, 1}, 15: {0, 1}}
0,3,4,5,8,9,10,11 belong to 2 cycles
1,2,6,7,12,13,13,15 belong to 1 cycle

Using the print_molecule to get the graph invariants and presuming the indexes match the SMILES I can use CXSMILES to display as follows:

C1(CC2)CCC(CCC3CCC2CC3)CC1 |$_AV:3,1;3;3;3,1;3,1;3,1;3;3;3,2;3,2;3,2;3,2;2;2;1;1$|

To me this looks the invariant is splitting atoms which are symmetric - i.e. it's not invariant? Beyond the non-uniqueness using the cycle id like this means you also introduce an artificial split.

Useful reading on ring/cycle algorithms: https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-6-3 https://www.combinatorics.org/ojs/index.php/eljc/article/view/v4i1r9/pdf

JanCBrammer commented 2 years ago

Thanks for the issue!

My (for now unverified) assumption (aka the mother of all fuck-ups) is that CycleCut only identifies chordless cycles as in https://en.wikipedia.org/wiki/Induced_path#atomic_cycles: 159926902-155305ed-bb8c-4abd-b0e3-5a82e7811397

the invariant is splitting atoms which are symmetric

Yes, I was wondering about that too. How does splitting automorphism groups influence the correctness of the canonicalization? From the demo notebook:

Strictly speaking, fundamental cycles are not invariants, since the cycle labels depend on the node labels. Fundamental cycles split automorphism groups. For example, the ring atoms in bicyclohexane (TODO: show) would be automorphic without the fundamental cycle invariant. TODO: What are the implications of this for correctness of canonicalization (make question more precise)?

johnmay commented 2 years ago

Yes, I was wondering about that too. How does splitting automorphism groups influence the correctness of the canonicalization?

I'm pretty sure it means it fundamentally won't work - would need to think about it to construct a counter example or just run a shuffle test (randomising atom order) over a sufficiently large set. It is a little suspicious that "rigid graph benchmark" (I'd not seen that before but is cool - thanks!) seemed to be trivial, NAUTY/SAUCY etc are something else (as in hyper efficient) and a benchmark that defeats them but is easily passed with a simple cycle invariant seems odd.

In the RDKit canonical paper (below) there is a strong invariant which I think NAUTY also likes to use as well (in that case it's configurable though). I think some people call it the "degree vector" but essentially it's a BFS from each node and you write down how many nodes are distance 0, 1, 2, 3, n. away. IIRC there is a slightly stronger variant in the RDKit paper. You will still likely go fast on the rigorous graph benchmark since I presume the NetworkX (was it?) is not backtracking and building the automorphism set like NAUTY etc. You have the right idea and disregarding stereochemistry it's likely possible to handle the majority (perhaps all) chemical graphs with a strong enough graph invariant and avoid the expensive backtracking of NAUTY/InChI - the balance is you can't "know" you have a unique answer just that you "probably" do.

https://pubs.acs.org/doi/10.1021/acs.jcim.5b00543 p.s. There stereochemistry part is wrong in the RDKit paper - it splits atoms that are identical :-).

johnmay commented 2 years ago

One thing that is confusing is I would expect something simple like:

C1CCCCC1CCC(C)C1CCCCC1
C1CCCCC1C(C)CCC1CCCCC1

to generate different ID's - since the invariant code uses the cycleid (which depends on the order of the cycle in the molecule) but this is not the case:

>>> serialize_molecule(graph_from_smiles("C1CCCCC1C(C)CCC1CCCCC1"))
'C16/1-15/2-3/2-14/3-4/4-5/5-6/6-14/7-8/7-15/8-16/9-10/9-16/10-11/11-12/12-13/13-16/14-15'
>>> serialize_molecule(graph_from_smiles("C1CCCCC1CCC(C)C1CCCCC1"))
'C16/1-15/2-3/2-14/3-4/4-5/5-6/6-14/7-8/7-14/8-15/9-10/9-16/10-11/11-12/12-13/13-16/15-16'

I suspected that the encoding "gets lucky" with a starting point, we can force it to start somewhere else:

C1CC([Li])CCC1C(C)CCC1CCC([Li])CC1 NDINSTUPYQAZRK-UHFFFAOYSA-N
C1CC([Li])CCC1CCC(C)C1CCC([Li])CC1 NDINSTUPYQAZRK-UHFFFAOYSA-N
>>> serialize_molecule(graph_from_smiles("C1CC([Li])CCC1C(C)CCC1CCC([Li])CC1"))
'C16Li2/1-14/2-15/3-17/4-5/4-16/5-14/6-7/6-14/7-16/8-9/8-17/9-18/10-11/10-18/11-15/12-13/12-15/13-18/16-17'
>>> serialize_molecule(graph_from_smiles("C1CC([Li])CCC1CCC(C)C1CCC([Li])CC1"))
'C16Li2/1-14/2-15/3-17/4-5/4-16/5-14/6-7/6-14/7-16/8-9/8-16/9-17/10-11/10-18/11-15/12-13/12-15/13-18/17-18'
johnmay commented 2 years ago

Sorry my mistake - that first one is different! "/14-15" and "/15-16" at the end. Was hard to spot with all the numbers.

JanCBrammer commented 2 years ago

You're right, it seems like my implementation of CycleCut does return cycles with chords (see cycle 3). It should only return what's here labeled 1 and 2. Will investigate...

m = graph_from_smiles("C1(CC2)CCC(CCC3CCC2CC3)CC1")

cycle_memberships = nx.get_node_attributes(m, "cycle_membership")
cycle_memberships_str = dict([k, str(v - {0})] for k, v in cycle_memberships.items())

draw_molecules([m], [""], labels=cycle_memberships_str)

output

Will come back to this next week, thanks for the input so far!

johnmay commented 2 years ago

Just to be clear there are two separate (but related issues).

  1. The cycle id depends on the input order and identical molecules ordered differently get different identifiers. There is no mixup with "chordless" cycles in this example.

    C1CCCCC1CCC(C)C1CCCCC1 -> C16/1-15/2-3/2-14/3-4/4-5/5-6/6-14/7-8/7-15/8-16/9-10/9-16/10-11/11-12/12-13/13-16/14-15
    C1CCCCC1C(C)CCC1CCCCC1 -> C16/1-15/2-3/2-14/3-4/4-5/5-6/6-14/7-8/7-14/8-15/9-10/9-16/10-11/11-12/12-13/13-16/15-16
  2. The cycle basis is non-unique, paths generate by the cycles depends on the order generated. As you say "chordless" but I'm not sure that concept extends beyond a single edge and a non-unique way.

JanCBrammer commented 2 years ago

@johnmay, I think I got to the bottom of the issue.

The cycle ID is indeed unsuitable as an invariant for the two reasons you mention above.

The tests did not catch these problems since the tests were wrong. Turns out, your suspicion of the passing test on the rigid graph benchmark is quite justified ;) I felt uneasy about it (the passing tests) as well...

Here's the problem: In the shuffle test, the node labels are permuted correctly. However, the cycle IDs get computed and assigned during the instantiation of the original graph (i.e., the graph whose nodes are shuffled). That is, during the shuffle test, only the node labels were permuted while the cycle IDs associated with the nodes stayed the same. The cycle IDs stayed the same since the graph wasn't reinstantiated after shuffling (this is fixed now: a40ead36ed39748b5cf58ab3be29030bd351f841). Therefore, the canonicalization always used the same cycle IDs irrespective of the shuffling.

Back to the drawing board I guess. I'll replace the cycle invariant with the "degree vector" you're suggesting above or the one proposed in https://doi.org/10.1186/s13321-020-00453-4.

Thanks again for digging into this, and for the (counter-)examples above, they led me to the problem. I needed a second pair of eyes on this!

JanCBrammer commented 2 years ago

Closing with 77f67dcaf45a92f856a285dc386ffbb1cf321c41.

rapodaca commented 2 years ago

Which approach does the closing commit take? Degree vector, the one proposed by Krotko, or something else?

JanCBrammer commented 2 years ago

@rapodaca, none at the moment. I'm planning on implementing the two invariants Krotko proposes (he calls them "ring invariant" and "distance invariant"). Opened #53 to make it clear that this is work in progress.