cdk / nodes4knime-cdk2

KNIMES nodes using the CDK library.
1 stars 2 forks source link

Fingerprints node generates different MACCS keys #6

Open webbres opened 2 years ago

webbres commented 2 years ago

TODO: check CDK history to see if the MACCS keys are expected to be different

johnmay commented 2 years ago

They changed at some point but should be more correct. It might just be whether aromaticity was called or not.

johnmay commented 2 years ago

For sure one that used to be skipped was (*).(*) for multiple components

webbres commented 2 years ago

Thanks John, I'll update the workflow test

webbres commented 1 year ago

For reference these are the failing structures (3 of 19) in the existing test suite for the Fingerprinter node with MACCS selected:

image

I've updated the testflow to use the new data but this is a change that is likely worth mentioning in the release nodes/documentation when the new nodes are released that this is a known change.

johnmay commented 1 year ago

Which is which?

Testing directly in CDK for the last molecule I get:

{104, 124, 144, 161, 162, 164}

(left hand side)

johnmay commented 1 year ago
SmilesParser   smipar = new SmilesParser(SilentChemObjectBuilder.getInstance());
IFingerprinter fpr    = new MACCSFingerprinter();
IAtomContainer m      = sp.parseSmiles("C12=CC=CC3=CC=C4C(C5=C6C2=CC=CC6=CC=C5C=C4)=C13");
System.err.println(fpr.getBitFingerprint(m).asBitSet());
johnmay commented 1 year ago

Bit 100 is an 8 membered ring:

101 [R]1@*@*@*@*@*@*@*@1 0 |8M Ring (id 101 => bit 100)

Which works as expected:

IAtomContainer mol1 = parser.parseSmiles("*1*******1");
{43, 100, 119, 123, 129, 136, 164}

Here was the change: https://github.com/cdk/cdk/commit/1654c24fa9a00a00d32ed97ca23d85a0430fd360

Previously the pattern was trying to find rings of size 8,9,10,11,12,13, or 14. This was taken from RDKit

 std::unique_ptr<RDKit::ROMol> bit_101 = std::unique_ptr<
      RDKit::ROMol>(RDKit::SmartsToMol(
      "[$([R]1@[R]@[R]@[R]@[R]@[R]@[R]@[R]@1),$([R]1@[R]@[R]@[R]@[R]@[R]@[R]@["
      "R]@[R]@1),$([R]1@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@1),$([R]1@[R]@[R]@["
      "R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@1),$([R]1@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]"
      "@[R]@[R]@[R]@1),$([R]1@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@"
      "1),$([R]1@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@1)]"));

We actually short circuit the ring bits and use cycle finding to match them (i.e. Hanser's AllRingsFinder). Here was the explanation for 8 membered and not ≥8 membered ring:

https://github.com/cdk/cdk/commit/df70cd4664b590c8ceb9442d9037a3b44f40e32a

At some point they were reoptimized but I think the original definitions were lost to the sands of times.

https://pubs.acs.org/doi/10.1021/ci010132r

Indeed I just tested a recent BIOVIA Direct install and it provides the following:

SQL> select molkeys(mol('C12=CC=CC3=CC=C4C(C5=C6C2=CC=CC6=CC=C5C=C4)=C13'), 'DEC DELIM=,') from dual;

MOLKEYS(MOL('C12=CC=CC3=CC=C4C(C5=C6C2=CC=CC6=CC=C5C=C4)=C13'),'DECDELIM=,')
--------------------------------------------------------------------------------
132,136,221,285,287,290,325,336,381,392,399,404,412,413,447,452,463,466,467,472,
490,501,510,519,548,549,553,554,556,562,568,571,573,576,578,593,600,601,608,609,
616,623,624,633,641,647,657,658,661,665,671,672,674,678,688,689,701,703,713,720,
721,726,732,734,743,745,746,750,751,759,761,768,769,776,788,791,793,795,796,798,
812,817,823,826,827,841,848,880,889,891,893,895,902,921,924,925,929,932

I think this perhaps suggest they have switched to a Daylight style path based FP. But we have no way of knowing. Some more support for the 8M ring only is in T J O'Donnel Book:

"Design and Use of Relational Databases in Chemistry"

And Roger's MACCS implementation for ChemAxon, it was on their forum but it seems you may need to log in to view it.

webbres commented 1 year ago

Here's the differences with more detail:

SMILES CDK 1.5 CDK 2.7
Clc1ccc2Sc3ccccc3N(CCCN4CCN(C)CC4)c2c1 cardinality = 47 {35, 58, 74, 78, 79, 80, 84, 85, 86, 87, 92, 97, 99, 100, 102, 104, 106, 110, 111, 114, 115, 117, 119, 120, 121, 124, 127, 128, 133, 134, 136, 137, 141, 143, 144, 146, 147, 149, 152, 154, 155, 157, 159, 160, 161, 162, 164} cardinality = 46 {35, 58, 74, 78, 79, 80, 84, 85, 86, 87, 92, 97, 99, 102, 104, 106, 110, 111, 114, 115, 117, 119, 120, 121, 124, 127, 128, 133, 134, 136, 137, 141, 143, 144, 146, 147, 149, 152, 154, 155, 157, 159, 160, 161, 162, 164}
CCN(CC)CCCN(C1Cc2ccccc2C1)c3ccccc3 cardinality = 36 {79, 84, 85, 95, 99, 100, 104, 107, 110, 111, 113, 115, 117, 121, 124, 127, 128, 132, 134, 137, 141, 143, 144, 146, 147, 148, 149, 152, 154, 155, 157, 159, 160, 161, 162, 164} cardinality = 35 {79, 84, 85, 95, 99, 104, 107, 110, 111, 113, 115, 117, 121, 124, 127, 128, 132, 134, 137, 141, 143, 144, 146, 147, 148, 149, 152, 154, 155, 157, 159, 160, 161, 162, 164}
c1cc2ccc3ccc4ccc5cccc6c(c1)c2c3c4c56 cardinality = 9 {100, 104, 124, 143, 144, 149, 161, 162, 164} cardinality = 8 {104, 124, 143, 144, 149, 161, 162, 164}

I think each scenario is bit 100 is no longer set which from the above seems like the new expected behaviour :)