Open webbres opened 2 years ago
They changed at some point but should be more correct. It might just be whether aromaticity was called or not.
For sure one that used to be skipped was (*).(*)
for multiple components
Thanks John, I'll update the workflow test
For reference these are the failing structures (3 of 19) in the existing test suite for the Fingerprinter node with MACCS selected:
I've updated the testflow to use the new data but this is a change that is likely worth mentioning in the release nodes/documentation when the new nodes are released that this is a known change.
Which is which?
Testing directly in CDK for the last molecule I get:
{104, 124, 144, 161, 162, 164}
(left hand side)
SmilesParser smipar = new SmilesParser(SilentChemObjectBuilder.getInstance());
IFingerprinter fpr = new MACCSFingerprinter();
IAtomContainer m = sp.parseSmiles("C12=CC=CC3=CC=C4C(C5=C6C2=CC=CC6=CC=C5C=C4)=C13");
System.err.println(fpr.getBitFingerprint(m).asBitSet());
Bit 100 is an 8 membered ring:
101 [R]1@*@*@*@*@*@*@*@1 0 |8M Ring (id 101 => bit 100)
Which works as expected:
IAtomContainer mol1 = parser.parseSmiles("*1*******1");
{43, 100, 119, 123, 129, 136, 164}
Here was the change: https://github.com/cdk/cdk/commit/1654c24fa9a00a00d32ed97ca23d85a0430fd360
Previously the pattern was trying to find rings of size 8,9,10,11,12,13, or 14. This was taken from RDKit
std::unique_ptr<RDKit::ROMol> bit_101 = std::unique_ptr<
RDKit::ROMol>(RDKit::SmartsToMol(
"[$([R]1@[R]@[R]@[R]@[R]@[R]@[R]@[R]@1),$([R]1@[R]@[R]@[R]@[R]@[R]@[R]@["
"R]@[R]@1),$([R]1@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@1),$([R]1@[R]@[R]@["
"R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@1),$([R]1@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]"
"@[R]@[R]@[R]@1),$([R]1@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@"
"1),$([R]1@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@[R]@1)]"));
We actually short circuit the ring bits and use cycle finding to match them (i.e. Hanser's AllRingsFinder). Here was the explanation for 8 membered and not ≥8 membered ring:
https://github.com/cdk/cdk/commit/df70cd4664b590c8ceb9442d9037a3b44f40e32a
At some point they were reoptimized but I think the original definitions were lost to the sands of times.
https://pubs.acs.org/doi/10.1021/ci010132r
Indeed I just tested a recent BIOVIA Direct install and it provides the following:
SQL> select molkeys(mol('C12=CC=CC3=CC=C4C(C5=C6C2=CC=CC6=CC=C5C=C4)=C13'), 'DEC DELIM=,') from dual;
MOLKEYS(MOL('C12=CC=CC3=CC=C4C(C5=C6C2=CC=CC6=CC=C5C=C4)=C13'),'DECDELIM=,')
--------------------------------------------------------------------------------
132,136,221,285,287,290,325,336,381,392,399,404,412,413,447,452,463,466,467,472,
490,501,510,519,548,549,553,554,556,562,568,571,573,576,578,593,600,601,608,609,
616,623,624,633,641,647,657,658,661,665,671,672,674,678,688,689,701,703,713,720,
721,726,732,734,743,745,746,750,751,759,761,768,769,776,788,791,793,795,796,798,
812,817,823,826,827,841,848,880,889,891,893,895,902,921,924,925,929,932
I think this perhaps suggest they have switched to a Daylight style path based FP. But we have no way of knowing. Some more support for the 8M ring only is in T J O'Donnel Book:
"Design and Use of Relational Databases in Chemistry"
And Roger's MACCS implementation for ChemAxon, it was on their forum but it seems you may need to log in to view it.
Here's the differences with more detail:
SMILES | CDK 1.5 | CDK 2.7 |
---|---|---|
Clc1ccc2Sc3ccccc3N(CCCN4CCN(C)CC4)c2c1 | cardinality = 47 {35, 58, 74, 78, 79, 80, 84, 85, 86, 87, 92, 97, 99, 100, 102, 104, 106, 110, 111, 114, 115, 117, 119, 120, 121, 124, 127, 128, 133, 134, 136, 137, 141, 143, 144, 146, 147, 149, 152, 154, 155, 157, 159, 160, 161, 162, 164} | cardinality = 46 {35, 58, 74, 78, 79, 80, 84, 85, 86, 87, 92, 97, 99, 102, 104, 106, 110, 111, 114, 115, 117, 119, 120, 121, 124, 127, 128, 133, 134, 136, 137, 141, 143, 144, 146, 147, 149, 152, 154, 155, 157, 159, 160, 161, 162, 164} |
CCN(CC)CCCN(C1Cc2ccccc2C1)c3ccccc3 | cardinality = 36 {79, 84, 85, 95, 99, 100, 104, 107, 110, 111, 113, 115, 117, 121, 124, 127, 128, 132, 134, 137, 141, 143, 144, 146, 147, 148, 149, 152, 154, 155, 157, 159, 160, 161, 162, 164} | cardinality = 35 {79, 84, 85, 95, 99, 104, 107, 110, 111, 113, 115, 117, 121, 124, 127, 128, 132, 134, 137, 141, 143, 144, 146, 147, 148, 149, 152, 154, 155, 157, 159, 160, 161, 162, 164} |
c1cc2ccc3ccc4ccc5cccc6c(c1)c2c3c4c56 | cardinality = 9 {100, 104, 124, 143, 144, 149, 161, 162, 164} | cardinality = 8 {104, 124, 143, 144, 149, 161, 162, 164} |
I think each scenario is bit 100
is no longer set which from the above seems like the new expected behaviour :)
TODO: check CDK history to see if the MACCS keys are expected to be different