Open webbres opened 2 years ago
The cdk one now uses smarts as well, I will check back in see if there was a regression but TBH it should really just use JPLogP. Again could be something subtle like the aromatic bond status if coming from molfile or missing expl H which reading the CDK needs to be present.
yeah Xlogp sucks in my opinion
👍 - I was going to expose the new descriptors but at the moment I'm working my way through the 'breaking' changes by switching out to 2.7.
More investigation needed.
The LogP on PubChem is 0.72, PC also reports XLogP3 as 0.8 so shows the implementation still isn't perfect but 0.47 is closer :-).
For ref: ALogP - 0.7122999999999997 For ref: JPLogP - 0.47
The JP is probably over fitted on XLogP, IRRC he left out CDK ALogP because it was broken at the time.
There is a test case for this though that notes:
IAtomContainer mol = sp.parseSmiles("O=C(O)c1[nH0]cccc1"); // xlogp training set molecule no688
It was ignored as [nH0] used to be rejected.
https://github.com/cdk/cdk/commit/1d71596af4d1ff2a44439aa2484c90452e202c93 before https://github.com/cdk/cdk/commit/30738c779854b8bf382bf87465734335255aab26 - 0.47 https://github.com/cdk/cdk/commit/142caf10b66f4fc29adda1bb5a72f54e2593a129 before https://github.com/cdk/cdk/commit/bf386e2c6f1331a646ce0859123137567664404b - 0.47 ... (somewhere here) ... https://github.com/cdk/cdk/commit/f8f22b01a2534d915abdc39f84c0eac4accfe14d - -1.69 https://github.com/cdk/cdk/commit/76d03d74948fcf1de17520def15cf9375dbb7998 - -1.69 9779b93 (before https://github.com/cdk/cdk/commit/76d03d74948fcf1de17520def15cf9375dbb7998) - -1.69
Regression somewhere here: https://github.com/cdk/cdk/compare/f8f22b01a2534d915abdc39f84c0eac4accfe14d..142caf10b66f4fc29adda1bb5a72f54e2593a129
There were no changes between https://github.com/cdk/cdk/commit/f8f22b01a2534d915abdc39f84c0eac4accfe14d and https://github.com/cdk/cdk/commit/142caf10b66f4fc29adda1bb5a72f54e2593a129 in the XLogP descriptor so it is something external. 2 year gap... so need a better approach to find the issue.
OK tracked it down to the AminoAcid smarts check
atom[0] 0.00 => -0.40
atom[1] -0.40 => -0.43
atom[2] -0.43 => -0.34
atom[3] -0.34 => -0.17
atom[4] -0.17 => -0.66
atom[5] -0.66 => -0.54
atom[6] -0.54 => -0.20
atom[7] -0.20 => 0.14
atom[8] 0.14 => 0.47
atom[9] 0.47 => 0.47
atom[10] 0.47 => 0.47
atom[11] 0.47 => 0.47
atom[12] 0.47 => 0.47
atom[13] 0.47 => 0.47
PAIR CHECK => 0.47
alpha amino acid => -1.69
paba => -1.69
salicylFlag => -1.69
OccO => -1.69
Hmm I'm not sure that molecule should be an amino acid :-)
OK found it and can see how it happened. I'll can post a PR but not super happy as in my world this pattern is NOT an amino acid, indeed checking the XLogP paper it shows it should NOT be -1.69. Also shows the implementation is still not quite right and TBH so much of the CDK descriptors are partially finished I wouldn't trust them at all.
For reference:
How the regression happened. We updated the query atom container creation "QueryAtomContainer.createBasic" to include aromaticity status on the atoms (https://github.com/cdk/cdk/commit/b154fead78838a50f404fa1f51edb80c56c0aeb8#diff-6ef2498d7f5198cfda274535ea47ff7b916aa4c6f1a9aa5b50e9e26b3915cde5R50). Previously it would create queries like this "[#6]:[#6]" - since an aromatic bond implies it's atoms here are aromatic it's cleaner to write this as "c:c". This is all fine except the case where the caller then tries to mess around with the pattern after the fact. In XLogP it would convert NCC(=O)O
to [#7]-[#6]-[#6](=[#8])-[#8]
then loop through and change the NC bond to allow aromatic [#7]-,:[#6]-[#6](=[#8])-[#8]
.
Here was where it was dicking around with the query:
if ((bondAtom0.getAtomicNumber() == IElement.C && bondAtom1.getAtomicNumber() == IElement.N)
|| (bondAtom0.getAtomicNumber() == IElement.N && bondAtom1.getAtomicNumber() == IElement.C)
&& bond.getOrder() == IBond.Order.SINGLE) {
aminoAcid.removeBond(bondAtom0, bondAtom1);
QueryBond qbond = new QueryBond(bondAtom0, bondAtom1, Expr.Type.SINGLE_OR_AROMATIC);
aminoAcid.addBond(qbond);
break;
}
Then later on I can and removed all this special query manipulation and replaced it with a SMARTS pattern. Since this was now a no-op (can't have aromatic bond between aliphatic atoms) I made it:
N-C-C(=O)-[O;X2H1+0,X1H0-]
A more accurate pattern to what it needed would have been before the basicQueryContainer change would be:
[#7]-,:[#6]-C(=O)-[O;X2H1+0,X1H0-]
Indeed changing it to that reverts the value to -1.69 but as I highlight above this is wrong anyways.
Ref molecule is picolinic acid 689 not 688 (as per test case). What a mess :-)
Thanks for the info John.
If there's already a going to be a some loss in functionality with the upgrade of the KNIME nodes to CDK v2 (ambit/sketcher) maybe there's not all that much utility in bringing some of the nodes with better replacements forward.
Could just bin the XLogP node and create a JPLogP (or just a molecular descriptors node that allows the selection of JPLogP
And why is it calling a copied version of the class named SmartXLogPDescriptor instead of calling the CDK implementation?