XlogP node generates different values

webbres commented 2 years ago

And why is it calling a copied version of the class named SmartXLogPDescriptor instead of calling the CDK implementation?

johnmay commented 2 years ago

The cdk one now uses smarts as well, I will check back in see if there was a regression but TBH it should really just use JPLogP. Again could be something subtle like the aromatic bond status if coming from molfile or missing expl H which reading the CDK needs to be present.

yeah Xlogp sucks in my opinion

webbres commented 2 years ago

👍 - I was going to expose the new descriptors but at the moment I'm working my way through the 'breaking' changes by switching out to 2.7.

johnmay commented 2 years ago

More investigation needed.

The LogP on PubChem is 0.72, PC also reports XLogP3 as 0.8 so shows the implementation still isn't perfect but 0.47 is closer :-).

For ref: ALogP - 0.7122999999999997 For ref: JPLogP - 0.47

The JP is probably over fitted on XLogP, IRRC he left out CDK ALogP because it was broken at the time.

There is a test case for this though that notes:

        IAtomContainer mol = sp.parseSmiles("O=C(O)c1[nH0]cccc1"); // xlogp training set molecule no688

It was ignored as [nH0] used to be rejected.

johnmay commented 2 years ago

https://github.com/cdk/cdk/commit/1d71596af4d1ff2a44439aa2484c90452e202c93 before https://github.com/cdk/cdk/commit/30738c779854b8bf382bf87465734335255aab26 - 0.47 https://github.com/cdk/cdk/commit/142caf10b66f4fc29adda1bb5a72f54e2593a129 before https://github.com/cdk/cdk/commit/bf386e2c6f1331a646ce0859123137567664404b - 0.47 ... (somewhere here) ... https://github.com/cdk/cdk/commit/f8f22b01a2534d915abdc39f84c0eac4accfe14d - -1.69 https://github.com/cdk/cdk/commit/76d03d74948fcf1de17520def15cf9375dbb7998 - -1.69 9779b93 (before https://github.com/cdk/cdk/commit/76d03d74948fcf1de17520def15cf9375dbb7998) - -1.69

Regression somewhere here: https://github.com/cdk/cdk/compare/f8f22b01a2534d915abdc39f84c0eac4accfe14d..142caf10b66f4fc29adda1bb5a72f54e2593a129

johnmay commented 2 years ago

There were no changes between https://github.com/cdk/cdk/commit/f8f22b01a2534d915abdc39f84c0eac4accfe14d and https://github.com/cdk/cdk/commit/142caf10b66f4fc29adda1bb5a72f54e2593a129 in the XLogP descriptor so it is something external. 2 year gap... so need a better approach to find the issue.

johnmay commented 2 years ago

OK tracked it down to the AminoAcid smarts check

atom[0] 0.00 => -0.40
atom[1] -0.40 => -0.43
atom[2] -0.43 => -0.34
atom[3] -0.34 => -0.17
atom[4] -0.17 => -0.66
atom[5] -0.66 => -0.54
atom[6] -0.54 => -0.20
atom[7] -0.20 => 0.14
atom[8] 0.14 => 0.47
atom[9] 0.47 => 0.47
atom[10] 0.47 => 0.47
atom[11] 0.47 => 0.47
atom[12] 0.47 => 0.47
atom[13] 0.47 => 0.47
 PAIR CHECK => 0.47
 alpha amino acid  => -1.69
 paba  => -1.69
 salicylFlag  => -1.69
 OccO  => -1.69

Hmm I'm not sure that molecule should be an amino acid :-)

johnmay commented 2 years ago

OK found it and can see how it happened. I'll can post a PR but not super happy as in my world this pattern is NOT an amino acid, indeed checking the XLogP paper it shows it should NOT be -1.69. Also shows the implementation is still not quite right and TBH so much of the CDK descriptors are partially finished I wouldn't trust them at all.

For reference:

How the regression happened. We updated the query atom container creation "QueryAtomContainer.createBasic" to include aromaticity status on the atoms (https://github.com/cdk/cdk/commit/b154fead78838a50f404fa1f51edb80c56c0aeb8#diff-6ef2498d7f5198cfda274535ea47ff7b916aa4c6f1a9aa5b50e9e26b3915cde5R50). Previously it would create queries like this "[#6]:[#6]" - since an aromatic bond implies it's atoms here are aromatic it's cleaner to write this as "c:c". This is all fine except the case where the caller then tries to mess around with the pattern after the fact. In XLogP it would convert NCC(=O)O to [#7]-[#6]-[#6](=[#8])-[#8] then loop through and change the NC bond to allow aromatic [#7]-,:[#6]-[#6](=[#8])-[#8].

Here was where it was dicking around with the query:

if ((bondAtom0.getAtomicNumber() == IElement.C && bondAtom1.getAtomicNumber() == IElement.N)
          || (bondAtom0.getAtomicNumber() == IElement.N && bondAtom1.getAtomicNumber() == IElement.C)
          && bond.getOrder() == IBond.Order.SINGLE) {
      aminoAcid.removeBond(bondAtom0, bondAtom1);
      QueryBond qbond = new QueryBond(bondAtom0, bondAtom1, Expr.Type.SINGLE_OR_AROMATIC);
      aminoAcid.addBond(qbond);
      break;
  }

Then later on I can and removed all this special query manipulation and replaced it with a SMARTS pattern. Since this was now a no-op (can't have aromatic bond between aliphatic atoms) I made it:

N-C-C(=O)-[O;X2H1+0,X1H0-]

A more accurate pattern to what it needed would have been before the basicQueryContainer change would be:

[#7]-,:[#6]-C(=O)-[O;X2H1+0,X1H0-]

Indeed changing it to that reverts the value to -1.69 but as I highlight above this is wrong anyways.

johnmay commented 2 years ago

Ref molecule is picolinic acid 689 not 688 (as per test case). What a mess :-)

webbres commented 1 year ago

Thanks for the info John.

If there's already a going to be a some loss in functionality with the upgrade of the KNIME nodes to CDK v2 (ambit/sketcher) maybe there's not all that much utility in bringing some of the nodes with better replacements forward.

Could just bin the XLogP node and create a JPLogP (or just a molecular descriptors node that allows the selection of JPLogP

cdk / nodes4knime-cdk2

XlogP node generates different values #5