CDK-R / cdkr

Integrating R and the CDK
https://cdk-r.github.io/cdkr/
42 stars 27 forks source link

Correct atom configuration prior fingerprint calculation #94

Open bachi55 opened 5 years ago

bachi55 commented 5 years ago

Hello,

I was wondering about the correct usage of do.typing(), do.aromaticity() and do.isotopes() before calculating the fingerprints of a molecules parsed from its SMILES using parse.smiles().

Let me go through a few examples.

MACCS The documentation of get.fingerprints contains:

smiles <- c('CCC', 'CCN', 'CCN(C)(C)', 'c1ccccc1Cc1ccccc1','C1CCC1CC(CN(C)(C))CC(=O)CC')
mols <- parse.smiles(smiles)
fps <- lapply(mols, get.fingerprint, type='maccs')

There is no information, that the aromaticity might need to be perceived first as otherwise some SMARTS are not properly matched (?). In the CDK tests for the MACCSFingerprinter at least there is atom-typing and aromaticity-detection done.

Pubchem For this fingerprinter the CDK tests indicate that implicit hydrogens should be converted to explicit ones, i.e. convert.implicit.to.explicit.

Klekota and Roth Here the CDK tests indicate, that no "additional" function, e.g. typing or aromaticity detection, needs to be applied.

So I could continue the list of different fingerprinters, which seem to expect different "modifications" done to the parsed molecular structure. Some fingerprinters seems to take care about it internally (within the class).

I would like to know, how others are dealing with this issue? Which transformations you perform, when you calculate different fingerprints (I could believe that problem continues for descriptors as well)? Am I overthinking the problem here? Can a "to much modifications" (I called do.typing if it is not needed, e.g.) harm my calculation, i.e. wrong fingerprints?

I believe the problem should somehow be solved in CDK, but how can we maybe in the rcdk side make the documentation more precise?

Best regards,

Eric

schymane commented 5 years ago

@bachi55 this is a great point and it's a problem we are encountering beyond the fingerprints (look at e.g. #73 and the massfuncs branch where we are trying to capture this for the mass calculations). Historically in RMassBank we patched this by doing the typing etc our side in wrapper functions to ensure consistency. What concerns me is that they are also order dependent (change the order of the commands and you get different results) and this is should not happen (or if it does, then users need guidance what order is "right").

We've had conversation (email) with @rajarshi about updating documentation to roxygen anyway and not only to they need roxygenising but also a revamp to deliver additional clarity. If we start to sum up the time I'm investing into clarifying these aspects to colleagues via email, it'd start adding up to the time it would take to upgrade the docs and the unit tests to ensure we're doing it right. Just need this magic thing called time ;-)

@rajarshi any idea how best to coordinate? I am hiring and hope to soon have either own capacity or team member capacity to potentially contribute to a joint effort as it's impacting what we need to achieve. My logical start point would be on the masses but we're still stuck with the NPEs ...

@bachi55 my approach for the masses was to build unit tests for all the "edge cases" I would expect and I would use those also to build the documentation to clarify (in examples) what to do and what not to do ... once we have them as unit tests we can then run them and interact with the CDK guys to check rcdk is doing what it should ...

bachi55 commented 5 years ago

Hei Emma,

thanks for your comments. It's "good" to hear, that the problem is somehow known.

I am at the moment using your approach to wrap rcdk, when calculating the fingerprints to have at least consistent computations. And, I will update my code so that it can test for some "edge cases". Do you have a list of those for fingerprints as well?

Actually, I think it would be good, if the CDK framework would directly take care about the correct configurations, depending on the fingerprint used.

Best regards,

Eric

schymane commented 5 years ago

I haven’t looked into this for fingerprints, but was answering some queries about them only a few days ago ;-) and some of the edge cases I pulled out for the mass functions may apply – I’d say you need cases that cover aromaticity, explicit/implicit Hs, tautomers and probably also isotopes and I have examples that cover several bits of that in massfuncs: https://github.com/CDK-R/cdkr/blob/massfuncs/rcdk/inst/unitTests/runit.mass.R Make sure you also capture some cases that should distinguish the different features of the fingerprints .. @rajarshi may have better ideas there …

rajarshi commented 5 years ago

So, yes, this is a pain, but also a bit surprising. I had thought that the SMARTS parser would have performed the necessary configuration, but apparently not.

library(rcdk)

m1 <- parse.smiles("c1ccccc1")[[1]] # m1 will have all atoms aromatic
m2 <- parse.smiles("C1=CC=CC=C1")[[1]] # m2 will have no atoms aromatic

fp1 <- get.fingerprint(m1, type='maccs')
fp2 <- get.fingerprint(m2, type='maccs')
fp1@bits == fp2@bits

The last equality check will fail. A quick test suggests that the sequence

do.aromaticity()
do.typing()
do.isotopes()

will appropriately configure the molecule for all fingerprints. AFAICT, "over configuration" is not an issue. So applying the above sequence should not create an issue

But not that this doesn't change implicit H to explicit, yet the Pubchem FP seems to work OK (gives the same input for the same molecule input in different forms). So I guess the Pubchem FP is handling the conversion.

m1 <- parse.smiles("c1ccccc1C")[[1]] # m1 will have all atoms aromatic
m2 <- parse.smiles("[CH]1=[CH][CH]=[CH][CH]=C1C([H])([H])([H])")[[1]] # m2 will have no atoms aromatic

do.aromaticity(m1)
do.typing(m1)
do.isotopes(m1)
do.aromaticity(m2)
do.typing(m2)
do.isotopes(m2)

length(get.atoms(m1)) == 7 # No explicit H
length(get.atoms(m2)) == 10 # All explicit H

fp1 <- get.fingerprint(m1, type='pubchem')
fp2 <- get.fingerprint(m2, type='pubchem')
fp1@bits == fp2@bits

I agree with both of you that updating the docs would go far to clarifying usage and best practice. I had started on roxygenizing the docs, but now that the holidays are over, it's slowed down.

I would agree that if a fingerprinter requires config, it could do the configuration but at the cost of redoing something that has already been done (but the latest updates may avoid the performance hit by skipping config if already done). Probably a good idea to check on cdk-user

FWIW, the CDKDescUI tool explicitly performs all configuration when computing fingerprints.

bachi55 commented 5 years ago

@rajarshi thank you for your comments!

I played around with your example an I think for PubChem fingerprints one still needs to apply convert.implicit.to.explicit to the molecule.

In your example I believe m2's SMILES is including explicit hydrogens whereas m1 does not. What I would expect is, that the PubChem bits 0 (>= 4 H) and 1 (>= 8 H) are on (TRUE), but they are not:

m1 <- parse.smiles("c1ccccc1C")[[1]] # m1 will have all atoms aromatic
m2 <- parse.smiles("[CH]1=[CH][CH]=[CH][CH]=C1C([H])([H])([H])")[[1]] # m2 will have no atoms aromatic

do.aromaticity(m1)
do.typing(m1)
do.isotopes(m1)

do.aromaticity(m2)
do.typing(m2)
do.isotopes(m2)

fp1 <- get.fingerprint(m1, type='pubchem')
fp2 <- get.fingerprint(m2, type='pubchem')

print(paste("m1: >= 4 H:", 1 %in% fp1@bits))  # PubChem bits are zero based but R not
print(paste("m2: >= 4 H:", 1 %in% fp2@bits))

print(paste("m1: >= 8 H:", 2 %in% fp1@bits)) 
print(paste("m2: >= 8 H:", 2 %in% fp2@bits))

Output:

[1] "m1: >= 4 H: FALSE"
[1] "m2: >= 4 H: FALSE"
[1] "m1: >= 8 H: FALSE"
[1] "m2: >= 8 H: FALSE"

If I add convert.implicit.to.explicit:

m1 <- parse.smiles("c1ccccc1C")[[1]] # m1 will have all atoms aromatic
m2 <- parse.smiles("[CH]1=[CH][CH]=[CH][CH]=C1C([H])([H])([H])")[[1]] # m2 will have no atoms aromatic

do.aromaticity(m1)
do.typing(m1)
do.isotopes(m1)
convert.implicit.to.explicit(m1)

do.aromaticity(m2)
do.typing(m2)
do.isotopes(m2)
convert.implicit.to.explicit(m2)

fp1 <- get.fingerprint(m1, type='pubchem')
fp2 <- get.fingerprint(m2, type='pubchem')

print(paste("m1: >= 4 H:", 1 %in% fp1@bits))  # PubChem bits are zero based but R not
print(paste("m2: >= 4 H:", 1 %in% fp2@bits))

print(paste("m1: >= 8 H:", 2 %in% fp1@bits)) 
print(paste("m2: >= 8 H:", 2 %in% fp2@bits))

We get the "desired" output:

[1] "m1: >= 4 H: TRUE"
[1] "m2: >= 4 H: TRUE"
[1] "m1: >= 8 H: TRUE"
[1] "m2: >= 8 H: TRUE"

At the moment I use the CDK unittests and the source code as information-source to decide about the correct configuration. However, as you said, if over-configuration is not an issue, it is more "save" to simply apply it to all molecules by default.

Best regards,

Eric