ZeroPM-H2020 / pcapi

Access the PubChem API Services
Other
0 stars 0 forks source link

Number of occurrences of components in standardized output? #3

Closed schymane closed 2 years ago

schymane commented 2 years ago

Hi @PaulThiessen,

@RaoulWolf and I were wondering if there was an easy way to provide (or compute) a count of the components (found in the inputs) in the outputs of the standardizer.

e.g.

Example 1:

CC1C(=NNC2=C(C=C(C=C2)[N+](=O)[O-])O)C(=O)N=N1.CC1=CC=C(C=C1)N2C(=O)C(=C(N2)C)N=NC3=C(C(=CC(=C3)S(=O)(=O)O)[N+](=O)[O-])O.[Na+].[Na+].[Cr]

image

SDF via URL

Component IK Count Comment
VYZAMTAEIAYCRO-UHFFFAOYSA-N 1 Cr
KEAYESYHFKHZAL-UHFFFAOYSA-N 2 Na
UTSQTNIYHSAPLB-UHFFFAOYSA-N 1 first organic part
JVSALHPIAUJGRV-UHFFFAOYSA-N 1 second organic part

Example 2

CC1=C(C(=O)N(N1)C2=CC=CC=C2)N=NC3=C(C=C(C=C3)S(=O)(=O)N(C)C)O.CC1=C(C(=O)N(N1)C2=CC=CC=C2)N=NC3=C(C=C(C=C3)S(=O)(=O)N(C)C)O.[Na+].[Cr]

image

SDF via URL

Component IK Count Comment
VYZAMTAEIAYCRO-UHFFFAOYSA-N 1 Cr
KEAYESYHFKHZAL-UHFFFAOYSA-N 1 Na
IRUVDRGLMKHHGU-UHFFFAOYSA-N 2 organic part

In terms of downstream uses ... this would be very useful information to have.

Otherwise any thoughts how we could run this count ourselves efficiently based on existing outputs? (we had some thoughts, but don't yet have an elegant solution)

schymane commented 2 years ago

Snippets to run within R:

library(devtools)
remotes::install_github("RaoulWolf/pcapi")

library(pcapi)

post_to_standardize("c1ccccc1","SMILES")

smiles <- "CC1C(=NNC2=C(C=C(C=C2)[N+](=O)[O-])O)C(=O)N=N1.CC1=CC=C(C=C1)N2C(=O)C(=C(N2)C)N=NC3=C(C(=CC(=C3)S(=O)(=O)O)[N+](=O)[O-])O.[Na+].[Na+].[Cr]"
eg1 <- post_to_standardize(smiles,"SMILES")

smiles <- "CC1=C(C(=O)N(N1)C2=CC=CC=C2)N=NC3=C(C=C(C=C3)S(=O)(=O)N(C)C)O.CC1=C(C(=O)N(N1)C2=CC=CC=C2)N=NC3=C(C=C(C=C3)S(=O)(=O)N(C)C)O.[Na+].[Cr]"
eg2 <- post_to_standardize(smiles,"SMILES")
PaulThiessen commented 2 years ago

Ah hmmm... interesting question. Short answer right now is no, that sort-of-stoichiometry information isn't there, either from the PUG REST standardization operation, or even in our (PubChem's) internal processing. We do not ever attempt to compute/store the stoichiometry of a mixture. If you look at our full compound record, e.g.

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/CID/6247/record/JSON/?record_type=2d&response_type=display

... down at the end there is a count block:

      "count": {
        "heavy_atom": 27,
        "atom_chiral": 0,
        "atom_chiral_def": 0,
        "atom_chiral_undef": 0,
        "bond_chiral": 0,
        "bond_chiral_def": 0,
        "bond_chiral_undef": 0,
        "isotope_atom": 0,
        "covalent_unit": 3,
        "tautomers": 1
      }

... so we know the total number of heavy atoms, and the total number of separate covalent units, but not how many of each different covalent unit. Even in our internal databases we don't have this information.

And because components can be neutralized, you can't even do an exact substructure match of the component within the mixture... So I can't think of a good way to do this with the results of the standardization call.

schymane commented 2 years ago

Yes, this neutralization is what had us stumped too ... (not easy string matching)

@RaoulWolf maybe there's a workaround possible...

For Example 2 above Open Babel does this: (otoh, this babel output is effectively strsplit(smiles,".",fixed=TRUE)[[1]] which is much easier to implement)

Either split first and count, then standardize, or use this post standardizing to do the count?

image

> post_to_standardize("Cc1c(N=Nc2c(O)cc(S(=O)(=O)N(C)C)cc2)c(=O)n(c2ccccc2)[nH]1","SMILES")
  compound_id_type compound_cid
1                1    136002753
                                                                                                                   iupac_inchi
1 InChI=1S/C18H19N5O4S/c1-12-17(18(25)23(21-12)13-7-5-4-6-8-13)20-19-15-10-9-14(11-16(15)24)28(26,27)22(2)3/h4-11,21,24H,1-3H3
               iupac_inchikey                                            openeye_iso_smiles
1 IRUVDRGLMKHHGU-UHFFFAOYSA-N CC1=C(C(=O)N(N1)C2=CC=CC=C2)N=NC3=C(C=C(C=C3)S(=O)(=O)N(C)C)O
              compound_parent
1 IRUVDRGLMKHHGU-UHFFFAOYSA-N
> post_to_standardize("[Na+]","SMILES")
  compound_id_type compound_cid     iupac_inchi              iupac_inchikey openeye_iso_smiles compound_parent
1                1          923 InChI=1S/Na/q+1 FKNQFGJONOIPTF-UHFFFAOYSA-N              [Na+]            <NA>
2                2      5360545     InChI=1S/Na KEAYESYHFKHZAL-UHFFFAOYSA-N               [Na]            <NA>
schymane commented 2 years ago

Here's the output of the strsplit, which is actually better than using babel (doesn't reinterpret):

smiles <- "CC1=C(C(=O)N(N1)C2=CC=CC=C2)N=NC3=C(C=C(C=C3)S(=O)(=O)N(C)C)O.CC1=C(C(=O)N(N1)C2=CC=CC=C2)N=NC3=C(C=C(C=C3)S(=O)(=O)N(C)C)O.[Na+].[Cr]"
eg2 <- post_to_standardize(smiles,"SMILES")
split_smiles <- strsplit(smiles,".",fixed=TRUE)[[1]]
split_smiles

post_to_standardize(split_smiles[1],"SMILES")
post_to_standardize(split_smiles[3],"SMILES")
post_to_standardize(split_smiles[4],"SMILES")
> split_smiles
[1] "CC1=C(C(=O)N(N1)C2=CC=CC=C2)N=NC3=C(C=C(C=C3)S(=O)(=O)N(C)C)O"
[2] "CC1=C(C(=O)N(N1)C2=CC=CC=C2)N=NC3=C(C=C(C=C3)S(=O)(=O)N(C)C)O"
[3] "[Na+]"                                                        
[4] "[Cr]"        

> post_to_standardize(split_smiles[1],"SMILES")
  compound_id_type compound_cid
1                1    136002753
                                                                                                                   iupac_inchi
1 InChI=1S/C18H19N5O4S/c1-12-17(18(25)23(21-12)13-7-5-4-6-8-13)20-19-15-10-9-14(11-16(15)24)28(26,27)22(2)3/h4-11,21,24H,1-3H3
               iupac_inchikey                                            openeye_iso_smiles
1 IRUVDRGLMKHHGU-UHFFFAOYSA-N CC1=C(C(=O)N(N1)C2=CC=CC=C2)N=NC3=C(C=C(C=C3)S(=O)(=O)N(C)C)O
              compound_parent
1 IRUVDRGLMKHHGU-UHFFFAOYSA-N
> post_to_standardize(split_smiles[3],"SMILES")
  compound_id_type compound_cid     iupac_inchi              iupac_inchikey openeye_iso_smiles compound_parent
1                1          923 InChI=1S/Na/q+1 FKNQFGJONOIPTF-UHFFFAOYSA-N              [Na+]            <NA>
2                2      5360545     InChI=1S/Na KEAYESYHFKHZAL-UHFFFAOYSA-N               [Na]            <NA>
> post_to_standardize(split_smiles[4],"SMILES")
  compound_id_type compound_cid iupac_inchi              iupac_inchikey openeye_iso_smiles compound_parent
1                1        23976 InChI=1S/Cr VYZAMTAEIAYCRO-UHFFFAOYSA-N               [Cr]            <NA>
PaulThiessen commented 2 years ago

Yes you could do it that way, if I understand you correctly: split the SMILES string into components on your end and standardize each one separately, to get back the InChI key for each component. But note you would want to use the component/neutralized form of the standardization result (if there is one). Using CID 6247 as an example again

CC(=O)OC1=CC=CC=C1C(=O)[O-].CC(=O)OC1=CC=CC=C1C(=O)[O-].[Ca+2]

If you standardize the organic part, e.g.

https://pubchem.ncbi.nlm.nih.gov/rest/pug/standardize/smiles/JSON?smiles=CC%28%3DO%29OC1%3DCC%3DCC%3DC1C%28%3DO%29%5BO-%5D

... you get back CID 2244 as the neutralized component (the second record in the result).

schymane commented 2 years ago

Yes we definitely want the neutral form in the end ...

More snippets:

smiles <- "CC(=O)OC1=CC=CC=C1C(=O)[O-].CC(=O)OC1=CC=CC=C1C(=O)[O-].[Ca+2]"
test_tab <- post_to_standardize(smiles,"SMILES")

cmpt_smiles <- strsplit(smiles,".",fixed=TRUE)[[1]]
cmpt_smiles
unique_cmpt_smiles <- unique(cmpt_smiles)
length(grep(unique_cmpt_smiles[1],cmpt_smiles,fixed = T))
cmpt_tab1 <- post_to_standardize(unique_cmpt_smiles[1],"SMILES")

length(grep(unique_cmpt_smiles[2],cmpt_smiles,fixed = T))
cmpt_tab2 <- post_to_standardize(unique_cmpt_smiles[2],"SMILES")

merge_tab <- merge(cmpt_tab1,cmpt_tab2,all.x=T,all.y=T)
merge_tab <- merge(merge_tab,test_tab,all.x=T,all.y=T)

output:

> smiles <- "CC(=O)OC1=CC=CC=C1C(=O)[O-].CC(=O)OC1=CC=CC=C1C(=O)[O-].[Ca+2]"
> test_tab <- post_to_standardize(smiles,"SMILES")
> 
> cmpt_smiles <- strsplit(smiles,".",fixed=TRUE)[[1]]
> cmpt_smiles
[1] "CC(=O)OC1=CC=CC=C1C(=O)[O-]" "CC(=O)OC1=CC=CC=C1C(=O)[O-]" "[Ca+2]"                     
> unique_cmpt_smiles <- unique(cmpt_smiles)
> length(grep(unique_cmpt_smiles[1],cmpt_smiles,fixed = T))
[1] 2
> cmpt_tab1 <- post_to_standardize(unique_cmpt_smiles[1],"SMILES")
> 
> length(grep(unique_cmpt_smiles[2],cmpt_smiles,fixed = T))
[1] 1
> cmpt_tab2 <- post_to_standardize(unique_cmpt_smiles[2],"SMILES")
> 
> merge_tab <- merge(cmpt_tab1,cmpt_tab2,all.x=T,all.y=T)
> merge_tab <- merge(merge_tab,test_tab,all.x=T,all.y=T)

and the merged table has the neutral forms as compound ID type 2

image

(I didn't merge the count into the table yet)

PaulThiessen commented 2 years ago

Oh sorry github's link feature doesn't work the way I thought it would...

schymane commented 2 years ago

Oh yes, markdown flavour. If in doubt, paste the same thing in both the square and round brackets. That should always work ;-)

Otherwise it's [display text](url)

You can always edit it and then these comments won't make sense anymore ;-)

RaoulWolf commented 2 years ago

I think the way the idea of splitting the SMILES based on the presence of points (.) is relatively easy to implement, but of course this would result in additional queries towards the PubChem API; not sure if this is ok?

If the input to post_to_standardize() is either an InChI or a MOL/SDF, I currently cannot think of an easy "pre-processing" hack? I might give the SMILES implementation a try later this week...

PaulThiessen commented 2 years ago

this would result in additional queries towards the PubChem API; not sure if this is ok?

Yeah that's fine, but this leads into something else I wanted to bring to your attention if it's not already, and that is our automatic blocking of IP addresses when request rate gets too high. See

https://pubchemdocs.ncbi.nlm.nih.gov/dynamic-request-throttling

Basically the point is that if more than ~5 requests/second are coming in from a single user/IP, we have an automated "throttler" that will start blocking requests. Now, standardization is a relatively CPU-intensive process, so it's not likely you'd hit that limit in a single-threaded process. But for other types of PUG REST queries, it's certainly an issue if you're running a script/app over a larger data set and calling PubChem in a tight loop. I wonder, is it possible to build into your API some sort of timer, maybe a wrapper around the HTTP call, that will ensure that this rate isn't exceeded? This would take the burden off the user of your API, who may not know about this issue, and will wonder why suddenly their app isn't working when they start running big jobs.

Anyway, something to consider in the future, but from our (PubChem's) perspective, this is one of the most common help requests/complaints, from users who don't fully understand just how many actual HTTP requests are being made under the hood of the API.

If the input to post_to_standardize() is either an InChI or a MOL/SDF, I currently cannot think of an easy "pre-processing" hack?

Right, no simple way to split it in that case. Would need a chemical toolkit to parse the InChI/SDF and do this separation of covalently bonded components - probably something fairly basic for such toolkits (CDK etc.) but I have no idea if there's anything like that available in R.

schymane commented 2 years ago

Using Paul's example to stretch to InChI and SDF ...

https://pubchem.ncbi.nlm.nih.gov/compound/6247 (download as SDF)

Babel GUI (=> can be implemented in obabel command line)

image

Or using the InChI (just copied from the PubChem record)

image

So you could use babel to get disconnected SMILES and use that as a consistent starting point? Then you can use SMILES with strsplit (or babel), and add the option for InChI / SDF if babel is available (which would increase your dependencies).

Babel would (probably) be a much less painful option than rCDK, which comes with a very large burden of java dependencies, which is what you are trying to avoid ;-) You'd just need to be able to handle a link to obabel

https://openbabel.org/docs/dev/Command-line_tools/babel.html

image

and image

RaoulWolf commented 2 years ago

this leads into something else I wanted to bring to your attention if it's not already, and that is our automatic blocking of IP addresses when request rate gets too high.

Excellent point! We are very aware of this issue (it's prominently featured in the PUG REST docs) and always use some form of rate-limiting when running queries, ranging between 0.5-5 requests per second. For most of the "heavy lifting", the rate-limiting is set to 2 requests per second, and we haven't had trouble with that setting yet. That being said, we should probably also mention this in the README and provide some examples.

Right, no simple way to split it in that case. Would need a chemical toolkit to parse the InChI/SDF and do this separation of covalently bonded components - probably something fairly basic for such toolkits (CDK etc.) but I have no idea if there's anything like that available in R.

Yes that's what I thought as well. I'm a bit hesitant to include extra functionality from CDK or OpenBabel in this R package, because the focus of {pcapi} is really on interacting with PubChem's API. I'll put it on the list of things to consider ;)

schymane commented 2 years ago

Yes ... @PaulThiessen do you know if CACTVS (or any other web service) offers this (splitting into components)?

That's our backdoor to get RMassBank to pass BioConductor checks without them having obabel on their system ...

schymane commented 2 years ago

... but then if they did, we wouldn't have needed what you've just made possible ;-D

Hmm. Chicken and egg.

PaulThiessen commented 2 years ago

Yeah. It's a little circular but you could do an initial call to PUG REST's standardize with InChI/SDF as input, which will return the SMILES for the full structure (ignore the components), then you can use that SMILES to do the splitting and further component counting.

https://pubchem.ncbi.nlm.nih.gov/rest/pug/standardize/inchi/JSON?inchi=InChI%3D1S%2F2C9H8O4.Ca%2Fc2%2A1-6%2810%2913-8-5-3-2-4-7%288%299%2811%2912%3B%2Fh2%2A2-5H%2C1H3%2C%28H%2C11%2C12%29%3B%2Fq%3B%3B%2B2%2Fp-2

PaulThiessen commented 2 years ago

I could even add an optional flag to the PUG REST call that tells it not to do the full component separation, in order to save a few CPU cycles on our end when components aren't needed.

PaulThiessen commented 2 years ago

Just in case. ;)

https://pubchem.ncbi.nlm.nih.gov/rest/pug/standardize/inchi/JSON?include_components=n&inchi=InChI%3D1S%2F2C9H8O4.Ca%2Fc2%2A1-6%2810%2913-8-5-3-2-4-7%288%299%2811%2912%3B%2Fh2%2A2-5H%2C1H3%2C%28H%2C11%2C12%29%3B%2Fq%3B%3B%2B2%2Fp-2

schymane commented 2 years ago

Awesome, thanks Paul ;-)

RaoulWolf commented 2 years ago

Oh this is nice, thank you Paul! Will take a look into the implementation tomorrow :)

schymane commented 2 years ago

Hmm OK an interesting edge case ... (and we already saw this happening earlier with some of HP's PMT salts). The standardizer can give out e.g. a different InChIKey to the inputs (before splitting into components).

E.g. this: https://pubchem.ncbi.nlm.nih.gov/rest/pug/standardize/smiles/SDF?smiles=[CH2]SC(C(OC(C(C(F)(F)F)(F)F)(F)F)F)(F)F(F)F)(F)F)F)(F)F )

is standardized to a different InChIKey ... (removes the radical). Which in this case is unfortunate as the literature reaction involves the radical species. In other cases it will give us back PubChem's preferred tautomer of species that tautomerize (e.g. warfarin) and will help us map some of them together.

SMILES IK Comment
[CH2]SC(C(OC(C(C(F)(F)F)(F)F)(F)F)F)(F)F SQCDMAHMWJWARM-UHFFFAOYSA-N input
CSC(C(OC(C(C(F)(F)F)(F)F)(F)F)F)(F)F JZORSIUHYLZUKX-UHFFFAOYSA-N output (only one entry)

So ... if we go MOL/SDF/InChI => SMILES and then components, or even if we go SMILES => components, we will have to be aware that this may happen.

I was going to suggest maybe putting the original (input) SMILES as a tag in the top entry ... but this will of course not work for MOL/SDF ;-/

PaulThiessen commented 2 years ago

Argh yeah that's a pain. Unfortunately, our SMILES parser simply does not preserve radicals (which is a bit of a hack in the SMILES language anyway). So right, if you take ethyl radical as a simple example

https://pubchem.ncbi.nlm.nih.gov/rest/pug/standardize/smiles/JSON?smiles=C[CH2]

... gives back ethane, but if you use the InChI from CID 123138, it works as expected:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/standardize/inchi/JSON?inchi=InChI%3D1S%2FC2H5%2Fc1-2%2Fh1H2%2C2H3

Radicals should be supported ok from SDF as well.

schymane commented 2 years ago

Ah ok ... so we could rescue the radicals in our deposition if I deposit an SDF ... good to know! I will try ...

RaoulWolf commented 2 years ago

I played a bit back and forth with the different options and couldn't wrap my head around a "sane" implementation. The basic functionality is available with the post_to_standardize() function, and all the additional steps can be done during post processing if I'm not mistaken. Closing this issue now, but will keep thinking about it!