MassBank / RMassBank

Playground for experiments on the official http://bioconductor.org/packages/devel/bioc/html/RMassBank.html

gatherData not working properly #101

Closed ermueller closed 9 years ago

ermueller commented 9 years ago

So, gatherData in createMassBank.R has some major issues as of right now. I noticed this when playing around with the glucolesquerellin data in the GUI. By extension, the mbWorkflow is broken for certain compounds.

I figure this issue is for Michele and me to solve.

I guess this happened because of CTS reworking everything?

The problems: line 401:

    if(length(infos) == 0)

infos should technically be a list of all kinds of synonyms, but when CTS doesn't know the compound, it no longer returns an empty list. Instead it returns the string "Sorry, we couldn't find any matching results".

So this should be:

    if(infos == "Sorry, we couldn't find any matching results")

Which is kind of strange, but it works, I guess.

The much bigger problem is from line 420 on: if the name Glucolesquerellin is used, then this happens:

> infos <- getCtsKey(dbname, from="Chemical Name", to="InChIKey")
> # heuristically determine best InChI key to use:
> # use the one with most common Structure part,
> # and use the one with no stereochemistry and neutral charge if possible
> keys <- as.data.frame(infos)
> subkeys <- strsplit(infos, ',')
> df <- do.call(rbind,subkeys)
> keys$structure <- df[,1]
> keys$stereo <- df[,2]
Error in df[, 2] : subscript out of bounds
> keys$charge <- df[,3]
Error in df[, 3] : subscript out of bounds

I reckon this is because the return structure of infos from getCtsKey has changed. In this case infos is just a single character vector containing the InChIKey as a string, so no wonder it doesn't work: splitting by commas does practically nothing, and from there on everything stops working. I'm kind of at a loss on how to resolve this.

uchem-massbank commented 9 years ago

We now also have a build report error. Is it related? Is it also related to #88?

    mbWorkflow: Step 1. Gather info from CTS

    Error: processing vignette 'RMassBank.Rnw' failed with diagnostics:
    chunk 22
    Error in strsplit(infos, ",") : non-character argument
    Execution halted


schymane commented 9 years ago

I just ran the demo data without pre-loading an infolist and it works fine, so I'm not sure why we are getting a build error. @ermueller: Michele also said he couldn't reproduce your problem, iirc.

    > mb <- mbWorkflow(mb, infolist_path="./Narcotics_infolist.csv")
    mbWorkflow: Step 1. Gather info from CTS
    2818: smiles
    2819: smiles
    ...
    mbWorkflow: Step 2. Export infolist (if required)
    The file ./Narcotics_infolist.csv was generated with new compound information. Please check and edit the table, and add it to your infolist folder.
    Warning message:
    In gatherData(id) : Compound ID 2824: no IUPAC name could be identified.

    > sessionInfo()
    R version 3.0.2 (2013-09-25)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
    [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
    [5] LC_TIME=English_Australia.1252

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] RMassBankData_1.0.0 RMassBank_1.5.2.5   Rcpp_0.11.2

    loaded via a namespace (and not attached):
     [1] Biobase_2.22.0     BiocGenerics_0.8.0 codetools_0.2-8    fingerprint_3.5.2
     [5] iterators_1.0.7    mzR_1.8.1          parallel_3.0.2     png_0.1-7
     [9] rcdk_3.2.9.1       rcdklibs_1.5.8.3   RCurl_1.95-4.3     rJava_0.9-6
    [13] rjson_0.2.14       tools_3.0.2        XML_3.98-1.1       yaml_2.1.13

schymane commented 9 years ago

The build report is fine now; CTS must have been down temporarily.

ermueller commented 9 years ago

The error was with the plant dataset, Glucolesquerellin. If you load up settings and the compound list and then do gatherData(2184) it throws an error.

ermueller commented 9 years ago

So, uh, Cactus is strange:

    > smiles
    [1] "CSCCCCCCC(=NOS(=O)=O)S[C@H]1C@@HO"
    > getCactus(smiles, 'stdinchikey')
    [1] "InChIKey=ZAKICGFSIJSCSF-LPUQOGTASA-N"
    > getCactus("ZAKICGFSIJSCSF-LPUQOGTASA-N", 'chemspider_id')
    [1] NA

Apparently it can't resolve InChIKeys anymore: I just put in the exact same InChIKey it gave me for the SMILES of Glucolesquerellin, and it returns NA. Any ideas?

schymane commented 9 years ago

So, another result of this is that we now seem to get long and ugly warnings:

    Warning messages:
    1: In if (infos == "Sorry, we couldn't find any matching results") dataUsed <- "dbname" else dataUsed <- "smiles" :
      the condition has length > 1 and only the first element will be used

Can we suppress that somehow? This is with RMassBank_1.9.2.1 (Erik's version).
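The warning appears because == compares element by element: when CTS returns several entries, the condition handed to if is a logical vector rather than a single value. A minimal sketch of a length-one check follows. ctsFoundNothing is a hypothetical helper name, not an RMassBank function, and the sketch assumes infos is either empty, the "no results" message, or a vector or list of hits. It is not necessarily what the package ended up doing.

```r
# Hypothetical helper (not part of RMassBank): TRUE when CTS found nothing.
# Assumption: infos is empty, the "no results" message, or a vector/list of hits.
ctsFoundNothing <- function(infos) {
  noMatch <- "Sorry, we couldn't find any matching results"
  # length() covers the old behaviour (empty result); identical() compares the
  # whole object at once, so the result is always a single TRUE or FALSE.
  length(infos) == 0 || identical(as.character(unlist(infos)), noMatch)
}

# Made-up multi-element result; the branches mirror those in the warning above.
infos <- c("name one", "name two")
dataUsed <- if (ctsFoundNothing(infos)) "dbname" else "smiles"
```

Because the condition is always of length one, the warning does not need to be suppressed; it simply no longer occurs.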

schymane commented 9 years ago

So, looking at spectra.Glucolesquerellin:

    > mb <- mbWorkflow(mb, steps=1:8)
    mbWorkflow: Step 1. Gather info from CTS
    Error in df[, 2] : subscript out of bounds

But CACTUS appears to be completely down, so this is a bad time to debug. I get NA for everything and can't reach the website either. However, regarding what you say above:

    > smiles
    [1] "CSCCCCCCC(=NOS(=O)=O)S[C@H]1C@@HO"

this SMILES should be CSCCCCCCC(=NOS(=O)(=O)O)SC1C(C(C(C(O1)CO)O)O)O. That is what I get from running findSmiles(2184) and also from PubChem.

schymane commented 9 years ago

Looking at the next bit, CTS:

The much bigger problem is from line 420 on: if the name Glucolesquerellin is used, then this happens:

    > gs <- getCtsKey(findName(2184), from="Chemical Name", to="InChIKey")
    > keys <- as.data.frame(gs)
    > subkeys <- strsplit(gs, ',')
    > df <- do.call(rbind,subkeys)
    > keys$structure <- df[,1]
    > keys$stereo <- df[,2]
    Error in df[, 2] : subscript out of bounds

So, I can confirm that, but if we look at subkeys...

    > subkeys <- strsplit(gs, ',')
    > subkeys
    [[1]]
    [1] "ZAKICGFSIJSCSF-LPUQOGTASA-N"

Well, it's not splitting anything, which is not surprising: there is no "," in an InChIKey. Splitting on "-" instead:

    > subkeys <- strsplit(gs, '-')
    > subkeys
    [[1]]
    [1] "ZAKICGFSIJSCSF" "LPUQOGTASA"     "N"

and then the rest works!

    > df <- do.call(rbind,subkeys)
    > df
         [,1]             [,2]         [,3]
    [1,] "ZAKICGFSIJSCSF" "LPUQOGTASA" "N"
    > keys$structure <- df[,1]
    > keys$stereo <- df[,2]
    > keys$structure
    [1] "ZAKICGFSIJSCSF"
    > keys$stereo
    [1] "LPUQOGTASA"

So, replace the comma in the split keys with a dash and this should work? I am not quite sure how this escaped us so long, we must have not had to fall back on CTS for a while, or the comma snuck in...
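The dash split can be sketched as a small helper. This is only a sketch, not the fix committed to RMassBank: splitInchiKeys and pickBestInchiKey are hypothetical names, the InChIKey= prefix handling is based on the Cactus output quoted earlier in this thread, and the selection rules only approximate the heuristic described in the code comments (most common structure block, then no stereochemistry and neutral charge). It assumes the standard InChIKey convention that a second block of UHFFFAOYSA means no stereochemistry and a final N means neutral.

```r
# Hypothetical helper: split InChIKeys into structure / stereo / charge blocks.
splitInchiKeys <- function(keys) {
  # Cactus prefixes its answer with "InChIKey=", so strip that if present.
  keys <- sub("^InChIKey=", "", as.character(keys))
  parts <- strsplit(keys, "-", fixed = TRUE)
  # Keep only well-formed keys so the column indexing below cannot go
  # out of bounds.
  ok <- vapply(parts, length, integer(1)) == 3
  blocks <- matrix(as.character(unlist(parts[ok])), ncol = 3, byrow = TRUE)
  data.frame(
    key = keys[ok],
    structure = blocks[, 1],
    stereo = blocks[, 2],
    charge = blocks[, 3],
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: approximate the "best key" heuristic from the comments.
pickBestInchiKey <- function(keys) {
  parts <- splitInchiKeys(keys)
  if (nrow(parts) == 0) return(NA_character_)
  # 1. Keep the keys whose structure block occurs most often.
  freq <- table(parts$structure)
  parts <- parts[parts$structure == names(freq)[which.max(freq)], , drop = FALSE]
  # 2. Among those, prefer no stereochemistry and neutral charge.
  score <- (parts$stereo == "UHFFFAOYSA") + (parts$charge == "N")
  parts$key[which.max(score)]
}

pickBestInchiKey("ZAKICGFSIJSCSF-LPUQOGTASA-N")
```

With a single key, as in the Glucolesquerellin case, the helper simply returns that key. With several candidates, ties are resolved in favour of the first one, which may or may not be what the workflow wants.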

ermueller commented 9 years ago

Removed the warning message

1: In if (infos == "Sorry, we couldn't find any matching results") dataUsed <- "dbname" else dataUsed <- "smiles" :

in 4756ea4

ermueller commented 9 years ago

A possible fix has been added as a comment in 1c3ba7b041e5e806140e0cd79759dd733eeb1aa5. I'm not sure it works every time, so for now it is only a comment.

sneumann commented 9 years ago

This fix can be enabled and thoroughly tested with the "christmas dataset"