Accio / KEGGgraph

The KEGGgraph package to parse KEGG pathways in R into graph objects
13 stars 3 forks source link

http URL breaks retrieveKGML(), please use https instead #9

Closed hpages closed 1 year ago

hpages commented 1 year ago

Hi @Accio ,

Trying to run the retrieveKGML example from the man page (?retrieveKGML) produces the following error on Windows and Mac, where the wget command is usually not available:

library(KEGGgraph)
tmp <- tempfile()
retrieveKGML(pathwayid='00010', organism='cel', destfile=tmp, method="wget")
# Error in download.file(kgml, destfile = destfile, method = method, ...) :
#   'wget' call had nonzero exit status
# In addition: Warning message:
# In system(paste("wget", paste(extra, collapse = " "), shQuote(url),  :
#   'wget' not found

The error occurs when retrieveKGML() calls download.file() internally to download the kgml file:

kgml <- "http://rest.kegg.jp/get/cel00010/kgml"
download.file(kgml, destfile=tmp, method="wget")  # FAILS (because the 'wget' command is not on the machine)

Unfortunately all the other download.file() methods have their own problems: they either don't work on kgml, or are deprecated, or introduce other requirements.

Method "auto" (the default)

> download.file(kgml, destfile=tmp, method="auto")
trying URL 'http://rest.kegg.jp/get/cel00010/kgml'
downloaded 41 KB

Error in download.file(kgml, destfile = tmp, method = "auto") :
  download from 'http://rest.kegg.jp/get/cel00010/kgml' failed
In addition: Warning message:
In download.file(kgml, destfile = tmp, method = "auto") :
  URL 'https://rest.kegg.jp/get/cel00010/kgml': status was 'Failure when receiving data from the peer'

Method "internal"

> download.file(kgml, destfile=tmp, method="internal")
Error in download.file(kgml, destfile = tmp, method = "internal") :
  the 'internal' method for http:// URLs is defunct

Method "libcurl"

> download.file(kgml, destfile=tmp, method="libcurl")
trying URL 'http://rest.kegg.jp/get/cel00010/kgml'
downloaded 41 KB

Error in download.file(kgml, destfile = tmp, method = "libcurl") :
  download from 'http://rest.kegg.jp/get/cel00010/kgml' failed
In addition: Warning message:
In download.file(kgml, destfile = tmp, method = "libcurl") :
  URL 'https://rest.kegg.jp/get/cel00010/kgml': status was 'Failure when receiving data from the peer'

Method "wininet" (deprecated)

> download.file(kgml, destfile=tmp, method="wininet")
trying URL 'http://rest.kegg.jp/get/cel00010/kgml'
Content type 'text/xml' length unknown
downloaded 41 KB

Warning message:
In download.file(kgml, destfile = tmp, method = "wininet") :
  the 'wininet' method is deprecated for http:// and https:// URLs

Method "curl" (works but requires the curl command)

> download.file(kgml, destfile=tmp, method="curl")
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

But luckily, all these problems can be avoided by replacing http with https in the URL. Then download.file() works out-of-the-box on all platforms, and without the need to specify the method argument:

> download.file("https://rest.kegg.jp/get/cel00010/kgml", destfile=tmp)
trying URL 'https://rest.kegg.jp/get/cel00010/kgml'
downloaded 41 KB

Would you please consider making that change?

Note that the KEGGgraph package still manages to pass BUILD and CHECK on the Bioconductor build machines, by putting the problematic example inside a \dontrun statement. However this is contrary to Bioconductor guidelines because it only hides the problem without addressing it. For example, other packages that use retrieveKGML() (e.g. the GenomicRanges package) are unfortunately broken on Windows and Mac. See: https://bioconductor.org/checkResults/3.17/bioc-LATEST/GenomicRanges/

Thank you,

H.

sessionInfo():

R Under development (unstable) (2022-11-30 r83393 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server x64 (build 20348)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] KEGGgraph_1.59.0

loaded via a namespace (and not attached):
[1] compiler_4.3.0      RCurl_1.98-1.9      grid_4.3.0
[4] Rgraphviz_2.43.0    BiocGenerics_0.45.0 bitops_1.0-7
[7] stats4_4.3.0        XML_3.99-0.13       graph_1.77.1
Accio commented 1 year ago

Issued fixed in 11a5723, thanks for report, dear @hpages! Please let me know if I can be of further assistance.

Best, D.

hpages commented 1 year ago

Hi @Accio,

Thanks for the fast response.

I see that "wget" is still the default download method for retrieveKGML() and parseKGMLexpandMaps(), and that you are still explicitely calling retrieveKGML() with method="wget" in the examples. However, please note that we're not planning to install the wget command on our Windows or Mac builders, and your users shouldn't need to do that either. The good news is that, as I reported above, the wget requirement is no longer necessary if you use the https URL.

In other words, with the https URL, the downloads should work out-of-the-box without the need for the user to take any special action. So your functions should just use the same default method as utils::download.file(), which is "auto".

Once you've made those changes, the following instructions (in getKGMLurl.Rd) will no longer be needed:

  For Windows users, it is necessary to download and install \code{wget} program
  (\url{https://gnuwin32.sourceforge.net/packages/wget.htm}) to use the
  \code{wget} method to download files. Sometimes it may be necessary to
  modify searching path to add GnuWin32 folder (where \code{wget}
  execution file is located) and re-install R to make
  \code{wget} work.

  Some user may experience difficulty
  of retrieving KGML files when the download method is set to
  \sQuote{auto}. In this case setting the method to \sQuote{wget} may
  solve the problem (thanks to the report by Gilbert Feng).

I noticed that you've not resynced the KEGGgraph repo on GitHub with the KEGGgraph repo on git.bioconductor.org for a while. Please remember that it's important to keep the 2 repos in sync. It's particularly important to do so before you make changes to the package. Also it's important that you always push your changes to git.bioconductor.org in addition to GitHub.

Finally, the use of method="wget" also breaks things in BioC 3.16, so you would need to apply the same changes to the REALEASE_3_16 version of the package.

Don't hesitate to ask on the bioc-devel mailing list if you have questions or concerns about this.

Thanks for your contribution to Bioconductor.

Cheers, H.

hpages commented 1 year ago

Dear @Accio,

Can you please take care of this? (see above). The current default download methods for retrieveKGML() and parseKGMLexpandMaps() break GenomicRanges on Windows and Mac, which in turn blocks hundreds of packages from propagating in BioC 3.17: https://bioconductor.org/checkResults/3.17/bioc-LATEST/GenomicRanges

Thank you, H.

Accio commented 1 year ago

Dear @hpages,

I fixed the method, updated the doc, and synced with upstream master and release 3_16. In case there are still questions, please kindly let me know.

You are also welcome to contribute with PR. Thank you for pointing me to the problems and for the reminder.

Best, D.

hpages commented 1 year ago

Thanks @Accio.

Note that you still have one wget leftover here: https://github.com/Accio/KEGGgraph/blob/0a12babc8d637eb73b6cd2b0a2fcf1e0f19add15/vignettes/KEGGgraph.Rnw#L86-L92

This code chunk actually doesn't work:

library(KEGGgraph)

tmp <- tempfile()

pName <- "p53 signaling pathway"

data(KEGGPATHNAME2ID)
# Warning message:
# In data(KEGGPATHNAME2ID) : data set ‘KEGGPATHNAME2ID’ not found

pId <- mget(pName, KEGGPATHNAME2ID)[[1]]
# Error in mget(pName, KEGGPATHNAME2ID) : 
#   object 'KEGGPATHNAME2ID' not found

More generally speaking, why do so many code chunks use eval=TRUE? Bioconductor guidelines is to avoid unevaluated code chunks. Having them evaluated allows our daily builds to automatically catch and report these issues.

Thanks again, H.

Accio commented 1 year ago

Hi @hpages, thanks for pointing that. I took the chance to update the vignettes (see 0a12babc8d...a49eb9aefdf9a). Now essential codes of KEGGgraph are evaluated.

Best regards, D.

hpages commented 1 year ago

Thanks!