Bioconductor / GenomeInfoDb

Utilities for manipulating chromosome names, including modifying them to follow a particular naming style
https://bioconductor.org/packages/GenomeInfoDb

BUG: list_ftp_dir() chokes when going via proxy server because it drops the URI protocol #99

Closed. HenrikBengtsson closed this issue 10 months ago.

HenrikBengtsson commented 11 months ago

Issue

> GenomeInfoDb::find_NCBI_assembly_ftp_dir("GCF_000001405.40", assembly_name = "GRCh38.p14")
Error in GenomeInfoDb::find_NCBI_assembly_ftp_dir("GCF_000001405.40",  : 
  unable to find FTP dir for assembly GCF_000001405.40 in
  https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/

Troubleshooting

I can reproduce this on systems where there is an FTP/HTTP/HTTPS proxy in place. Specifically, we have the following environment variables set up:

https_proxy=http://proxy-server:3210/
http_proxy=http://proxy-server:3210/
ftp_proxy=http://proxy-server:3210/
RSYNC_PROXY=proxy-server:3210

The proxy server runs Squid 4.15.
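A quick way to check from within R whether such a proxy configuration is active (plain base R, nothing GenomeInfoDb-specific):

## Proxy-related environment variables (the first three are honored by curl/RCurl)
Sys.getenv(c("https_proxy", "http_proxy", "ftp_proxy", "RSYNC_PROXY"))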

Digging deeper, I notice that the captured output from:

url <- "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/"
bfr <- GenomeInfoDb::list_ftp_dir(url, recache = TRUE)

is corrupted when going via the proxy, whereas it is correct when not going via the proxy. In turn, the reason for this is that list_ftp_dir() drops the URI protocol from the URL at the very beginning, by calling .normarg_ftp_dir();

https://github.com/Bioconductor/GenomeInfoDb/blob/20b9758f9061d247d6f7bcb8217cdf22fbeedaf1/R/list_ftp_dir.R#L91-L94

Because of this, .getURL2();

https://github.com/Bioconductor/GenomeInfoDb/blob/20b9758f9061d247d6f7bcb8217cdf22fbeedaf1/R/list_ftp_dir.R#L101

is called without the https:// protocol, i.e.

url <- "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/"
url_sans_protocol <- GenomeInfoDb:::.normarg_ftp_dir(url)
print(url_sans_protocol)
# [1] "ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/"

When passing the latter to a proxy server, the server cannot perform a fully transparent call, resulting in different content being returned. This affects the GenomeInfoDb::list_ftp_dir() -> GenomeInfoDb:::.getURL2() -> RCurl::getURL() call chain. Here is what we'd expect to see:

$ curl --silent https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
 <head>
  <title>Index of /genomes/all/GCF/000/001/405</title>
 </head>
 <body>
<h1>Index of /genomes/all/GCF/000/001/405</h1>
<pre>Name                         Last modified      Size  <hr><a href="/genomes/all/GCF/000/001/">Parent Directory</a>                                  -   
<a href="GCF_000001405.10_NCBI34/">GCF_000001405.10_NCBI34/</a>     2023-12-05 01:06    -   
<a href="GCF_000001405.11_NCBI35/">GCF_000001405.11_NCBI35/</a>     2023-12-05 04:53    -   
<a href="GCF_000001405.12_NCBI36/">GCF_000001405.12_NCBI36/</a>     2023-12-05 04:53    -   
<a href="GCF_000001405.13_GRCh37/">GCF_000001405.13_GRCh37/</a>     2023-12-05 07:30    -   
...
<a href="GCF_000001405.40_GRCh38.p14/">GCF_000001405.40_GRCh38.p14/</a> 2023-12-05 02:20    -   
<a href="GCF_000001405.8_NCBI33/">GCF_000001405.8_NCBI33/</a>      2023-12-05 01:06    -   
<a href="GCF_000001405.9_NCBI34/">GCF_000001405.9_NCBI34/</a>      2023-12-05 07:30    -   
<hr></pre>
<a href="https://www.hhs.gov/vulnerability-disclosure-policy/index.html">HHS Vulnerability Disclosure</a>

However, without the https:// protocol in the URL, RCurl::getURL() operates on:

$ curl --silent ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><head>
<meta type="copyright" content="Copyright (C) 1996-2021 The Squid Software Foundation and contributors">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Directory: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/</title>
<style type="text/css"><!--
 /*
 * Copyright (C) 1996-2021 The Squid Software Foundation and contributors
 *
 * Squid software is distributed under GPLv2+ license and includes
...
</head><body id=ERR_DIR_LISTING>
<div id="titles">
<h2>Directory: <a href="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/">ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/</a>/</h2>
</div>
<hr>

<div id="content">
<h4>Directory Content:</h4>

<blockquote id="data">
<pre id="dirmsg">
250 CWD command successful</pre>
</blockquote>

<table id="dirlisting" summary="Directory Listing">
<tr>
<th><a href="../"><img border="0" src="/squid-internal-static/icons/silk/arrow_up.png" alt=""></a></th>
<th nowrap="nowrap"><a href="../">Parent Directory</a> (<a href="/">Root Directory</a>)</th>
</tr>

<tr class="entry"><td class="icon"><a href="GCF_000001405.10_NCBI34/"><img border="0" src="/squid-internal-static/icons/silk/folder.png" alt="[DIR] "></a></td><td class="filename"><a href="GCF_000001405.10_NCBI34/">GCF_000001405.10_NCBI34</a></td><td class="date">Dec  5 06:06</td><td class="size"></td><td class="actions"></td></tr>
<tr class="entry"><td class="icon"><a href="GCF_000001405.11_NCBI35/"><img border="0" src="/squid-internal-static/icons/silk/folder.png" alt="[DIR] "></a></td><td class="filename"><a href="GCF_000001405.11_NCBI35/">GCF_000001405.11_NCBI35</a></td><td class="date">Dec  5 09:53</td><td class="size"></td><td class="actions"></td></tr>
...
<div id="footer">
<p>Generated Tue, 05 Dec 2023 22:14:18 GMT by proxy-server.ucsf.edu (squid/4.15)</p>
<!-- ERR_DIR_LISTING -->
</div>
</body></html>
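
The same difference can be reproduced directly from R. Below is a minimal sketch, assuming the proxy environment variables above are set; RCurl::getURL() is the call that GenomeInfoDb:::.getURL2() wraps:

library(RCurl)
url <- "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/"
## With the protocol, the proxy forwards the request transparently
doc_ok <- getURL(url)
## Without the protocol, curl guesses ftp:// from the "ftp." host prefix,
## and Squid then gateways the FTP listing into its own HTML page
doc_bad <- getURL(sub("^https://", "", url))
identical(doc_ok, doc_bad)
## [1] FALSE  (behind the Squid proxy described above)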

Some more details on what the proxy server does when there's no URI protocol:

curl --head ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/
HTTP/1.1 200 Gatewaying
Server: squid/4.15
Mime-Version: 1.0
Date: Tue, 05 Dec 2023 22:47:05 GMT
Content-Type: text/html
X-Cache: MISS from proxy-server.ucsf.edu
X-Cache-Lookup: MISS from proxy-server.ucsf.edu:3210
Via: 1.1 proxy-server.ucsf.edu (squid/4.15)
Connection: keep-alive

Conclusion

I'm pretty sure that the correct way to handle this is to not drop the URI protocol, which is deliberately done in:

https://github.com/Bioconductor/GenomeInfoDb/blob/20b9758f9061d247d6f7bcb8217cdf22fbeedaf1/R/list_ftp_dir.R#L6-L20
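
For reference, the effect of that protocol-dropping step can be paraphrased as follows (an illustration, not the actual source):

## Paraphrase of the protocol-dropping done by .normarg_ftp_dir()
strip_protocol <- function(url) sub("^[[:alpha:]][[:alnum:]+.-]*://", "", url)
strip_protocol("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/")
## [1] "ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/"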

Related

If the Bioconductor check servers also run via a proxy server, the above would explain the current check errors on Bioc release and Bioc devel;

* checking for unstated dependencies in ‘tests’ ... OK
* checking tests ...
  Running ‘run_unitTests.R’
 ERROR
Running the tests in ‘tests/run_unitTests.R’ failed.
Last 13 lines of output:
  GenomeInfoDb RUnit Tests - 21 test functions, 1 error, 0 failures
  ERROR in test_seqlevelsStyle_Seqinfo: Error in find_NCBI_assembly_ftp_dir(assembly_accession, assembly_name = assembly_name) : 
    unable to find FTP dir for assembly GCA_000001515.3 in
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/515/

  Test files with failing tests

     test_seqlevelsStyle.R 
       test_seqlevelsStyle_Seqinfo 

  Error in BiocGenerics:::testPackage("GenomeInfoDb") : 
    unit tests failed for package GenomeInfoDb
  Calls: <Anonymous> -> <Anonymous>
  Execution halted

I suspect that this proxy kerfuffle is also behind the errors reported in issues #78 and #98.

EDIT: Added more details on the proxy server and what the returned HTTP header is when there's no URI protocol.

hpages commented 10 months ago

Thanks for the report.

RCurl::getURL() needs to be called with the ftp:// protocol in order to produce a result that list_ftp_dir() can reliably parse. However, I noticed that dropping ftp:// from the URL didn't seem to make any difference as far as calling RCurl::getURL() was concerned. And since there were other benefits in dropping the protocol part from the URL passed to list_ftp_dir(), I didn't bother to add it back when calling RCurl::getURL().

Anyways, now I'm adding it back. See commit a8730d7c3491d53d4d8d9b42608622bfe1f249ab.

Hopefully that solves the proxy issue. Let me know if it doesn't.

H.

HenrikBengtsson commented 10 months ago

Anyways, now I'm adding it back. See commit https://github.com/Bioconductor/GenomeInfoDb/commit/a8730d7c3491d53d4d8d9b42608622bfe1f249ab.

Thanks. Did you push this to Bioc release too? Need to know which version I should check.

hpages commented 10 months ago

Did you push this to Bioc release too?

Yes. It's version 1.38.2 in release.

HenrikBengtsson commented 10 months ago

GenomeInfoDb 1.38.2 is now available;

> packageVersion("GenomeInfoDb")
[1] '1.38.2'

The problem is still there. Here's a minimal reproducible example:

> trace(GenomeInfoDb::list_ftp_dir, at = 3L, tracer = quote(print(ftp_dir)))
Tracing function "list_ftp_dir" in package "GenomeInfoDb"
[1] "list_ftp_dir"
> body(GenomeInfoDb::list_ftp_dir)
{
    ftp_dir <- .normarg_ftp_dir(ftp_dir)
    {
        .doTrace(print(ftp_dir), "step 3")
        stopifnot(isTRUEorFALSE(subdirs.only))
    }
    stopifnot(isTRUEorFALSE(long.listing))
    stopifnot(isTRUEorFALSE(recache))
    listing <- .cached_ftp_dir_listing[[ftp_dir]]
    if (is.null(listing) || recache) {
        doc <- .getURL2(paste0("ftp://", ftp_dir))
        listing <- strsplit(doc, "\n", fixed = TRUE)[[1L]]
        .cached_ftp_dir_listing[[ftp_dir]] <- listing
    }
    if (subdirs.only) 
        listing <- listing[substr(listing, 1L, 1L) == "d"]
    ans <- .ftp_listing_as_matrix(listing)
    if (!long.listing) 
        ans <- ans[, 9L]
    ans
}
> url <- "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/"
> bfr <- GenomeInfoDb::list_ftp_dir(url, recache = TRUE)
Tracing GenomeInfoDb::list_ftp_dir(url, recache = TRUE) step 3 
[1] "ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/"

because:

> url <- "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/"
> url
[1] "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/"
> GenomeInfoDb:::.normarg_ftp_dir(url)
[1] "ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/"

HenrikBengtsson commented 10 months ago

Oh, I just looked at your fix (commit a8730d7c3491d53d4d8d9b42608622bfe1f249ab); I doubt that using ftp:// regardless of the original protocol is safe.

And since there were other benefits in dropping the protocol part from the URL ...

What are those benefits? I don't understand why dropping the protocol should be necessary.

HenrikBengtsson commented 10 months ago

... another argument is that silently replacing HTTPS with FTP is considered a security issue these days.

hpages commented 10 months ago

Yes, ftp:// must be used here. https:// or http:// are not suitable for the purpose of list_ftp_dir() (they produce a result that is too difficult to parse).

Note that list_ftp_dir() is for internal use and is only used on URLs that actually start with ftp://.

list_ftp_dir() is memoized. Dropping the protocol part of the URL makes it possible to use clean keys for the caching.
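
Roughly, the caching works like this (a simplified sketch, not the actual code):

## Simplified sketch of the memoization in list_ftp_dir()
.cached_listing <- new.env(parent=emptyenv())
get_listing <- function(url) {
    key <- sub("^[[:alpha:]][[:alnum:]+.-]*://", "", url)  # clean cache key
    listing <- .cached_listing[[key]]
    if (is.null(listing)) {
        doc <- RCurl::getURL(paste0("ftp://", key))
        listing <- strsplit(doc, "\n", fixed=TRUE)[[1L]]
        .cached_listing[[key]] <- listing
    }
    listing
}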

hpages commented 10 months ago

So IIUC this doesn't work for you @HenrikBengtsson? (using curl directly at the command line)

hpages@XPS15:~$ curl -s ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/ | head
dr-xr-xr-x   2 ftp      anonymous     4096 Dec 20 06:53 GCF_000001405.10_NCBI34
dr-xr-xr-x   2 ftp      anonymous     4096 Dec 20 17:36 GCF_000001405.11_NCBI35
dr-xr-xr-x   2 ftp      anonymous     4096 Dec 20 17:36 GCF_000001405.12_NCBI36
dr-xr-xr-x   2 ftp      anonymous     4096 Dec 19 11:32 GCF_000001405.13_GRCh37
dr-xr-xr-x   2 ftp      anonymous     4096 Dec 19 11:32 GCF_000001405.14_GRCh37.p2
dr-xr-xr-x   2 ftp      anonymous     4096 Dec 19 11:33 GCF_000001405.17_GRCh37.p5
dr-xr-xr-x   2 ftp      anonymous     4096 Dec 19 14:33 GCF_000001405.21_GRCh37.p9
dr-xr-xr-x   2 ftp      anonymous     4096 Dec 19 14:33 GCF_000001405.22_GRCh37.p10
dr-xr-xr-x   2 ftp      anonymous     4096 Dec 19 14:33 GCF_000001405.23_GRCh37.p11
dr-xr-xr-x   2 ftp      anonymous     4096 Dec 19 14:33 GCF_000001405.24_GRCh37.p12
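
That raw listing is easy to parse: each line splits into nine whitespace-separated fields, with the entry name in field 9 (a rough illustration, assuming no whitespace in the names; cf. ans[, 9L] in the list_ftp_dir() body shown earlier):

line <- "dr-xr-xr-x   2 ftp      anonymous     4096 Dec 20 06:53 GCF_000001405.10_NCBI34"
fields <- do.call(rbind, strsplit(line, "[[:space:]]+"))
fields[, 9L]
## [1] "GCF_000001405.10_NCBI34"
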
HenrikBengtsson commented 10 months ago

So IIUC this doesn't work for you @HenrikBengtsson?

Ah... so this issue has now split into two different issues. I'll comment on the failing ftp:// protocol in this reply. Yes, correct, I get an HTML document when trying to access that URL;

$ curl -s ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/ | head
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><head>
<meta type="copyright" content="Copyright (C) 1996-2021 The Squid Software Foundation and contributors">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Directory: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/</title>
<style type="text/css"><!--
 /*
 * Copyright (C) 1996-2021 The Squid Software Foundation and contributors
 *
 * Squid software is distributed under GPLv2+ license and includes

$ curl -s ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/ | tail

</table>
</div>

<hr>
<div id="footer">
<p>Generated Thu, 21 Dec 2023 23:46:05 GMT by sec1.wynton.ucsf.edu (squid/4.15)</p>
<!-- ERR_DIR_LISTING -->
</div>
</body></html>

One can definitely blame the FTP proxy server for this one. I'll investigate this a bit more, but so far I found the following at https://wiki.squid-cache.org/SquidFaq/InnerWorkings#can-i-make-my-regular-ftp-clients-use-a-squid-cache:

Q. Can I make my regular FTP clients use a Squid cache?

A. Nope, its not possible. Squid only accepts HTTP requests. It speaks FTP on the server-side, but not on the client-side.

So, it could very well be that Squid doesn't support transparent FTP ("download") traffic. Now, I believe Squid is a fairly common HTTP/HTTPS/FTP/RSYNC proxy, which means other Bioconductor users out there might be affected by this (if this is truly not supported by Squid).

HenrikBengtsson commented 10 months ago

Yes, ftp:// must be used here. https:// or http:// are not suitable for the purpose of list_ftp_dir() (they produce a result that is too difficult to parse).

I see. But then I think passing protocols other than ftp:// should be an error.
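
Something along these lines (a hypothetical check, not in GenomeInfoDb):

## Hypothetical validation: fail early instead of silently rewriting the protocol
.check_ftp_url <- function(url) {
    if (!grepl("^ftp://", url))
        stop("list_ftp_dir() requires an ftp:// URL, got: ", sQuote(url))
    url
}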

Note that list_ftp_dir() is for internal use and is only used on URLs that actually start with ftp://.

I don't think that's correct. See my original comment;

> GenomeInfoDb::find_NCBI_assembly_ftp_dir("GCF_000001405.40", assembly_name = "GRCh38.p14")
Error in GenomeInfoDb::find_NCBI_assembly_ftp_dir("GCF_000001405.40",  : 
  unable to find FTP dir for assembly GCF_000001405.40 in
  https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/

> traceback()
2: stop(wmsg(msg))
1: GenomeInfoDb::find_NCBI_assembly_ftp_dir("GCF_000001405.40", 
       assembly_name = "GRCh38.p14")

Note that GenomeInfoDb::find_NCBI_assembly_ftp_dir() calls:

https://github.com/Bioconductor/GenomeInfoDb/blob/fe81f0d423f07efabb6e61d6c57620171a6b75b7/R/NCBI-utils.R#L300-L304

where

> GenomeInfoDb:::.NCBI_ALL_ASSEMBLY_FTP_DIR
[1] "https://ftp.ncbi.nlm.nih.gov/genomes/all/"

list_ftp_dir() is memoized. Dropping the protocol part of the URL makes it possible to use clean keys for the caching.

Maybe the following can achieve the same thing without dropping the URL protocol in the call;

diff --git a/R/list_ftp_dir.R b/R/list_ftp_dir.R
index 99c41d1..ac98f90 100644
--- a/R/list_ftp_dir.R
+++ b/R/list_ftp_dir.R
@@ -91,16 +91,16 @@
 list_ftp_dir <- function(ftp_dir, subdirs.only=FALSE, long.listing=FALSE,
                                   recache=FALSE)
 {
-    ftp_dir <- .normarg_ftp_dir(ftp_dir)
+    ftp_dir_sans_protocol <- .normarg_ftp_dir(ftp_dir)
     stopifnot(isTRUEorFALSE(subdirs.only))
     stopifnot(isTRUEorFALSE(long.listing))
     stopifnot(isTRUEorFALSE(recache))

-    listing <- .cached_ftp_dir_listing[[ftp_dir]]
+    listing <- .cached_ftp_dir_listing[[ftp_dir_sans_protocol]]
     if (is.null(listing) || recache) {
-        doc <- .getURL2(paste0("ftp://", ftp_dir))
+        doc <- .getURL2(ftp_dir)
         listing <- strsplit(doc, "\n", fixed=TRUE)[[1L]]
-        .cached_ftp_dir_listing[[ftp_dir]] <- listing
+        .cached_ftp_dir_listing[[ftp_dir_sans_protocol]] <- listing
     }

     if (subdirs.only)

hpages commented 10 months ago

Note that list_ftp_dir() is for internal use and is only used on URLs that actually start with ftp://.

I don't think that's correct. See my original comment;

Fair enough. Forgot about that. I should probably replace

.NCBI_ALL_ASSEMBLY_FTP_DIR <- "https://ftp.ncbi.nlm.nih.gov/genomes/all/"

with

.NCBI_ALL_ASSEMBLY_FTP_DIR <- "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/"

I meant to say that list_ftp_dir() is meant to be used on URLs that actually start with ftp://. If not, instead of raising an error, it will replace the protocol part with ftp://. It's just a convenient feature that I like to have for my own personal satisfaction :wink:

Anyways, I'm not sure where this conversation is going. We need to use the FTP protocol to retrieve content from an FTP server. That's what list_ftp_dir() does. What can we do to help if you are behind a proxy that is messing up FTP requests? What do you propose?

HenrikBengtsson commented 10 months ago

... I should probably replace ...

Yes, I think that would help clarify things. To recap, I came here because I thought the protocol was HTTPS, based on the error message. If the error message had shown FTP, I would probably have gone down another troubleshooting path.

I would also make sure to call RCurl::getURL() with the URL protocol still there. That feels more robust, clarifies the code, and helps with debugging.

Anyways, I'm not sure where this conversation is going.

I think the above closes it.

What can we do to help if you are behind a proxy that is messing up FTP requests? What do you propose?

There are a few options outside this particular issue. For a start, if it is, as I suspect, the case that Squid doesn't support "transparent" FTP listings, that could be mentioned/documented (in a help page). I need to verify that. In the long run, it could be worth supporting parsing of the HTTPS output, especially since the world is slowly moving away from FTP, but also because I suspect there are other users out there behind Squid proxies.
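
A rough sketch of what parsing the HTTPS output could look like (hypothetical and fragile by design; the href extraction is my own illustration):

## Hypothetical fallback: scrape subdirectory links from the HTTPS index page
list_https_dir <- function(url) {
    doc <- paste(readLines(url, warn=FALSE), collapse="\n")
    hrefs <- regmatches(doc, gregexpr('href="[^"]+"', doc))[[1L]]
    hrefs <- sub('^href="', "", sub('"$', "", hrefs))
    hrefs[grepl("^[^/:]+/$", hrefs)]  # keep relative subdir links only
}
list_https_dir("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/")
## [1] "GCF_000001405.10_NCBI34/" "GCF_000001405.11_NCBI35/" ... (per the index shown earlier)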

hpages commented 10 months ago

Happy to take a PR that clarifies the FTP + Squid gotcha since I don't have access to that setup.

In the long run, it could be worth supporting parsing of the HTTPS output, especially since the world is slowly moving away from FTP

I thought of that. But Ensembl and NCBI still officially serve data via FTP servers so we are stuck with that for now. They might run an HTTP server on their FTP server, for convenient browsing of the data, but that's just icing on the cake. Going thru this instead of FTP would be VERY fragile since the HTTP server can add all kinds of decorations to the served pages compared to the raw FTP dir listing. In particular, this means that any time they change their HTTP setup there would be a risk that they break list_ftp_dir().

All this to say that I don't think that FTP access to their FTP server is going anywhere soon. But who knows...