Thanks for the report.

`RCurl::getURL()` needs to be called with the `ftp://` protocol in order to produce a result that `list_ftp_dir()` can reliably parse. However, I noticed that dropping `ftp://` from the URL didn't seem to make any difference as far as calling `RCurl::getURL()` was concerned. And since there were other benefits in dropping the protocol part from the URL passed to `list_ftp_dir()`, I didn't bother to add it back when calling `RCurl::getURL()`.
Anyways, now I'm adding it back. See commit https://github.com/Bioconductor/GenomeInfoDb/commit/a8730d7c3491d53d4d8d9b42608622bfe1f249ab.
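In other words, the download step now looks roughly like this (a sketch; the `paste0()` trick is what the commit does):

```r
library(RCurl)
## The protocol-less form produced by .normarg_ftp_dir():
ftp_dir <- "ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/"
## Re-add "ftp://" right before the download, so getURL() always
## receives a full FTP URL:
doc <- getURL(paste0("ftp://", ftp_dir))
```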
Hopefully that solves the proxy issue. Let me know if it doesn't.
H.
Thanks. Did you push this to Bioc release too? Need to know which version I should check.
> Did you push this to Bioc release too?
Yes. It's version 1.38.2 in release.
GenomeInfoDb 1.38.2 is now available:

```r
> packageVersion("GenomeInfoDb")
[1] '1.38.2'
```
The problem is still there. Here's a minimal reproducible example:
```r
> trace(GenomeInfoDb::list_ftp_dir, at = 3L, tracer = quote(print(ftp_dir)))
Tracing function "list_ftp_dir" in package "GenomeInfoDb"
[1] "list_ftp_dir"
> body(GenomeInfoDb::list_ftp_dir)
{
    ftp_dir <- .normarg_ftp_dir(ftp_dir)
    {
        .doTrace(print(ftp_dir), "step 3")
        stopifnot(isTRUEorFALSE(subdirs.only))
    }
    stopifnot(isTRUEorFALSE(long.listing))
    stopifnot(isTRUEorFALSE(recache))
    listing <- .cached_ftp_dir_listing[[ftp_dir]]
    if (is.null(listing) || recache) {
        doc <- .getURL2(paste0("ftp://", ftp_dir))
        listing <- strsplit(doc, "\n", fixed = TRUE)[[1L]]
        .cached_ftp_dir_listing[[ftp_dir]] <- listing
    }
    if (subdirs.only)
        listing <- listing[substr(listing, 1L, 1L) == "d"]
    ans <- .ftp_listing_as_matrix(listing)
    if (!long.listing)
        ans <- ans[, 9L]
    ans
}
> url <- "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/"
> bfr <- GenomeInfoDb::list_ftp_dir(url, recache = TRUE)
Tracing GenomeInfoDb::list_ftp_dir(url, recache = TRUE) step 3
[1] "ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/"
```
because:
```r
> url <- "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/"
> url
[1] "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/"
> GenomeInfoDb:::.normarg_ftp_dir(url)
[1] "ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/"
```
Oh, I just looked at your fix (commit a8730d7c3491d53d4d8d9b42608622bfe1f249ab); I doubt that using `ftp://` regardless of the original protocol is safe.
> And since there were other benefits in dropping the protocol part from the URL ...
What are those benefits? I don't understand why dropping the protocol should be necessary.
... another argument is that silently replacing HTTPS with FTP is considered a security issue these days.
Yes `ftp://` must be used here. `https://` or `http://` are not suitable for the purpose of `list_ftp()` (they produce a result that is too difficult to parse).

Note that `list_ftp_dir()` is for internal use and is only used on URLs that actually start with `ftp://`.
`list_ftp()` is memoized. Dropping the protocol part of the URL allows using clean keys for the caching.
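A minimal sketch of that caching scheme (hypothetical helper name; the real cache is the internal `.cached_ftp_dir_listing` environment):

```r
## Cache environment keyed by the protocol-less URL, so that
## "ftp://host/dir/" and "host/dir/" map to the same entry.
.cached_ftp_dir_listing <- new.env(parent = emptyenv())

get_listing <- function(ftp_dir, recache = FALSE) {
    key <- sub("^[^:]+://", "", ftp_dir)   # clean cache key
    listing <- .cached_ftp_dir_listing[[key]]
    if (is.null(listing) || recache) {
        doc <- RCurl::getURL(paste0("ftp://", key))
        listing <- strsplit(doc, "\n", fixed = TRUE)[[1L]]
        .cached_ftp_dir_listing[[key]] <- listing
    }
    listing
}
```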
So IIUC this doesn't work for you @HenrikBengtsson? (Using `curl` directly at the command line:)
```sh
hpages@XPS15:~$ curl -s ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/ | head
dr-xr-xr-x 2 ftp anonymous 4096 Dec 20 06:53 GCF_000001405.10_NCBI34
dr-xr-xr-x 2 ftp anonymous 4096 Dec 20 17:36 GCF_000001405.11_NCBI35
dr-xr-xr-x 2 ftp anonymous 4096 Dec 20 17:36 GCF_000001405.12_NCBI36
dr-xr-xr-x 2 ftp anonymous 4096 Dec 19 11:32 GCF_000001405.13_GRCh37
dr-xr-xr-x 2 ftp anonymous 4096 Dec 19 11:32 GCF_000001405.14_GRCh37.p2
dr-xr-xr-x 2 ftp anonymous 4096 Dec 19 11:33 GCF_000001405.17_GRCh37.p5
dr-xr-xr-x 2 ftp anonymous 4096 Dec 19 14:33 GCF_000001405.21_GRCh37.p9
dr-xr-xr-x 2 ftp anonymous 4096 Dec 19 14:33 GCF_000001405.22_GRCh37.p10
dr-xr-xr-x 2 ftp anonymous 4096 Dec 19 14:33 GCF_000001405.23_GRCh37.p11
dr-xr-xr-x 2 ftp anonymous 4096 Dec 19 14:33 GCF_000001405.24_GRCh37.p12
```
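(For reference, each line of this raw listing splits into the nine whitespace-separated fields that `list_ftp_dir(long.listing=TRUE)` returns; a rough sketch of the idea, the actual parser being `.ftp_listing_as_matrix()`:)

```r
## Split one 'ls -l'-style listing line into its 9 fields; the 9th
## field is the file/dir name that list_ftp_dir() returns by default.
line <- "dr-xr-xr-x 2 ftp anonymous 4096 Dec 20 06:53 GCF_000001405.10_NCBI34"
fields <- strsplit(line, "[ ]+")[[1L]]
fields[9L]
## [1] "GCF_000001405.10_NCBI34"
```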
> So IIUC this doesn't work for you @HenrikBengtsson?
Ah... so this issue has now split into two different issues. I'll comment on the failing `ftp://` protocol in this reply. Yes, correct, I get an HTML document when trying to access that URL:
```sh
$ curl -s ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/ | head
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><head>
<meta type="copyright" content="Copyright (C) 1996-2021 The Squid Software Foundation and contributors">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Directory: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/</title>
<style type="text/css"><!--
/*
 * Copyright (C) 1996-2021 The Squid Software Foundation and contributors
 *
 * Squid software is distributed under GPLv2+ license and includes

$ curl -s ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/ | tail
</table>
</div>
<hr>
<div id="footer">
<p>Generated Thu, 21 Dec 2023 23:46:05 GMT by sec1.wynton.ucsf.edu (squid/4.15)</p>
<!-- ERR_DIR_LISTING -->
</div>
</body></html>
```
One can definitely blame the FTP proxy server for this one. I'll investigate this a bit more, but in summary, I found the following on https://wiki.squid-cache.org/SquidFaq/InnerWorkings#can-i-make-my-regular-ftp-clients-use-a-squid-cache:
> Q. Can I make my regular FTP clients use a Squid cache?
>
> A. Nope, its not possible. Squid only accepts HTTP requests. It speaks FTP on the server-side, but not on the client-side.
So, it could very well be that Squid doesn't support transparent FTP ("download") traffic. Now, I believe Squid is a fairly common HTTP/HTTPS/FTP/RSYNC proxy, which means other Bioconductor users out there might be affected by this (if this is truly not supported by Squid).
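For anyone wanting to check whether their R session is configured to go through such a proxy, the standard proxy environment variables honored by curl/RCurl are a good first look:

```r
## Non-empty values mean traffic is routed through a proxy.
Sys.getenv(c("http_proxy", "https_proxy", "ftp_proxy", "no_proxy"))
```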
> Yes `ftp://` must be used here. `https://` or `http://` are not suitable for the purpose of `list_ftp()` (they produce a result that is too difficult to parse).
I see. But then I think passing protocols other than `ftp://` should be an error.
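Something along these lines (a hypothetical helper, just to illustrate the suggestion):

```r
## Fail early instead of silently switching protocols (hypothetical
## helper, not part of GenomeInfoDb):
assert_ftp_url <- function(url) {
    if (!grepl("^ftp://", url, ignore.case = TRUE))
        stop("list_ftp_dir() requires an ftp:// URL, got: ", url)
    invisible(url)
}
```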
> Note that `list_ftp_dir()` is for internal use and is only used on URLs that actually start with `ftp://`.
I don't think that's correct. See my original comment:
```r
> GenomeInfoDb::find_NCBI_assembly_ftp_dir("GCF_000001405.40", assembly_name = "GRCh38.p14")
Error in GenomeInfoDb::find_NCBI_assembly_ftp_dir("GCF_000001405.40",  :
  unable to find FTP dir for assembly GCF_000001405.40 in
  https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/
> traceback()
2: stop(wmsg(msg))
1: GenomeInfoDb::find_NCBI_assembly_ftp_dir("GCF_000001405.40",
       assembly_name = "GRCh38.p14")
```
Note that `GenomeInfoDb::find_NCBI_assembly_ftp_dir()` calls `list_ftp_dir()` on URLs built from:

```r
> GenomeInfoDb:::.NCBI_ALL_ASSEMBLY_FTP_DIR
[1] "https://ftp.ncbi.nlm.nih.gov/genomes/all/"
```
> `list_ftp()` is memoized. Dropping the protocol part of the URL allows using clean keys for the caching.
Maybe the following can achieve the same without dropping the URL protocol in the call to `.getURL2()`:
```diff
diff --git a/R/list_ftp_dir.R b/R/list_ftp_dir.R
index 99c41d1..ac98f90 100644
--- a/R/list_ftp_dir.R
+++ b/R/list_ftp_dir.R
@@ -91,16 +91,16 @@
 list_ftp_dir <- function(ftp_dir, subdirs.only=FALSE, long.listing=FALSE,
                          recache=FALSE)
 {
-    ftp_dir <- .normarg_ftp_dir(ftp_dir)
+    ftp_dir_sans_protocol <- .normarg_ftp_dir(ftp_dir)
     stopifnot(isTRUEorFALSE(subdirs.only))
     stopifnot(isTRUEorFALSE(long.listing))
     stopifnot(isTRUEorFALSE(recache))
-    listing <- .cached_ftp_dir_listing[[ftp_dir]]
+    listing <- .cached_ftp_dir_listing[[ftp_dir_sans_protocol]]
     if (is.null(listing) || recache) {
-        doc <- .getURL2(paste0("ftp://", ftp_dir))
+        doc <- .getURL2(ftp_dir)
         listing <- strsplit(doc, "\n", fixed=TRUE)[[1L]]
-        .cached_ftp_dir_listing[[ftp_dir]] <- listing
+        .cached_ftp_dir_listing[[ftp_dir_sans_protocol]] <- listing
     }
     if (subdirs.only)
```
> > Note that `list_ftp_dir()` is for internal use and is only used on URLs that actually start with `ftp://`.
>
> I don't think that's correct. See my original comment:
Fair enough. Forgot about that. I should probably replace

```r
.NCBI_ALL_ASSEMBLY_FTP_DIR <- "https://ftp.ncbi.nlm.nih.gov/genomes/all/"
```

with

```r
.NCBI_ALL_ASSEMBLY_FTP_DIR <- "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/"
```
I meant to say that `list_ftp_dir()` is meant to be used on URLs that actually start with `ftp://`. If not, instead of raising an error, it will replace the protocol part with `ftp://`. It's just a convenient feature that I like to have for my own personal satisfaction :wink:
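In other words, roughly this behavior (a sketch, with a hypothetical helper name):

```r
## Coerce any URL to its ftp:// form instead of erroring on a
## non-FTP protocol (hypothetical stand-in for the internal logic):
as_ftp_url <- function(url) paste0("ftp://", sub("^[^:]+://", "", url))
as_ftp_url("https://ftp.ncbi.nlm.nih.gov/genomes/all/")
## [1] "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/"
```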
Anyways, I'm not sure where this conversation is going. We need to use the FTP protocol to retrieve content from an FTP server. That's what `list_ftp_dir()` does. What can we do to help if you are behind a proxy that is messing up FTP requests? What do you propose?
> ... I should probably replace ...
Yes, I think that would help clarify things. To recap, I came here because I thought the protocol was HTTPS, based on the error message. If the error message had shown FTP, I would probably have gone down another troubleshooting path.
I would also make sure to call `RCurl::getURL()` with the URL protocol still there. That feels more robust, clarifies the code, and helps debugging.
> Anyways, I'm not sure where this conversation is going.
I think the above closes it.
> What can we do to help if you are behind a proxy that is messing up FTP requests? What do you propose?
There are a few options, outside this particular issue. For starters, if it is as I suspect and Squid doesn't support "transparent" FTP listings, that could be mentioned/documented (in a help page). I need to verify that first. In the long run, it could be worth supporting parsing of the HTTPS output, especially since the world is slowly moving away from FTP, but also because I suspect there are other users out there behind Squid proxies.
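For illustration only (not a concrete proposal), extracting subdirectory names from an HTML index page could look something like this, assuming the server emits plain `<a href="...">` entries:

```r
## Pull href targets ending in "/" (subdirectories) out of an HTML
## directory listing; 'doc' stands in for the page body fetched over HTTPS.
doc <- '<a href="GCF_000001405.40_GRCh38.p14/">GCF_000001405.40_GRCh38.p14/</a>'
hrefs <- regmatches(doc, gregexpr('href="[^"]+/"', doc))[[1L]]
sub('/"$', "", sub('^href="', "", hrefs))
## [1] "GCF_000001405.40_GRCh38.p14"
```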
Happy to take a PR that clarifies the FTP + Squid gotcha since I don't have access to that setup.
> In the long run, it could be worth supporting parsing of the HTTPS output, especially since the world is slowly moving away from FTP
I thought of that. But Ensembl and NCBI still officially serve data via FTP servers, so we are stuck with that for now. They might run an HTTP server on their FTP server, for convenient browsing of the data, but that's just icing on the cake. Going through this instead of FTP would be VERY fragile, since the HTTP server can add all kinds of decorations to the served pages compared to the raw FTP dir listing. In particular, this means that any time they change their HTTP setup there would be a risk that they break `list_ftp_dir()`.

All this to say that I don't think that FTP access to their FTP server is going anywhere soon. But who knows...
Issue
Troubleshooting
I can reproduce this on systems where there is an FTP/HTTP/HTTPS proxy in place. Specifically, we have proxy environment variables (`http_proxy`, `https_proxy`, `ftp_proxy`) set up. The proxy server runs Squid 4.15.
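For illustration, a hypothetical configuration of this kind (the actual host and port on our systems differ):

```r
## Hypothetical proxy settings, shown via R for reproducibility; on the
## affected systems these are set in the shell environment instead.
Sys.setenv(
    http_proxy  = "http://proxy.example.org:3128",
    https_proxy = "http://proxy.example.org:3128",
    ftp_proxy   = "http://proxy.example.org:3128"
)
```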
Digging deeper, I notice that the captured output from `RCurl::getURL()` is corrupted when running via the proxy, whereas it works when not running via the proxy. In turn, the reason for this is that `list_ftp_dir()` drops the URI protocol from the URL at the very beginning, by calling `.normarg_ftp_dir()`; see https://github.com/Bioconductor/GenomeInfoDb/blob/20b9758f9061d247d6f7bcb8217cdf22fbeedaf1/R/list_ftp_dir.R#L91-L94

Because of this, `.getURL2()` (https://github.com/Bioconductor/GenomeInfoDb/blob/20b9758f9061d247d6f7bcb8217cdf22fbeedaf1/R/list_ftp_dir.R#L101) is called without the `https:` protocol, i.e. on `ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/`. When passing such a protocol-less URL to a proxy server, the server cannot perform a fully transparent call, resulting in different content being returned. This affects the `GenomeInfoDb::list_ftp_dir()` -> `GenomeInfoDb:::.getURL2()` -> `RCurl::getURL()` call chain.

Here is what we'd expect to see:

However, without the `https:` protocol in the URL, `RCurl::getURL()` operates on:

Some more details on what the proxy server does when there's no URI protocol:
Conclusion
I'm pretty sure that the correct way to handle this is to not drop the URI protocol, as deliberately done in:
https://github.com/Bioconductor/GenomeInfoDb/blob/20b9758f9061d247d6f7bcb8217cdf22fbeedaf1/R/list_ftp_dir.R#L6-L20
Related
If the Bioconductor check servers run via a proxy server too, the above would explain the current check errors on Bioc release and Bioc devel.

I suspect that this proxy kerfuffle is also behind the errors reported in issues #78 and #98.

EDIT: Added more details on the proxy server and what the returned HTTP header is when there's no URI protocol.