Bioconductor / GenomeInfoDb

Utilities for manipulating chromosome names, including modifying them to follow a particular naming style
https://bioconductor.org/packages/GenomeInfoDb

Error with Seqinfo #7

Closed yesitsjess closed 3 years ago

yesitsjess commented 5 years ago

Hi, I'm trying to use the CERES package which depends on GenomeInfoDb. It fails due to the following error:

> Seqinfo(genome="hg19")
Error in function (type, msg, asError = TRUE) :
  Failed connect to ftp.ncbi.nlm.nih.gov:21; Connection refused

I've checked fetchExtendedChromInfoFromUCSC and it appears to be supported - is this a problem with my proxy settings? If so could I download and point the function to a file instead?

lshep commented 5 years ago

Someone might have a better thought, but: what versions of R and Bioconductor are you using? Please post the output of sessionInfo().
Currently, when I run the same call, I get information back:

> library(GenomeInfoDb)
> Seqinfo(genome="hg19")
Seqinfo object with 93 sequences (1 circular) from hg19 genome:
  seqnames       seqlengths isCircular genome
  chr1            249250621      FALSE   hg19
  chr2            243199373      FALSE   hg19
  chr3            198022430      FALSE   hg19
  chr4            191154276      FALSE   hg19
  chr5            180915260      FALSE   hg19
  ...                   ...        ...    ...
  chrUn_gl000245      36651      FALSE   hg19
  chrUn_gl000246      38154      FALSE   hg19
  chrUn_gl000247      36422      FALSE   hg19
  chrUn_gl000248      39786      FALSE   hg19
  chrUn_gl000249      38502      FALSE   hg19

so a proxy setting could well be the cause. Do you know if you have issues connecting to other websites or datasets besides this one? Are you running from an institution that might have a firewall and proxy set up?
There is some information about proxies in the download.file help page that might be useful, as well as some pages I found about setting a proxy globally for R: Rstudio proxy, Sys.env for proxy, and Proxy settings for R.

yesitsjess commented 5 years ago

> sessionInfo()
R version 3.5.0 (2018-04-23)
GenomeInfoDb_1.18.2
BiocInstaller_1.32.1

I'm definitely behind a firewall, but I've set up my proxy settings using

Sys.setenv(http_proxy="proxy")
Sys.setenv(https_proxy="proxy")

Using the curl package works fine, e.g. readLines(curl(url="http://www.google.co.uk"))

It's a bit of a pain, but I'd be happy to download the correct information from UCSC directly and change the function in CERES so it doesn't try to connect anymore, I'm just not sure which is the correct file for hg19 from their downloads page.

EDIT: done a bit of digging, think the file is this one so will have a go at pointing at that now

mtmorgan commented 5 years ago

It would be helpful to debug this. I believe the essential code is

url = "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz"
download.file(url, tempfile(), quiet = TRUE)

I would focus on getting that to work. I believe the http_proxy setting should not be "proxy", but rather the IP address of the proxy server, from the help page:

The form of 'http_proxy' should be 'http://proxy.dom.com/' or
 'http://proxy.dom.com:8080/' where the port defaults to '80' and
 the trailing slash may be omitted.

I would also experiment with setting options(download.file.method = "libcurl") or "wininet".
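
One way to run that experiment is to try the same download with each method and record which ones succeed. This is only a rough sketch: probe_methods is a hypothetical helper, not part of base R or GenomeInfoDb, and "wininet" is only available on Windows, so it is included conditionally.

```r
# Hypothetical helper: try downloading `url` with each download.file() method
# and return a named logical vector saying which methods succeeded.
probe_methods <- function(url, methods) {
  vapply(methods, function(m) {
    tryCatch(
      # download.file() returns 0 (invisibly) on success
      download.file(url, tempfile(), method = m, quiet = TRUE, mode = "wb") == 0L,
      error = function(e) FALSE,
      warning = function(w) FALSE
    )
  }, logical(1))
}

# Usage (needs network access and working proxy settings):
# url <- "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz"
# methods <- c("libcurl", if (.Platform$OS.type == "windows") "wininet")
# probe_methods(url, methods)
```

Any method that comes back TRUE can then be set globally with options(download.file.method = ...).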

yesitsjess commented 5 years ago

No problem with that:

> url = "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz"
> download.file(url, tempfile())
trying URL 'http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz'
Content type 'application/x-gzip' length 837 bytes
==================================================
downloaded 837 bytes

Also, I know the value isn't literally "proxy", but I didn't think it was a good idea to post my proxy details externally ;) Perhaps I'm being paranoid, and sorry for being unclear.

I think something very strange is going on... if I repeatedly spam the same command it does occasionally work.

> Sys.setenv(http_proxy="http://proxy:port")
> Seqinfo(genome="hg19")
Error in function (type, msg, asError = TRUE) :
  Failed connect to ftp.ncbi.nlm.nih.gov:21; Connection refused
> Sys.setenv(http_proxy="http://proxy:port")
> Seqinfo(genome="hg19")
Error in function (type, msg, asError = TRUE) :
  Failed connect to ftp.ncbi.nlm.nih.gov:21; Connection refused
> Sys.setenv(http_proxy="http://proxy:port")
> Seqinfo(genome="hg19")
Seqinfo object with 93 sequences (1 circular) from hg19 genome:
  seqnames       seqlengths isCircular genome
  chr1            249250621      FALSE   hg19
  chr2            243199373      FALSE   hg19
  chr3            198022430      FALSE   hg19
  chr4            191154276      FALSE   hg19
  chr5            180915260      FALSE   hg19
  ...                   ...        ...    ...
  chrUn_gl000245      36651      FALSE   hg19
  chrUn_gl000246      38154      FALSE   hg19
  chrUn_gl000247      36422      FALSE   hg19
  chrUn_gl000248      39786      FALSE   hg19
  chrUn_gl000249      38502      FALSE   hg19

So I can tell it's definitely something on my end, not yours. Thanks for your help :)
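
Since the call does occasionally succeed when repeated, a stopgap is to automate the retries rather than re-running by hand. A minimal sketch, assuming the failure is transient: retry is a hypothetical helper written for illustration, not a GenomeInfoDb or RCurl function, and it does nothing about the underlying FTP connection problem.

```r
# Hypothetical helper: call `f()` until it succeeds or `times` attempts have
# been made, pausing `wait` seconds between attempts.
retry <- function(f, times = 5L, wait = 1) {
  last_error <- NULL
  for (attempt in seq_len(times)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    last_error <- result
    if (attempt < times) Sys.sleep(wait)
  }
  stop("all ", times, " attempts failed; last error: ",
       conditionMessage(last_error))
}

# Usage, assuming GenomeInfoDb is installed and the proxy is already configured:
# si <- retry(function() GenomeInfoDb::Seqinfo(genome = "hg19"), times = 10L)
```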

yesitsjess commented 5 years ago

If anyone's aware of a workaround that uses something closer to download.file(url, tempfile()), please let me know. That call never fails, and the fact that the other one fails 90% of the time is very frustrating.
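
One possible direction, sketched here with several assumptions, is to build a Seqinfo object by hand from the UCSC chromInfo file fetched with download.file. The helper names read_chrom_info and seqinfo_from_chrom_info are hypothetical. The sketch assumes the file is tab-separated with three columns (chromosome name, size, file name) and that chrM is the only circular sequence in hg19. The result will not necessarily match Seqinfo(genome="hg19") exactly (for example in sequence order), and it does not change what other packages call internally.

```r
# Hypothetical helper: read a UCSC chromInfo file (plain or gzipped).
# Assumption: tab-separated columns are chromosome name, size, file name.
read_chrom_info <- function(path) {
  read.table(path, sep = "\t", header = FALSE, stringsAsFactors = FALSE,
             col.names = c("chrom", "size", "fileName"))
}

# Hypothetical helper: build a Seqinfo object from that table.
# Assumption: chrM is the only circular sequence.
seqinfo_from_chrom_info <- function(chrom_info, genome = "hg19") {
  GenomeInfoDb::Seqinfo(
    seqnames   = chrom_info$chrom,
    seqlengths = chrom_info$size,
    isCircular = chrom_info$chrom == "chrM",
    genome     = genome
  )
}

# Usage (needs network access and GenomeInfoDb installed):
# url <- "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz"
# dest <- tempfile(fileext = ".txt.gz")
# download.file(url, dest, quiet = TRUE, mode = "wb")
# si <- seqinfo_from_chrom_info(read_chrom_info(dest))
```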

mtmorgan commented 5 years ago

After it fails, what does traceback() say?

yesitsjess commented 5 years ago

> traceback()
14: fun(structure(list(message = msg, call = sys.call()), class = c(typeName, 
        "GenericCurlError", "error", "condition")))
13: function (type, msg, asError = TRUE) 
    {
        if (!is.character(type)) {
            i = match(type, CURLcodeValues)
            typeName = if (is.na(i)) 
                character()
            else names(CURLcodeValues)[i]
        }
        typeName = gsub("^CURLE_", "", typeName)
        fun = (if (asError) 
            stop
        else warning)
        fun(structure(list(message = msg, call = sys.call()), class = c(typeName, 
            "GenericCurlError", "error", "condition")))
    }(7L, "Failed connect to ftp.ncbi.nlm.nih.gov:21; Connection refused", 
        TRUE)
12: curlPerform(curl = curl, .opts = opts, .encoding = .encoding)
11: getURL(url)
10: list_ftp_dir(url)
9: .make_assembly_report_URL(assembly_accession)
8: fetch_assembly_report(assembly_accession, AssemblyUnits = AssemblyUnits)
7: FUN(genome = names(SUPPORTED_UCSC_GENOMES)[idx], circ_seqs = supported_genome$circ_seqs, 
       assembly_accession = supported_genome$assembly_accession, 
       AssemblyUnits = supported_genome$AssemblyUnits, special_mappings = supported_genome$special_mappings, 
       unmapped_seqs = supported_genome$unmapped_seqs, drop_unmapped = supported_genome$drop_unmapped, 
       goldenPath_url = goldenPath_url, quiet = quiet)
6: fetchExtendedChromInfoFromUCSC(genome, goldenPath_url = goldenPath_url, 
       quiet = TRUE)
5: .fetch_sequence_info_for_UCSC_genome(genome)
4: fetchSequenceInfo(genome)
3: .class1(object)
2: as(fetchSequenceInfo(genome), "Seqinfo")
1: Seqinfo(genome = "hg19")

mtmorgan commented 5 years ago

So then the problematic call looks like it is

url = "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/"
res = RCurl::getURL(url)

hpages commented 3 years ago

This is an old issue and the OP didn't follow up so I'm closing it.