Bioconductor / AnnotationForge

Tools for building SQLite-based annotation data packages
https://bioconductor.org/packages/AnnotationForge
4 stars 9 forks source link

timeout adjustment not respected #49

Open lshep opened 1 year ago

lshep commented 1 year ago

This may not be needed based on the recent PR but when running, a timeout occurs and setting options(timeout) does not work

processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Please be patient while we work out which organisms can be annotated
  with ensembl IDs.
TaxID: 9646
TaxID: 8839
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'table' in selecting a method for function '%in%': Timeout was reached: [www.ensembl.org:443] Operation timed out after 10001 milliseconds with 0 bytes received

> options('timeout')
$timeout
[1] 1e+16
jmacdon commented 1 year ago

@lshep I think that's just an error from bmRequest in biomaRt, which has a hard-coded timeout of 10

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'table' in selecting a method for function '%in%': Timeout was reached: [www.ensembl.org:443] Operation timed out after 10004 milliseconds with 0 bytes received
> traceback()
21: h(simpleError(msg, call))
20: .handleSimpleError(function (cond) 
    .Internal(C_tryCatchHelper(addr, 1L, cond)), "Timeout was reached: [www.ensembl.org:443] Operation timed out after 10004 milliseconds with 0 bytes received", 
        base::quote(curl::curl_fetch_memory(url, handle = handle)))
19: curl::curl_fetch_memory(url, handle = handle)
18: request_fetch.write_memory(req$output, req$url, handle)
17: request_fetch(req$output, req$url, handle)
16: request_perform(req, hu$handle$handle)
15: httr::GET(request, config = httr_config, content_type("text/plain"), 
        timeout(10))
14: bmRequest(request = request, httr_config = martHTTRConfig(mart), 
        verbose = verbose)

And

> getAnywhere( bmRequest)
A single object matching 'bmRequest' was found
It was found in the following places
  namespace:biomaRt
with value

function (request, httr_config, verbose = FALSE) 
{
    if (verbose) 
        message("Attempting web service request:\n", request)
    result <- httr::GET(request, config = httr_config, content_type("text/plain"), 
        timeout(10))
    stop_for_status(result)
    result2 <- content(result, encoding = "UTF-8")
    if (is.na(result2)) {
        result2 <- content(result, encoding = "Latin1")
    }
    return(result2)
}
<bytecode: 0x00000222cb894bd8>
<environment: namespace:biomaRt>
lshep commented 1 year ago

@grimbough is this true? Is there a way we could adjust the timeout or is there a reason its hardcoded?

grimbough commented 1 year ago

The 10 second timeout is only found in part of the code which tries to test if the Ensembl server is alive. It sends a really small query, waits 10 seconds and if it doesn't hear anything back it tries another mirror. I hardcoded it to 10 because anything larger feel like eternity to the end user, and in my experience a delay of 10 seconds already means there's a problem and trying an alternative mirror is a better option. It doesn't feel like something the user gain much by having control over.

lshep commented 1 year ago

so then this occurs when I can't connect to ensembl? Do you know if they have a hit limit?

grimbough commented 1 year ago

I don't think so. They do for the Esembl REST API, but I don't think BioMart has any such restriction. Also, if my strategy works it should mean that you'd cycle through all 4 mirros (assuming they're alive) and I'd hope that stopped any rate limiting.

How many requests are you making? Is this happening after a while of successful queries?

jmacdon commented 1 year ago

@grimbough In my experience it's a pretty rare occurrence. It happens about as often as a simple attempt to generate a mart instance fails, after having cycled through all four marts. The issue here is that it will occur after a very long process (maybe an hour or so), which makes it that much worse, because the only recourse is to then rerun, which while shorter the second time through is still maybe 20-30 minutes.

jmacdon commented 1 year ago

@lshep Maybe a workaround is to do a tryCatch and cycle through maybe 5 times in case it's just an intermittent connection issue.

lshep commented 1 year ago

I'm increasing the sleep in between each query now -- I'm hoping the recently submitted PR might also fix this issue for the future.

lshep commented 1 year ago

Keep getting errors ... sometimes with a 443 sometimes 503 error .... its when trying to run the makeOrgPackageFromNCBI example

>     makeOrgPackageFromNCBI(version = "0.1",author = "Some One <so@someplace.org>",
                           maintainer = "Some One <so@someplace.org>",
                           outputDir = ".",
                           tax_id = "59729",
                           genus = "Taeniopygia",
                           species = "guttata")
If files are not cached locally this may take awhile to assemble a 33 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.Please also see AnnotationHub for some pre-builtOrgDb downloads
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
rebuilding the cache
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
rebuilding the cache
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
rebuilding the cache
extracting data for our organism from : gene_info
getting data for gene2go.gz
rebuilding the cache
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Please be patient while we work out which organisms can be annotated
  with ensembl IDs.
TaxID: 9646
TaxID: 8839
Ensembl site unresponsive, trying uswest mirror
Error in (function (cond)  : 
  error in evaluating the argument 'table' in selecting a method for function '%in%': Service Unavailable (HTTP 503).
In addition: Warning message:
call dbDisconnect() when finished working with a connection 

I've been trying multiple times since last week and just can't get it to complete

jmacdon commented 1 year ago

@lshep It looks like you are using the extant version of AnnotationForge rather than knokknok's PR. The reason knokknok did the PR is because the existing default is to ignore the requested tax_id and just iterate through all the tax_ids that are available on the Biomart and see if there are any NCBI Gene ID -> Ensembl Gene ID mappings. There are 69 species that end up getting queried, so that's 69 times that a mart instance has to be instantiated and queried, and thus 69 times that there could be an intermittent failure to instantiate the mart instance.

This is pointless and wasteful, and the PR instead just does the query for the requested tax_id, reducing the possibility of failure to one instance.

lshep commented 1 year ago

Thanks @jmacdon Yes thats why I said it may be taken care of with PR, and no I wasn't using it until you approved it for merging.

jmacdon commented 1 year ago

OK. Even with the PR it can still be a problem.

processing GO data
Ensembl site unresponsive, trying uswest mirror
Error in (function (cond)  : 
  error in evaluating the argument 'table' in selecting a method for function '%in%': Service Unavailable (HTTP 503).
> traceback()
12: (function (cond) 
    .Internal(C_tryCatchHelper(addr, 1L, cond)))(structure(list(message = "Service Unavailable (HTTP 503).", 
        call = bmRequest(request = request, httr_config = httr_config, 
            verbose = verbose)), class = c("http_503", "http_500", 
    "http_error", "error", "condition")))
11: stop(http_condition(x, "error", task = task, call = call))
10: stop_for_status(result)
9: bmRequest(request = request, httr_config = httr_config, verbose = verbose)
8: .listMarts(mart = mart, host = host, verbose = verbose, httr_config = httr_config, 
       port = port, ensemblRedirect = ensemblRedirect)
7: .listEnsembl(version = version, GRCh = GRCh, mirror = mirror)
6: biomaRt::useEnsembl("ensembl", version = release)

And immediately after I got this error I reran the offending code and it worked.

> AnnotationForge:::available.ensembl.datasets("59729")
TaxID: 59729
                  59729 
"tguttata_gene_ensembl" 

@grimbough This is a different error than what @lshep orignally got though. Instead of a timeout I got a 503 error, which is what she got from her last attempt as well.

grimbough commented 1 year ago

503 is "service unavailable" which feels pretty similar to a timeout, and I suspect they're related causes on the Ensembl side. I'm currently building my own copy of the database (you're correct that it takes a frustrating amount of time!). I'll do some testing tomorrow once it's finished and I'm back in the office.

jmacdon commented 1 year ago

@grimbough Most of the processing is orthogonal to the Ensembl piece (but to speed things up a bit after having run once and ideally saved all the files including the NCBI.sqlite db, you can use rebuildCache = FALSE).

But I wouldn't use the current version of the package because of the repeated querying of Biomart for other random species. You would be better off using the devel version after @lshep merges, or just install knokknok's version first (it's knokknok/AnnotationForge, if you use BiocManager). Then you can test directly using AnnotationForge:::available.ensembl.datasets("59729")

lshep commented 1 year ago

@jmacdon / @grimbough even with using the PR code I still get errors

> options(timeout=10000000000000)
> library(devtools)
> load_all("/home/lorikern/BioconductorPackages/SoftwarePkg/AnnotationForge")
## PR branch of package 
>     makeOrgPackageFromNCBI(version = "0.1",author = "Some One <so@someplace.org>",
                           maintainer = "Some One <so@someplace.org>",
                           outputDir = ".",
                           tax_id = "59729",
                           genus = "Taeniopygia",
                           species = "guttata")
If files are not cached locally this may take awhile to assemble a 33 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.Please also see AnnotationHub for some pre-builtOrgDb downloads
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
rebuilding the cache
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
rebuilding the cache
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
rebuilding the cache
extracting data for our organism from : gene_info
getting data for gene2go.gz
rebuilding the cache
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
TaxID: 59729
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'table' in selecting a method for function '%in%': Timeout was reached: [www.ensembl.org:443] Operation timed out after 10001 milliseconds with 0 bytes received
In addition: Warning messages:
1: In download.file(url, tmp, quiet = TRUE, mode = "wb") :
  NAs introduced by coercion to integer range
2: In download.file(url, tmp, quiet = TRUE, mode = "wb") :
  NAs introduced by coercion to integer range
3: In download.file(url, tmp, quiet = TRUE, mode = "wb") :
  NAs introduced by coercion to integer range
4: In download.file(url, tmp, quiet = TRUE, mode = "wb") :
  NAs introduced by coercion to integer range
5: In download.file(url, tmp, quiet = TRUE, mode = "wb") :
  NAs introduced by coercion to integer range
6: In download.file(url, dest, quiet = TRUE) :
  NAs introduced by coercion to integer range
jmacdon commented 1 year ago

This is an issue with getting a connection to the Biomart server, rather than AnnotationHub. As you can see from my previous posts, this is due to a quick test (bmRequest) that is meant to check if the connection can be established. To further clarify, consider this

## function to catch errors via Martin Maechler
 tryCatch.W.E <- function(expr)
{
    W <- NULL
    w.handler <- function(w){ # warning handler
    W <<- w
    invokeRestart("muffleWarning")
    }
    list(value = withCallingHandlers(tryCatch(expr, error = function(e) e),
                     warning = w.handler),
     warning = W)
}

> zlst <- lapply(1:50, function(x) tryCatch.W.E(useEnsembl("ensembl")))
> table(sapply(zlst, function(x) is(x$value, "simpleError")))

FALSE  TRUE 
   48     2 

In this test I get a 4% failure rate to connect. Not sure if it's relevant, but the two errors were the first two. If I rerun it seems OK:

> zlst <- lapply(1:50, function(x) tryCatch.W.E(useEnsembl("ensembl")))
Ensembl site unresponsive, trying uswest mirror
Ensembl site unresponsive, trying useast mirror
Ensembl site unresponsive, trying asia mirror
Ensembl site unresponsive, trying useast mirror
Ensembl site unresponsive, trying uswest mirror
Ensembl site unresponsive, trying useast mirror
Ensembl site unresponsive, trying useast mirror
> table(sapply(zlst, function(x) is(x$value, "simpleError")))

FALSE 
   50 

There are two places that I know of where a connection is established to the Biomart server, and I think wrapping in something like

fail <- TRUE
while(fail) {
       mart <- tryCatch.W.E(useEnsembl("ensembl", datSet))
       fail <- is(mart$value, "simpleError")
}
mart <- mart$value
...

Would solve the problem.

grimbough commented 1 year ago

Looking at this more closely, I don't think the error is coming from useEnsembl(). You shouldn't ever see the Timeout or 503 response, even if it totally fails. If it checks all 4 mirrors and is unsuccessful you should end up with an error that says Unable to query any Ensembl site. That's what this test is checking for: https://github.com/grimbough/biomaRt/blob/devel/tests/testthat/test_ensemblFunctions.R#L29-L60

Given the rest of the error message, I think the error you're seeing is actually from the call to listDatasets() https://github.com/knokknok/AnnotationForge/blob/devel/R/makeOrgPackageFromNCBI.R#L1571-L1574

This also has the 10 second timeout, because it's typically just downloading a relatively small table, and in theory we've already tested that the server is working, but clearly that's not sufficient here. I can modify that to respect the options(timeout = XX), although I suspect if you have to wait more than 10 seconds you might be waiting a long time for a response. I could also set it to try and cache the table of available datasets, so you only have to make a valid connection once for that particular Mart and Ensembl version.

lshep commented 1 year ago

Thank you!

grimbough commented 1 year ago

I've committed a patch in biomaRt version 2.55.4 that means bmRequest() will use the timeout set in options(timeout). It won't help if the server is really struggling, but maybe it's an improvement. I'll work on the cache mechanism when I have more time.

lshep commented 1 year ago

@grimbough I tried with the latest 2.55.4 and knokknok-devel branch of AnnotationForge but I got a bad argument ERROR now?

processing accession data
processing GO data
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'table' in selecting a method for function '%in%': A libcurl function was given a bad argument
In addition: Warning messages:
1: In download.file(url, tmp, quiet = TRUE, mode = "wb") :
  NAs introduced by coercion to integer range
2: In download.file(url, tmp, quiet = TRUE, mode = "wb") :
  NAs introduced by coercion to integer range
3: In download.file(url, tmp, quiet = TRUE, mode = "wb") :
  NAs introduced by coercion to integer range
4: In download.file(url, tmp, quiet = TRUE, mode = "wb") :
  NAs introduced by coercion to integer range
5: In download.file(url, tmp, quiet = TRUE, mode = "wb") :
  NAs introduced by coercion to integer range
6: In download.file(url, dest, quiet = TRUE) :
  NAs introduced by coercion to integer range
7: In curl::handle_setopt(handle, .list = req$options) :
  NAs introduced by coercion to integer range
> traceback()
14: h(simpleError(msg, call))
13: .handleSimpleError(function (cond) 
    .Internal(C_tryCatchHelper(addr, 1L, cond)), "A libcurl function was given a bad argument", 
        base::quote(curl::handle_setopt(handle, .list = req$options)))
12: curl::handle_setopt(handle, .list = req$options)
11: request_perform(req, hu$handle$handle)
10: httr::GET(request, config = httr_config, content_type("text/plain"), 
        timeout(options("timeout")$timeout))
9: bmRequest(request = request, httr_config = httr_config, verbose = verbose)
8: .listMarts(mart = mart, host = host, verbose = verbose, httr_config = httr_config, 
       port = port, ensemblRedirect = ensemblRedirect)
7: .listEnsembl(version = version, GRCh = GRCh, mirror = mirror)
6: biomaRt::useEnsembl("ensembl", version = release)
5: available.ensembl.datasets(tax_id, release = ensemblVersion)
4: tax_id %in% names(available.ensembl.datasets(tax_id, release = ensemblVersion))
3: prepareDataFromNCBI(tax_id, NCBIFilesDir, outputDir, rebuildCache, 
       verbose, ensemblVersion = ensemblVersion)
2: NEW_makeOrgPackageFromNCBI(version, maintainer, author, outputDir, 
       tax_id, genus, species, NCBIFilesDir, databaseOnly, rebuildCache = rebuildCache, 
       verbose = verbose, ensemblVersion = ensemblVersion)
1: makeOrgPackageFromNCBI(version = "0.1", author = "Some One <so@someplace.org>", 
       maintainer = "Some One <so@someplace.org>", outputDir = ".", 
       tax_id = "59729", genus = "Taeniopygia", species = "guttata")
grimbough commented 1 year ago

It's been working on mine, but perhaps something is different on our systems. I've not seen that error before. I've just changed how I access the timeout option. Try installing from grimbough/biomaRt again (it's still 2.55.4) and give it another try. I have no idea if cURL might complain if you give a massive timeout option. If you're trying with a huge number, perhaps you could set it to 300 seconds or something.

Maybe also just try ensembl <- useEnsembl("ensembl"); listEnsembl( ensembl, verbose = TRUE) to strip out anything else going on.

lshep commented 1 year ago

Running that works fine its when I try to run the makeOrgPackageFromNCBI that fails --- I put my timeout to something obscene --

    library(AnnotationForge)
    options(timeout=10000000000)
    makeOrgPackageFromNCBI(version = "0.1",author = "Some One <so@someplace.org>",
                           maintainer = "Some One <so@someplace.org>",
                           outputDir = ".",
                           tax_id = "59729",
                           genus = "Taeniopygia",
                           species = "guttata")
knokknok commented 1 year ago

Looks like you were indeed too obscene 😉

You have this warning: 1: In download.file(url, tmp, quiet = TRUE, mode = "wb") : NAs introduced by coercion to integer range

which probably comes from:

> as.integer(10000000000)
[1] NA
Warning message:
NAs introduced by coercion to integer range
lshep commented 1 year ago

Hahaha -- okay guess too aggressive is a thing!