Bioconductor / AnnotationForge

Tools for building SQLite-based annotation data packages
https://bioconductor.org/packages/AnnotationForge

makeOrgPackageFromNCBI() and useDeprecatedStyle=TRUE can fail due to non-existent content at b2gfar.org #4

Closed hexaflexa closed 4 years ago

hexaflexa commented 5 years ago

As already mentioned in the preface of a different issue, makeOrgPackageFromNCBI() with useDeprecatedStyle = TRUE is potentially useful when other Bioconductor packages (that need an OrgDb annotation file for a non-model organism) access the underlying SQLite tables directly instead of using the newer AnnotationDbi select() interface.
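To illustrate what "accessing the underlying SQLite tables" looks like in practice, here is a minimal sketch using DBI and RSQLite against an in-memory database. The table name `genes` and its columns are made up for illustration and are not claimed to match the schema of any real OrgDb package; the point is only that code written this way depends on exact table and field names, which is why a schema change can break it.

```r
library(DBI)

# Toy in-memory database standing in for an OrgDb SQLite file.
# The table and column names below are hypothetical.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "genes",
             data.frame(id = 1:2,
                        gene_id = c("1001", "1002"),
                        stringsAsFactors = FALSE))

# Direct table access: this query hard-codes the table and field names,
# so it fails if they are renamed or change case-sensitive spelling.
hit <- dbGetQuery(con, "SELECT id, gene_id FROM genes WHERE gene_id = '1002'")
tables <- dbListTables(con)

dbDisconnect(con)
```

Packages that go through select() are insulated from such naming details, while packages that issue SQL like the query above are not.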

When data is not available in NCBI gene2go, the code attempts to fetch alternative GO data from b2gfar.org. The code fails at this point because the content at b2gfar.org no longer exists.
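As a rough sketch of how a fallback download could degrade gracefully instead of stopping the whole build, the helper below retries a download and returns NULL when every attempt fails. The function name `try_download` and its arguments are hypothetical and are not part of AnnotationForge; this is not how the package's internal .tryDL() is implemented, only one possible pattern using base R.

```r
# Hypothetical helper (not part of AnnotationForge): try a download a few
# times and return NULL on failure rather than raising an error.
try_download <- function(url, dest, attempts = 4L) {
  for (i in seq_len(attempts)) {
    ok <- tryCatch({
      status <- suppressWarnings(
        utils::download.file(url, dest, quiet = TRUE, mode = "wb")
      )
      status == 0L            # download.file() returns 0 on success
    }, error = function(e) FALSE)
    if (isTRUE(ok)) return(dest)
  }
  NULL
}

# Demonstration with a URL that cannot resolve (a missing local file),
# standing in for the unreachable b2gfar.org resource.
res <- try_download("file:///no/such/dir/312017.annot.zip",
                    tempfile(fileext = ".zip"), attempts = 2L)
if (is.null(res)) {
  message("GO fallback unavailable; continuing without GO data")
}
```

A caller could then decide whether to leave the GO table empty or abort, depending on how essential GO annotations are for the package being built.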

Here, I use a non-model organism (Tetrahymena thermophila, tax_id = 312017) as a test case to demonstrate the issue / error:

getting blast2GO data as a substitute for gene2go
Error in .tryDL(url, tmp) : url access failed after
4
attempts; url:
http://www.b2gfar.org/_media/species:data:312017.annot.zip

A log with the traceback() and sessionInfo() output is included below:

> library(AnnotationForge)
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
    colnames, colSums, dirname, do.call, duplicated, eval, evalq,
    Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
    lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int,
    pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames,
    rowSums, sapply, setdiff, sort, table, tapply, union, unique,
    unsplit, which, which.max, which.min

Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: AnnotationDbi
Loading required package: stats4
Loading required package: IRanges
Loading required package: S4Vectors

Attaching package: 'S4Vectors'

The following object is masked from 'package:base':

    expand.grid

Attaching package: 'IRanges'

The following object is masked from 'package:grDevices':

    windows

# A setwd() command to change the working directory is not included here
# since it will be different for each person
#
# I purposely set rebuildCache to FALSE because I already have the 
# data downloaded from NCBI and the rebuild takes a long time
# which I don't want during debugging
#
# I also purposely set useDeprecatedStyle = TRUE because 
# OrgDb packages created using the default useDeprecatedStyle = FALSE
# have fewer SQLite tables, different spellings of some table fieldnames,
# as well as different upper case vs lower case.  These differences can
# break some Bioconductor packages which don't use the select() interface 
# but instead access the underlying tables using dbConnect / dbGetQuery
#
# For this organism, there is apparently no data in the gene2go from NCBI,
# so it tries an alternate source: http://www.b2gfar.org/_media/species:data:312017.annot.zip
# but b2gfar.org no longer exists

> makeOrgPackageFromNCBI(version="0.0.1",
+                        author = "First Last Name <email@address.com>",
+                        maintainer = "First Last Name <email@address.com>",
+                        outputDir = ".",
+                        tax_id = "312017",
+                        genus = "Tetrahymena",
+                        species= "thermophila",
+                        rebuildCache = FALSE,
+                        useDeprecatedStyle = TRUE)
If this is the 1st time you have run this function, it may take, a long time (over an hour) to download needed files and assemble a 12 GB cache databse in the NCBIFilesDir directory.  Subsequent calls to this function should be faster (seconds).  The cache will try to rebuild once per day.
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
Populating gene2pubmed table:
table gene2pubmed filled
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
Populating gene2accession table:
table gene2accession filled
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
Populating gene2refseq table:
table gene2refseq filled
getting data for gene2unigene
extracting data for our organism from : gene2unigene
getting all data for our organism from : gene2unigene
Populating gene2unigene table:
table gene2unigene filled
getting data for gene_info.gz
extracting data for our organism from : gene_info
Populating gene_info table:
table gene_info filled
getting data for gene2go.gz
extracting data for our organism from : gene2go
Populating gene2go table:
getting blast2GO data as a substitute for gene2go
Error in .tryDL(url, tmp) : url access failed after
4
attempts; url:
http://www.b2gfar.org/_media/species:data:312017.annot.zip
In addition: There were 20 warnings (use warnings() to see them)

> traceback()
8: stop(paste(strwrap(msg, exdent = 2), collapse = "\n"))
7: .tryDL(url, tmp)
6: .getBlast2GOData(tax_id, con)
5: .createTEMPNCBIBaseTable(con, files[i], tax_id, NCBIFilesDir = NCBIFilesDir,
       rebuildCache = rebuildCache, verbose = verbose)
4: .setupBaseDBFromDLs(files, tax_id, con, NCBIFilesDir = NCBIFilesDir,
       rebuildCache = rebuildCache, verbose = verbose)
3: makeOrgDbFromNCBI(tax_id = tax_id, genus = genus, species = species,
       NCBIFilesDir = NCBIFilesDir, outputDir, rebuildCache)
2: OLD_makeOrgPackageFromNCBI(version, maintainer, author, outputDir,
       tax_id, genus, species, NCBIFilesDir, rebuildCache = rebuildCache)
1: makeOrgPackageFromNCBI(version = "0.0.1", author = "First Last Name <email@address.com>",
       maintainer = "First Last Name <email@address.com>", outputDir = ".",
       tax_id = "312017", genus = "Tetrahymena", species = "thermophila",
       rebuildCache = FALSE, useDeprecatedStyle = TRUE)

> warnings()
Warning messages:
1: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
2: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
3: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
4: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
5: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
6: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
7: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
8: call dbDisconnect() when finished working with a connection
9: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
10: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
11: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
12: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
13: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
14: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
15: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
16: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
17: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
18: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
19: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
20: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] AnnotationForge_1.24.0 AnnotationDbi_1.44.0   IRanges_2.16.0
[4] S4Vectors_0.20.1       Biobase_2.42.0         BiocGenerics_0.28.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0      XML_3.98-1.17   digest_0.6.18   bitops_1.0-6
 [5] DBI_1.0.0       RSQLite_2.1.1   blob_1.1.1      bit64_0.9-7
 [9] RCurl_1.95-4.11 bit_1.1-14      compiler_3.5.2  pkgconfig_2.0.2
[13] memoise_1.1.0

dvantwisk commented 4 years ago

Hi,

Apologies for the wait on this. Yes, b2gfar.org and its resources are no longer available, and it is not clear whether there is an alternative location to obtain this data. We are looking into possible alternatives.