Bioconductor / AnnotationForge

Tools for building SQLite-based annotation data packages
https://bioconductor.org/packages/AnnotationForge

Error when running makeOrgPackageFromNCBI() from vignette #11

Closed janstrauss1 closed 4 years ago

janstrauss1 commented 4 years ago

Hi,

I'm trying to build my own organism annotation package for Plasmodium falciparum (tax_id = "36329") using makeOrgPackageFromNCBI() as described at https://support.bioconductor.org/p/118443/.

Unfortunately, I keep receiving various error messages, so I tried the exact function call outlined in the Bioconductor vignette, which builds an organism package for the zebra finch.

> makeOrgPackageFromNCBI(version = "0.1",
+                        author = "Some One <so@someplace.org>",
+                        maintainer = "Some One <so@someplace.org>",
+                        outputDir = ".",
+                        tax_id = "59729",
+                        genus = "Taeniopygia",
+                        species = "guttata"
+                        )

Unfortunately, even this throws an error indicating that URL access to NCBI fails for gene2accession.gz:

If files are not cached locally this may take awhile to assemble a 12 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
rebuilding the cache
Error in .tryDL(url, tmp) : url access failed after
4
attempts; url:
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz
In addition: Warning message:
In result_fetch(res@ptr, n = n) :
  SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().

Yet, checking the ftp site at ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ manually, gene2accession.gz appears to be there.
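
For anyone hitting the same error, the check can also be run from R. The following is a minimal sketch, not the code AnnotationForge uses internally: `download_with_retry()` is a hypothetical helper, and the HTTPS address is an assumption (NCBI mirrors its FTP tree over HTTPS). Note that gene2accession.gz is a large file, so R's default download timeout of 60 seconds may need to be raised first.

```r
# Hypothetical helper: retry a download a few times before giving up.
# `fetch` defaults to utils::download.file but can be swapped out for testing.
download_with_retry <- function(url, destfile, attempts = 4, wait = 5,
                                fetch = utils::download.file) {
  for (i in seq_len(attempts)) {
    # download.file() returns 0 on success; errors are mapped to a failure code.
    status <- tryCatch(
      fetch(url, destfile, mode = "wb", quiet = TRUE),
      error = function(e) 1L
    )
    if (identical(as.integer(status), 0L)) return(invisible(destfile))
    message("attempt ", i, " of ", attempts, " failed for ", url)
    if (i < attempts) Sys.sleep(wait)
  }
  stop("url access failed after ", attempts, " attempts; url: ", url)
}

# Assumption: the same file is reachable over HTTPS at this address.
ncbi_url <- "https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz"

if (interactive()) {
  options(timeout = 3600)  # large file; the default timeout is 60 seconds
  download_with_retry(ncbi_url, "gene2accession.gz")
}
```

If every attempt fails while the file is visible in a browser, a temporary server-side or network problem is the more likely cause than a missing file.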

Is this just a temporary error that I get? Do you get the same error?

I'd appreciate any feedback!

Thanks in advance,

Jan

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] AnnotationForge_1.28.0  GenomeInfoDb_1.22.1     biomaRt_2.42.1         
 [4] GO.db_3.10.0            org.Pf.plasmo.db_3.10.0 pkgconfig_2.0.3        
 [7] AnnotationDbi_1.48.0    IRanges_2.20.2          S4Vectors_0.24.4       
[10] Biobase_2.46.0          BiocGenerics_0.32.0    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6           pillar_1.4.3           compiler_3.6.3        
 [4] BiocManager_1.30.10    dbplyr_1.4.2           bitops_1.0-6          
 [7] prettyunits_1.1.1      tools_3.6.3            progress_1.2.2        
[10] digest_0.6.25          bit_1.1-15.2           lifecycle_0.2.0       
[13] tibble_3.0.0           RSQLite_2.2.0          memoise_1.1.0         
[16] BiocFileCache_1.10.2   rlang_0.4.5            cli_2.0.2             
[19] DBI_1.1.0              curl_4.3               GenomeInfoDbData_1.2.2
[22] stringr_1.4.0          httr_1.4.1             dplyr_0.8.5           
[25] rappdirs_0.3.1         vctrs_0.2.4            askpass_1.1           
[28] hms_0.5.3              tidyselect_1.0.0       bit64_0.9-7           
[31] glue_1.4.0             R6_2.4.1               fansi_0.4.1           
[34] XML_3.99-0.3           purrr_0.3.3            blob_1.2.1            
[37] magrittr_1.5           ellipsis_0.3.0         assertthat_0.2.1      
[40] stringi_1.4.6          RCurl_1.98-1.1         openssl_1.4.1         
[43] crayon_1.3.4          

janstrauss1 commented 4 years ago

I've tried to solve the issue by manually downloading the following files from the NCBI ftp server to my local working directory:

[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz

I then re-ran the makeOrgPackageFromNCBI() function call from the vignette, this time with rebuildCache = FALSE:

> makeOrgPackageFromNCBI(version = "0.1",
+                        author = "Some One <so@someplace.org>",
+                        maintainer = "Some One <so@someplace.org>",
+                        outputDir = ".",
+                        tax_id = "59729",
+                        genus = "Taeniopygia",
+                        species = "guttata",
+                        rebuildCache = FALSE
+                        )

Unfortunately, this still throws an error, and no organism package seems to be built:

preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
extracting data for our organism from : gene_info
getting data for gene2go.gz
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Error: no such table: altGO_date
In addition: Warning messages:
1: In result_fetch(res@ptr, n = n) :
  SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().
2: In result_fetch(res@ptr, n = n) :
  SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().
3: In result_fetch(res@ptr, n = n) :
  SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().
4: In result_fetch(res@ptr, n = n) :
  SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().
5: In result_fetch(res@ptr, n = n) :
  SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().

Obviously some table is missing. I'm not sure whether this is due to a lack of organism-specific information or to a more general issue.

Many thanks in advance for your feedback,

Jan

lshep commented 4 years ago

We were experiencing some intermittent connectivity issues this past week with NCBI and Ensembl, which I think would explain the first reported error. I believe the second error (no such table: altGO_date) results from trying to use a cache that didn't complete properly and wasn't fully populated.
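
A quick way to test the incomplete-cache explanation is to look inside the cache database. This is a minimal sketch under two assumptions: that the cache sits in the working directory (the default for NCBIFilesDir) and that it uses the default file name NCBI.sqlite. `missing_cache_tables()` is a hypothetical helper, not part of AnnotationForge.

```r
library(DBI)

# Assumed location and name of the cache database; adjust if NCBIFilesDir or
# databaseFile were set to something else in the original call.
cache_file <- file.path(getwd(), "NCBI.sqlite")

# Hypothetical helper: report which expected tables are absent from a SQLite file.
missing_cache_tables <- function(db_path, expected) {
  con <- dbConnect(RSQLite::SQLite(), db_path)
  on.exit(dbDisconnect(con), add = TRUE)
  setdiff(expected, dbListTables(con))
}

if (file.exists(cache_file)) {
  # "altGO_date" is the table named in the error message above.
  print(missing_cache_tables(cache_file, "altGO_date"))
  # If the table is missing, removing the file should force a clean rebuild
  # on the next run with rebuildCache = TRUE:
  # file.remove(cache_file)
}
```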

Following up on what you posted in the AnnotationDbi issue: before trying again, could you do the following, since I am currently unable to reproduce this?

Can you please check the results of BiocManager::valid() and, if needed, run BiocManager::install(), selecting 'a' to update any out-of-date packages? Once you have a valid installation with updated packages, please try again. If you receive the same error, could you please provide the output of traceback()?
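
The check described above can be sketched as follows. `describe_valid()` is a hypothetical helper; it relies on my understanding that BiocManager::valid() returns TRUE for a consistent installation and otherwise a list with `out_of_date` and `too_new` elements.

```r
# Hypothetical helper: summarise the value returned by BiocManager::valid().
describe_valid <- function(v) {
  if (isTRUE(v)) return("installation is valid")
  sprintf("%d out-of-date and %d too-new package(s)",
          NROW(v$out_of_date), NROW(v$too_new))
}

if (interactive() && requireNamespace("BiocManager", quietly = TRUE)) {
  message(describe_valid(BiocManager::valid()))
  # If packages are out of date, BiocManager::install() offers to update them.
}

# After an error at the console, traceback() prints the call stack that led
# to it, which is the output requested above.
```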

We will figure this out and get it running for you!

janstrauss1 commented 4 years ago

Dear @lshep,

Many thanks for your help and for looking into this! All my packages are up to date (and should have been previously):

> BiocManager::install()
Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.3 (2020-02-29)
> BiocManager::valid()
[1] TRUE

So I re-ran the example from the vignette, which now appears to run fine apart from some warnings:

> makeOrgPackageFromNCBI(version = "0.1",
+                        author = "Some One <so@someplace.org>",
+                        maintainer = "Some One <so@someplace.org>",
+                        # outputDir = ".",
+                        tax_id = "59729",
+                        genus = "Taeniopygia",
+                        species = "guttata",
+                        rebuildCache = TRUE
+                        )
If files are not cached locally this may take awhile to assemble a 12 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
rebuilding the cache
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
rebuilding the cache
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
rebuilding the cache
extracting data for our organism from : gene_info
getting data for gene2go.gz
rebuilding the cache
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Please be patient while we work out which organisms can be annotated with ensembl IDs.
processing ensembl gene id data
making the OrgDb package ...
Populating genes table:
genes table filled
Populating pubmed table:
pubmed table filled
Populating chromosomes table:
chromosomes table filled
Populating gene_info table:
gene_info table filled
Populating entrez_genes table:
entrez_genes table filled
Populating alias table:
alias table filled
Populating refseq table:
refseq table filled
Populating accessions table:
accessions table filled
Populating go table:
go table filled
Populating ensembl table:
ensembl table filled
table metadata filled
'select()' returned many:1 mapping between keys and columns
Dropping GO IDs that are too new for the current GO.db
Populating go table:
go table filled
Populating go_bp table:
go_bp table filled
Populating go_cc table:
go_cc table filled
Populating go_mf table:
go_mf table filled
'select()' returned many:1 mapping between keys and columns
Populating go_bp_all table:
go_bp_all table filled
Populating go_cc_all table:
go_cc_all table filled
Populating go_mf_all table:
go_mf_all table filled
Populating go_all table:
go_all table filled
Creating package in /Users/my/working/directory/org.Tguttata.eg.db 
Now deleting temporary database file
complete!
[1] "org.Tguttata.eg.sqlite"
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In result_fetch(res@ptr, n = n) :
  SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().
2: call dbDisconnect() when finished working with a connection
3: In result_fetch(res@ptr, n = n) :
  SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().
4: In result_fetch(res@ptr, n = n) :
  SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().
5: In result_fetch(res@ptr, n = n) :
  SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().
6: In result_fetch(res@ptr, n = n) :
  SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().
...

I can then successfully install the package by running:

> install.packages("./org.Tguttata.eg.db", type = "source", repos=NULL)
* installing *source* package ‘org.Tguttata.eg.db’ ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (org.Tguttata.eg.db)

It seems to have been an intermittent connectivity issue, as you suggested.

Many thanks for your help!

Jan

P.S.: I will now try to build my own organism annotation package for Plasmodium falciparum (tax_id = "36329")
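
For reference, the equivalent call for Plasmodium falciparum might look like the sketch below. It reuses the placeholder author and maintainer strings from the vignette call and the tax_id quoted above; the `run_build` guard is my own addition so the slow, network-heavy build only starts on request.

```r
# Arguments for the Plasmodium falciparum build.
pf_args <- list(
  version    = "0.1",
  author     = "Some One <so@someplace.org>",      # placeholder from the vignette
  maintainer = "Some One <so@someplace.org>",      # placeholder from the vignette
  outputDir  = ".",
  tax_id     = "36329",
  genus      = "Plasmodium",
  species    = "falciparum"
)

# Hypothetical guard: set to TRUE to actually start the build.
run_build <- FALSE
if (run_build && requireNamespace("AnnotationForge", quietly = TRUE)) {
  do.call(AnnotationForge::makeOrgPackageFromNCBI, pf_args)
}
```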

lshep commented 4 years ago

Awesome. I'll close this for now; if you have trouble making your custom one, feel free to reopen and we can work through it. Cheers,