Bioconductor / AnnotationForge

Tools for building SQLite-based annotation data packages
https://bioconductor.org/packages/AnnotationForge

Error accessing url #20

Closed npokorzynski closed 2 years ago

npokorzynski commented 2 years ago

Hi,

I'm trying to build an OrgDb object following the vignette:

makeOrgPackageFromNCBI(version = "0.1", author = "Nick D. Pokorzynski <nick.pokorzynski@yale.edu>", maintainer = "Nick D. Pokorzynski <nick.pokorzynski@yale.ed>", outputDir = ".", tax_id = "588858", genus = "Salmonella", species = "enterica sv. Typhimurium 14028s", rebuildCache = TRUE)

Yet every time I run this code, after checking for validity of package, etc., I get the following error:

If files are not cached locally this may take awhile to assemble a 12 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.
preparing data from NCBI ...
starting download for
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
Error in .tryDL(url, tmp) : url access failed after 4 attempts; url: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz

I know that a similar issue has been reported before (e.g. #11), but there doesn't seem to be any actual solution other than waiting for the connection to improve. If there is one, please advise. Any help is appreciated.

Thanks in advance!

For reference:

traceback()
7: stop(paste(strwrap(msg, exdent = 2), collapse = "\n"))
6: .tryDL(url, tmp)
5: .downloadData(files[i], tax_id, NCBIFilesDir = NCBIFilesDir, rebuildCache = rebuildCache, verbose = verbose)
4: .makeBaseDBFromDLs(files, tax_id, NCBIcon, NCBIFilesDir, rebuildCache, verbose)
3: prepareDataFromNCBI(tax_id, NCBIFilesDir, outputDir, rebuildCache, verbose)
2: NEW_makeOrgPackageFromNCBI(version, maintainer, author, outputDir, tax_id, genus, species, NCBIFilesDir, databaseOnly, rebuildCache = rebuildCache, verbose = verbose)
1: makeOrgPackageFromNCBI(version = "0.1", author = "Nick D. Pokorzynski <nick.pokorzynski@yale.edu>", maintainer = "Nick D. Pokorzynski <nick.pokorzynski@yale.ed>", outputDir = ".", tax_id = "588858", genus = "Salmonella", species = "enterica sv. Typhimurium 14028s", rebuildCache = TRUE)

`sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.5

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] AnnotationHub_3.0.1 BiocFileCache_2.0.0 dbplyr_2.1.1
[4] AnnotationForge_1.34.0 pathview_1.32.0 GOSemSim_2.18.1
[7] enrichplot_1.12.2 org.Hs.eg.db_3.13.0 AnnotationDbi_1.54.1
[10] clusterProfiler_4.0.3 DOSE_3.18.1 ggridges_0.5.3
[13] ggnewscale_0.4.5 EnhancedVolcano_1.10.0 ggrepel_0.9.1
[16] ggfortify_0.4.12 ggplot2_3.3.5 dplyr_1.0.7
[19] plyr_1.8.6 tibble_3.1.3 DESeq2_1.32.0
[22] SummarizedExperiment_1.22.0 Biobase_2.52.0 MatrixGenerics_1.4.2
[25] matrixStats_0.60.0 GenomicRanges_1.44.0 GenomeInfoDb_1.28.1
[28] IRanges_2.26.0 S4Vectors_0.30.0 BiocGenerics_0.38.0

loaded via a namespace (and not attached):
[1] shadowtext_0.0.8 fastmatch_1.1-3 igraph_1.2.6
[4] lazyeval_0.2.2 splines_4.1.1 BiocParallel_1.26.1
[7] digest_0.6.27 htmltools_0.5.1.1 viridis_0.6.1
[10] GO.db_3.13.0 fansi_0.5.0 magrittr_2.0.1
[13] memoise_2.0.0 Biostrings_2.60.2 annotate_1.70.0
[16] graphlayouts_0.7.1 extrafont_0.17 extrafontdb_1.0
[19] colorspace_2.0-2 rappdirs_0.3.3 blob_1.2.2
[22] crayon_1.4.1 RCurl_1.98-1.3 jsonlite_1.7.2
[25] graph_1.70.0 scatterpie_0.1.6 genefilter_1.74.0
[28] survival_3.2-12 ape_5.5 glue_1.4.2
[31] polyclip_1.10-0 gtable_0.3.0 zlibbioc_1.38.0
[34] XVector_0.32.0 DelayedArray_0.18.0 proj4_1.0-10.1
[37] Rgraphviz_2.36.0 Rttf2pt1_1.3.9 maps_3.3.0
[40] scales_1.1.1 DBI_1.1.1 Rcpp_1.0.7
[43] viridisLite_0.4.0 xtable_1.8-4 tidytree_0.3.4
[46] bit_4.0.4 httr_1.4.2 fgsea_1.18.0
[49] RColorBrewer_1.1-2 ellipsis_0.3.2 pkgconfig_2.0.3
[52] XML_3.99-0.6 farver_2.1.0 locfit_1.5-9.4
[55] utf8_1.2.2 later_1.2.0 tidyselect_1.1.1
[58] labeling_0.4.2 rlang_0.4.11 reshape2_1.4.4
[61] BiocVersion_3.13.1 munsell_0.5.0 tools_4.1.1
[64] cachem_1.0.5 downloader_0.4 cli_3.0.1
[67] generics_0.1.0 RSQLite_2.2.7 stringr_1.4.0
[70] fastmap_1.1.0 yaml_2.2.1 ggtree_3.0.3
[73] bit64_4.0.5 tidygraph_1.2.0 purrr_0.3.4
[76] KEGGREST_1.32.0 ggraph_2.0.5 nlme_3.1-152
[79] mime_0.11 ash_1.0-15 ggrastr_0.2.3
[82] KEGGgraph_1.52.0 aplot_0.0.6 DO.db_2.9
[85] compiler_4.1.1 rstudioapi_0.13 interactiveDisplayBase_1.30.0
[88] filelock_1.0.2 curl_4.3.2 beeswarm_0.4.0
[91] png_0.1-7 treeio_1.16.1 tweenr_1.0.2
[94] geneplotter_1.70.0 stringi_1.7.3 ggalt_0.4.0
[97] lattice_0.20-44 Matrix_1.3-4 vctrs_0.3.8
[100] pillar_1.6.2 lifecycle_1.0.0 BiocManager_1.30.16
[103] data.table_1.14.0 cowplot_1.1.1 bitops_1.0-7
[106] httpuv_1.6.1 patchwork_1.1.1 qvalue_2.24.0
[109] R6_2.5.0 promises_1.2.0.1 KernSmooth_2.23-20
[112] gridExtra_2.3 vipor_0.4.5 MASS_7.3-54
[115] assertthat_0.2.1 withr_2.4.2 GenomeInfoDbData_1.2.6
[118] grid_4.1.1 tidyr_1.1.3 rvcheck_0.1.8
[121] ggforce_0.3.3 shiny_1.6.0 ggbeeswarm_0.6.0`

`BiocManager::valid("AnnotationForge")
'getOption("repos")' replaces Bioconductor standard repositories, see '?repositories' for details

replacement repositories:
CRAN: https://cran.rstudio.com/

[1] TRUE`

`BiocManager::install("AnnotationForge")
'getOption("repos")' replaces Bioconductor standard repositories, see '?repositories' for details

replacement repositories:
CRAN: https://cran.rstudio.com/

Bioconductor version 3.13 (BiocManager 1.30.16), R 4.1.1 (2021-08-10)
Warning message:
package(s) not installed when version(s) same as current; use force = TRUE to re-install: 'AnnotationForge'`

vjcitn commented 2 years ago

It seems to me this specific failure is network related. I tried it and got past the gene2pubmed download:

> makeOrgPackageFromNCBI(version = "0.1", author = "Nick D. Pokorzynski <nick.pokorzynski@yale.edu>", maintainer = "Nick D. Pokorzynski <nick.pokorzynski@yale.ed>", outputDir = ".", tax_id = "588858", genus = "Salmonella", species = "enterica sv. Typhimurium 14028s", rebuildCache = TRUE)
If files are not cached locally this may take awhile to assemble a 12 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
rebuilding the cache

But it is taking a long time ...
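
If the cause is a slow or flaky connection, one thing that may help is raising R's download timeout (the `timeout` option, 60 seconds by default) and retrying with a pause between attempts. The following is only a minimal sketch under assumptions: `retry_download()` and all of its arguments are hypothetical and not part of AnnotationForge, the `https://` address in the usage comment is an assumed mirror of the `ftp://` URL from the error message, and I have not verified whether a file fetched this way would be picked up from `NCBIFilesDir`.

```r
# Hypothetical helper (not part of AnnotationForge): retry a download with a
# longer timeout and a pause between attempts.
retry_download <- function(url, destfile, attempts = 4, wait = 5,
                           timeout = 600,
                           downloader = utils::download.file) {
  # Raise the timeout for the duration of this call only, then restore it.
  old <- options(timeout = timeout)
  on.exit(options(old), add = TRUE)

  for (i in seq_len(attempts)) {
    status <- tryCatch(
      downloader(url, destfile, mode = "wb", quiet = TRUE),
      error = function(e) {
        message("attempt ", i, " failed: ", conditionMessage(e))
        1L  # treat an error like a non-zero exit status
      }
    )
    # download.file() returns 0 on success.
    if (identical(as.integer(status), 0L)) return(invisible(destfile))
    if (i < attempts) Sys.sleep(wait)
  }
  stop("download failed after ", attempts, " attempts: ", url)
}

# Usage (commented out because it needs network access; the URL is an
# assumption based on the path in the error message):
# retry_download("https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz",
#                file.path(tempdir(), "gene2pubmed.gz"))
```

The `downloader` argument exists only so the retry logic can be exercised without touching the network; in normal use the default `utils::download.file` is what runs.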

npokorzynski commented 2 years ago

@vjcitn Yes, it ended up being a network issue. I did eventually get the OrgDb object to build, but I was never able to use it. If I recall correctly, it was populated with very strange identifiers: it had no gene names or symbols, only GIDs. There were also only about two hundred GIDs, for a species with 5000+ genes, and none of the numbers seemed to refer to anything associated with this specific genome (at least that I could find). Nothing I found online shed light on why that might be the case. I'd love to get this working, though.

vjcitn commented 2 years ago

@mrjc42 might you have a look?

vjcitn commented 2 years ago

@npokorzynski Would you have a look at the NCBI records directly? Sometimes they are just inadequate or broken. If it looks like our infrastructure is to blame, please reopen this issue.
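
One way to look at the NCBI records directly is to count how many rows a downloaded gene file holds for the taxonomy ID in question. A minimal sketch, assuming a local copy of a file such as `gene_info.gz` that is tab-delimited with the taxonomy ID in the first column and a header line starting with `#`; `count_taxid_rows()` is a hypothetical helper, not an AnnotationForge function.

```r
# Hypothetical helper: count the rows for one taxonomy ID in a gzipped,
# tab-delimited NCBI gene file, reading in chunks to keep memory use low.
# Assumes the tax_id is the first column of every data row.
count_taxid_rows <- function(path, tax_id, chunk_size = 100000L) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con), add = TRUE)

  # Match on "<tax_id><TAB>" so that 588858 does not also match 5888580.
  prefix <- paste0(tax_id, "\t")
  n <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    n <- n + sum(startsWith(lines, prefix))
  }
  n
}

# Usage, assuming gene_info.gz has already been downloaded:
# count_taxid_rows("gene_info.gz", "588858")
```

If the count is far below the expected number of genes, the shortfall is likely in the NCBI records for that taxonomy ID rather than in the package build.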