Bioconductor / AnnotationForge

Tools for building SQLite-based annotation data packages
https://bioconductor.org/packages/AnnotationForge

connection error while running makeOrgPackageFromNCBI() #22

Closed Easterwoman closed 2 years ago

Easterwoman commented 2 years ago

Hi!

I'm super excited to attempt my analysis with your package! However, I've run into a quite irritating error. Every time I try to run makeOrgPackageFromNCBI() I get a reading-related error and a warning saying:

In addition: Warning message: call dbDisconnect() when finished working with a connection

This last warning leads me to believe it is a connection error with NCBI. I've run the function a number of times and it fails at different points, all related to the cache. I have tried both letting the function download the files and downloading them manually via NCBI FTP and then rebuilding the cache.

These are some of the error messages I've received:

preparing data from NCBI ...
starting download for
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
Error: no such table: main.gene2pubmed

I've gotten Error: no such table at different points, but because they seem to appear at random I can't reproduce them.

If files are not cached locally this may take awhile to assemble a 12 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.
preparing data from NCBI ...
starting download for
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
rebuilding the cache
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
rebuilding the cache
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
rebuilding the cache
extracting data for our organism from : gene_info
getting data for gene2go.gz
rebuilding the cache
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Error in function (type, msg, asError = TRUE) :
  error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version
In addition: Warning message:
call dbDisconnect() when finished working with a connection

I have the latest version of R and I've updated all my packages:

platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
system         x86_64, mingw32
status
major          4
minor          1.1
year           2021
month          08
day            10
svn rev        80725
language       R
version.string R version 4.1.1 (2021-08-10)
nickname       Kick Things

As of now, it feels like I'm playing the lotto of just trying and trying until it works. So my question is:

Am I right about the connection error? If I have a connection error is it possible to start the function from where it left off or do I need to start from scratch every time?

Thankful for any help! This is my very first post on GitHub so if I missed some information I will gladly add it.

phancanhtrinh commented 2 years ago

I am also experiencing this problem. It is so hard :(

I just ran code taken from a recent Nature paper: https://github.com/RoundLab/Ost_CandidaRNASeq However, I got a lot of errors with AnnotationForge. Would you please help me? Yesterday, I had another error on macOS. Today I reran it on Windows and got a different error. :(

    library("biomaRt")
    library("GenomeInfoDb")
    library(AnnotationDbi)
    library(Biobase)
    library(AnnotationForge)
    library("GO.db")

    path1 <- tcltk::tk_choose.dir(getwd(), "Choose the folder for Analysis")
    # A window will pop up and ask you to select the folder containing the data
    setwd(path1)
    path1

    makeOrgPackageFromNCBI(version = "0.1",
                           author = "Trinh Phan-Canh <phan@univie.ac.at>",
                           maintainer = "Trinh Phan-Canh <phan@univie.ac.at>",
                           outputDir = "data",
                           tax_id = "237561", genus = "Candida", species = "albicans")

If files are not cached locally this may take awhile to assemble a 12 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
extracting data for our organism from : gene_info
getting data for gene2go.gz
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Error in function (type, msg, asError = TRUE)  : 
  error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version

Easterwoman commented 2 years ago

Hi,

Great that I'm not the only one, and that you managed to submit your code correctly. heh.

I'm wondering if it's an update issue. I've seen people have similar issues when they copy a GitHub repository and they usually only need to update some GitHub thing. Do you also have R 4.1.1, Bioconductor 3.13, and all packages updated?

phancanhtrinh commented 2 years ago

"Do you also have R 4.1.1, Bioconductor 3.13, and all packages updated?" --> Yes, I use the same versions!

jmacdon commented 2 years ago

@Easterwoman The message you see about dbDisconnect is a consequence of the error, not its cause. makeOrgPackageFromNCBI creates a SQLite database with all the data that are downloaded, and if it errors out before finishing you get a complaint because the connection to the SQLite database was never closed properly.
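
To illustrate the point about the warning, here is a minimal sketch of the general pattern. It is not AnnotationForge's actual code: `with_sqlite()` is a hypothetical helper, and the in-memory database and the query are made up for the example. It assumes the DBI and RSQLite packages are installed. Registering the disconnect with `on.exit()` means the connection is closed even when the work fails, so the query error is reported without a trailing dbDisconnect warning.

```r
library(DBI)

# Hypothetical helper: open a throwaway SQLite database, run `work(con)`,
# and always disconnect afterwards, whether or not `work` raised an error.
with_sqlite <- function(work) {
  con <- dbConnect(RSQLite::SQLite(), ":memory:")
  on.exit(dbDisconnect(con), add = TRUE)
  work(con)
}

# Query a table that does not exist, similar to the "no such table" failures
# reported above, and capture the error message instead of stopping.
res <- tryCatch(
  with_sqlite(function(con) dbGetQuery(con, "SELECT * FROM gene2pubmed")),
  error = function(e) conditionMessage(e)
)
print(res)
```

The error itself still has to be fixed; the cleanup only removes the secondary warning that made it look like a connection problem.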

You are correct that the function is fragile, but that's the difficulty of downloading and parsing data programmatically. We make all sorts of assumptions about the location, form, and availability of the data, and if there is any problem connecting then the function will fail.
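
For failures that really are intermittent, rerunning by hand can be replaced with a small retry wrapper. The sketch below is an assumption-laden illustration: `retry()` and `flaky()` are hypothetical names, not part of AnnotationForge, and the simulated failure stands in for a dropped connection.

```r
# Hypothetical helper: call `f()` up to `tries` times, pausing `wait` seconds
# between failed attempts, and re-raise the last error if every attempt fails.
retry <- function(f, tries = 3, wait = 5) {
  for (i in seq_len(tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("attempt ", i, " failed: ", conditionMessage(result))
    if (i < tries) Sys.sleep(wait)
  }
  stop(result)
}

# Stand-in for a call that fails twice with a simulated connection error
# and then succeeds.
attempts <- 0
flaky <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) stop("simulated connection failure")
  "done"
}

out <- retry(flaky, tries = 5, wait = 0)
print(out)
```

In practice the wrapped call would be something like `retry(function() makeOrgPackageFromNCBI(...))` with your own arguments. Going by the function's own message that subsequent calls should be faster once files are cached, a retry should not need to repeat every download, though that is not guaranteed. A retry does nothing for an error that occurs on every run, such as the TLS failure in this thread.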

The issue with TLS v1 has to do with the fact that you are both on Windows, and there appears to be a problem with the Windows binary for RCurl. The reasons are complicated and boring, so I won't go into them here.
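
Since the error is raised during the TLS handshake, it can be informative to see which SSL/TLS library each HTTP client on your machine reports. This is only a diagnostic sketch and not the fix that was pushed. It uses base R's `libcurlVersion()` and, if RCurl is installed, `RCurl::curlVersion()`; the variable names are arbitrary.

```r
# SSL/TLS library reported by base R's own libcurl
# (the one used by download.file(method = "libcurl")).
base_ssl <- attr(libcurlVersion(), "ssl_version")

# SSL/TLS library reported by the installed RCurl binary, if RCurl is available.
rcurl_ssl <- if (requireNamespace("RCurl", quietly = TRUE)) {
  RCurl::curlVersion()$ssl_version
} else {
  NA_character_
}

print(c(base_R = base_ssl, RCurl = rcurl_ssl))
```

If the two report different libraries or versions, that at least shows the clients were built differently. It does not by itself prove which one is responsible for the failed handshake.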

Anyway, I have fixed the issue and pushed the changes to both release and devel. They should be available within the next 24-48 hours. Here is what I get on Windows:

> makeOrgPackageFromNCBI(version = "0.1", 
                        author = "Trinh Phan-Canh <phan@univie.ac.at>", 
                        maintainer = "Trinh Phan-Canh <phan@univie.ac.at>", 
                        outputDir = ".", 
                        tax_id = "237561", genus = "Candida", species =  "albicans")

If files are not cached locally this may take awhile to assemble a 12 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
rebuilding the cache
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
rebuilding the cache
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
rebuilding the cache
extracting data for our organism from : gene_info
getting data for gene2go.gz
rebuilding the cache
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Please be patient while we work out which organisms can be annotated
  with ensembl IDs.
making the OrgDb package ...
Populating genes table:
genes table filled
Populating pubmed table:
pubmed table filled
Populating chromosomes table:
chromosomes table filled
Populating gene_info table:
gene_info table filled
Populating entrez_genes table:
entrez_genes table filled
Populating alias table:
alias table filled
Populating refseq table:
refseq table filled
Populating accessions table:
accessions table filled
Populating go table:
go table filled
table metadata filled

'select()' returned many:1 mapping between keys and columns
Dropping GO IDs that are too new for the current GO.db
Populating go table:
go table filled
Populating go_bp table:
go_bp table filled
Populating go_cc table:
go_cc table filled
Populating go_mf table:
go_mf table filled
'select()' returned many:1 mapping between keys and columns
Populating go_bp_all table:
go_bp_all table filled
Populating go_cc_all table:
go_cc_all table filled
Populating go_mf_all table:
go_mf_all table filled
Populating go_all table:
go_all table filled
Creating package in ./org.Calbicans.eg.db 
Now deleting temporary database file
complete!
[1] "org.Calbicans.eg.sqlite"

If you are impatient you can install from my GitHub:

library(BiocManager)
install("jmacdon/AnnotationForge", ref = "RELEASE_3_13")

jmacdon commented 2 years ago

@phancanhtrinh I forgot to add you to my comment. Please come to the GitHub page to read it.

Easterwoman commented 2 years ago

Thank you so much @jmacdon! Looking forward to trying it.

Easterwoman commented 2 years ago

I have now successfully made the package, installed it and loaded it, so I'll close the thread. Thank you again.
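
For anyone landing here later, the install-and-load step can be sketched as follows. It assumes the package was written to ./org.Calbicans.eg.db, as in the successful log above (`outputDir = "."`). Adjust `pkg_dir` if you used a different `outputDir`. The `dir.exists()` guard is only there so the snippet does nothing harmful when the package has not been built yet.

```r
# Path taken from "Creating package in ./org.Calbicans.eg.db" in the log above.
pkg_dir <- file.path(".", "org.Calbicans.eg.db")

if (dir.exists(pkg_dir)) {
  # Install the generated source package from the local directory.
  install.packages(pkg_dir, repos = NULL, type = "source")
  library(org.Calbicans.eg.db)
  # List the annotation columns available in the new OrgDb object.
  print(columns(org.Calbicans.eg.db))
} else {
  message("No package directory at ", pkg_dir,
          "; run makeOrgPackageFromNCBI() first")
}
```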