Bioconductor / AnnotationForge

Tools for building SQLite-based annotation data packages
https://bioconductor.org/packages/AnnotationForge

Error in .addMapCounts() when using makeOrgPackageFromNCBI() with useDeprecatedStyle = TRUE #3

Closed: hexaflexa closed this issue 4 years ago

hexaflexa commented 5 years ago

makeOrgPackageFromNCBI() with useDeprecatedStyle = TRUE is potentially useful when other Bioconductor packages (that need an OrgDb annotation file for a non-model organism) access the underlying SQLite tables instead of using the newer AnnotationDbi select() interface.

There is a completely separate issue which occurs when data is not available in NCBI gene2go, but I'll address that separately. Here, I use a very common organism (Homo sapiens, tax_id = 9606) as a test case to demonstrate the error.

.getMapCounts() ends up with a 13x13 data.frame which should be 13x2. This causes the SQL insert query (which is expecting two parameters) to fail:

Error in result_bind(res@ptr, params) :
  Query requires 2 params; 13 supplied.
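
The mismatch can be reproduced outside AnnotationForge. The following is a minimal sketch, assuming only DBI and RSQLite are installed; the two-column `map_counts` schema and the `good`/`bad` data frames are simplified stand-ins for illustration, not the package's actual code. It passes the data frame's columns as bind parameters, the same way the `dbGetQuery(con, sql, unclass(unname(data)))` call in the traceback below does, so each column must correspond to one `?` placeholder:

```r
library(DBI)

# In-memory database with a simplified, assumed two-column map_counts table
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE map_counts (map_name TEXT, count INTEGER)")
sql <- "INSERT INTO map_counts VALUES (?, ?)"

# Correct shape: one data.frame column per placeholder
good <- data.frame(map_name = c("GENENAME", "SYMBOL"),
                   count = c(10L, 20L),
                   stringsAsFactors = FALSE)
dbExecute(con, sql, params = unname(as.list(good)))

# Wrong shape: an extra column means more parameters than placeholders,
# so the bind step fails before anything is inserted
bad <- cbind(good, extra = c("x", "y"), stringsAsFactors = FALSE)
msg <- tryCatch({
  dbExecute(con, sql, params = unname(as.list(bad)))
  NA_character_
}, error = function(e) conditionMessage(e))

# Only the rows from the well-shaped data.frame are present
n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM map_counts")$n
dbDisconnect(con)
```

Here `msg` holds the bind error raised for the three-column data frame, which is the same kind of failure as the 13-column case reported in this issue.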

A full log, including the traceback() and sessionInfo() output, is included below:

> library(AnnotationForge)
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
    colnames, colSums, dirname, do.call, duplicated, eval, evalq,
    Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
    lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int,
    pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames,
    rowSums, sapply, setdiff, sort, table, tapply, union, unique,
    unsplit, which, which.max, which.min

Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: AnnotationDbi
Loading required package: stats4
Loading required package: IRanges
Loading required package: S4Vectors

Attaching package: 'S4Vectors'

The following object is masked from 'package:base':

    expand.grid

Attaching package: 'IRanges'

The following object is masked from 'package:grDevices':

    windows

# A setwd() command to change the working directory is not included here
# since it will be different for each person
#
# I purposely set rebuildCache to FALSE because I already have the 
# data downloaded from NCBI and the rebuild takes a long time
# which I don't want during debugging
#
# I also purposely set useDeprecatedStyle = TRUE because 
# OrgDb packages created using the default useDeprecatedStyle = FALSE
# have fewer SQLite tables, different spellings of some table fieldnames,
# as well as different upper case vs lower case.  These differences can
# break some Bioconductor packages which don't use the select() interface 
# but instead access the underlying tables using dbConnect / dbGetQuery
#
# I know that org.Hs.eg.db exists, but I am purposely using tax_id = 9606 
# as a test case.  What I really want to do is to create deprecatedStyle=TRUE
# custom OrgDbs for non-model organisms, but I am encountering a number of problems.
# But if I cannot even create a custom org.Hsapiens.eg.db using 
# makeOrgPackageFromNCBI(), then it will be even harder for non-model organisms
#
> makeOrgPackageFromNCBI(version="0.0.1", 
                       author = "First Last Name <email@address.com>", 
                       maintainer = "First Last Name <email@address.com>", 
                       outputDir = ".", 
                       tax_id = "9606", 
                       genus = "Homo", 
                       species= "sapiens", 
                       rebuildCache = FALSE,
                       useDeprecatedStyle = TRUE)

If this is the 1st time you have run this function, it may take, a long time (over an hour) to download needed files and assemble a 12 GB cache databse in the NCBIFilesDir directory.  Subsequent calls to this function should be faster (seconds).  The cache will try to rebuild once per day.
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
Populating gene2pubmed table:
table gene2pubmed filled
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
Populating gene2accession table:
table gene2accession filled
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
Populating gene2refseq table:
table gene2refseq filled
getting data for gene2unigene
extracting data for our organism from : gene2unigene
getting all data for our organism from : gene2unigene
Populating gene2unigene table:
table gene2unigene filled
getting data for gene_info.gz
extracting data for our organism from : gene_info
Populating gene_info table:
table gene_info filled
getting data for gene2go.gz
extracting data for our organism from : gene2go
Populating gene2go table:
table gene2go filled

table metadata filled
table map_metadata filled
Populating genes table:
genes table filled
Populating gene_info_temp table:
gene_info_temp table filled
Populating alias table:
alias table filled
Populating chromosomes table:
chromosomes table filled
Populating pubmed table:
pubmed table filled
Populating refseq table:
refseq table filled
Populating accessions table:
accessions table filled
Populating unigene table:
unigene table filled
Dropping GO IDs that are too new for the current GO.db
Dropping GO IDs that are too new for the current GO.db
Dropping GO IDs that are too new for the current GO.db
Populating go_bp table:
go_bp table filled
Populating go_mf table:
go_mf table filled
Populating go_cc table:
go_cc table filled
Populating go_bp_all table:
go_bp_all table filled
Populating go_mf_all table:
go_mf_all table filled
Populating go_cc_all table:
go_cc_all table filled
dropping tablegene2pubmedgene2accessiongene2refseqgene2unigenegene_infogene2go
Making GO views

Error in result_bind(res@ptr, params) :
  Query requires 2 params; 13 supplied.
In addition: There were 50 or more warnings (use warnings() to see the first 50)

# NOTE: all of the warnings are things like this:
1: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries

> traceback()
16: stop(list(message = "Query requires 2 params; 13 supplied.",
        call = result_bind(res@ptr, params), cppstack = list(file = "",
            line = -1L, stack = "C++ stack not available on this system")))
15: result_bind(res@ptr, params)
14: db_bind(res, as.list(params), ..., allow_named_superset = FALSE)
13: dbBind(rs, params)
12: dbBind(rs, params)
11: .local(conn, statement, ...)
10: dbSendQuery(conn, statement, ...)
9: dbSendQuery(conn, statement, ...)
8: .local(conn, statement, ...)
7: dbGetQuery(con, sql, unclass(unname(data)))
6: dbGetQuery(con, sql, unclass(unname(data)))
5: .populateBaseTable(con, sql, data, "map_counts")
4: .addMapCounts(con, tax_id, genus, species)
3: makeOrgDbFromNCBI(tax_id = tax_id, genus = genus, species = species,
       NCBIFilesDir = NCBIFilesDir, outputDir, rebuildCache)
2: OLD_makeOrgPackageFromNCBI(version, maintainer, author, outputDir,
       tax_id, genus, species, NCBIFilesDir, rebuildCache = rebuildCache)
1: makeOrgPackageFromNCBI(version = "0.0.1", author = "First Last Name <email@address.com>",
       maintainer = "First Last Name <email@address.com>", outputDir = ".",
       tax_id = "9606", genus = "Homo", species = "sapiens", rebuildCache = FALSE,
       useDeprecatedStyle = TRUE)

# By interactive debugging of .addMapCounts(), the data.frame in data is 13 by 13,
# but I think it should be 13 by 2.  The 13 columns of data are what lead to the actual
# error in .populateBaseTable(), because the sql statement is expecting to insert
# two values at a time, not 13 values.

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] AnnotationForge_1.24.0 AnnotationDbi_1.44.0   IRanges_2.16.0
[4] S4Vectors_0.20.1       Biobase_2.42.0         BiocGenerics_0.28.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0      GO.db_3.7.0     XML_3.98-1.17   digest_0.6.18
 [5] bitops_1.0-6    DBI_1.0.0       RSQLite_2.1.1   blob_1.1.1
 [9] bit64_0.9-7     RCurl_1.95-4.11 bit_1.1-14      compiler_3.5.2
[13] pkgconfig_2.0.2 memoise_1.1.0