Bioconductor / AnnotationForge

Tools for building SQLite-based annotation data packages
https://bioconductor.org/packages/AnnotationForge

Error when running makeOrgPackage (unable to parse data tables) #5

Closed JackyHess closed 4 years ago

JackyHess commented 5 years ago

Hi there,

I'm trying to make a custom annotation package and for some reason makeOrgPackage is having trouble using my input files. Please see below for input file headers, command and error message.

Input files

> head(gene_table)
            GID        SYMBOL                                             GENENAME
1     ORTHOMCL0     ORTHOMCL0                 Inherit from KOG: transposon protein
2     ORTHOMCL1     ORTHOMCL1                             to reverse transcriptase
3    ORTHOMCL10    ORTHOMCL10                             to reverse transcriptase
4   ORTHOMCL100   ORTHOMCL100                                                 <NA>
5  ORTHOMCL1000  ORTHOMCL1000 Reverse transcriptase (RNA-dependent DNA polymerase)
6 ORTHOMCL10000 ORTHOMCL10000                                                 <NA>
> head(chr_track)
            GID CHROMOSOME
1     ORTHOMCL0      chr_1
2     ORTHOMCL1      chr_1
3    ORTHOMCL10      chr_1
4   ORTHOMCL100      chr_1
5  ORTHOMCL1000      chr_1
6 ORTHOMCL10000      chr_1
> head(GO_anno)
            GID         GO EVIDENCE
1 ORTHOMCL16377 GO:0015074      IEA
2 ORTHOMCL16377 GO:0003676      IEA
3  ORTHOMCL4510 GO:0046983      IEA
4  ORTHOMCL4511 GO:0044425      IEA
5  ORTHOMCL4511 GO:0009057      IEA
6  ORTHOMCL4511 GO:0004175      IEA

Command:

makeOrgPackage(gene_info=gene_table, chromosome=chr_track, go=GO_anno,
               version="0.1",
               maintainer="Some One <so@someplace.org>",
               author="Some One <so@someplace.org>",
               outputDir = ".",
               tax_id = "80594",
               genus="Austropaxillus",
               species="statuum",
               goTable="go")

Error message:

Error in attributes(.Data) <- c(attributes(.Data), attrib) : 
  'names' attribute [312311] must be the same length as the vector [1]
In addition: Warning messages:
1: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
2: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
3: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries

I can run the provided example just fine and am struggling to identify key differences between the example input and my input.

Any ideas what might be causing this error?
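
One way to narrow this down is to check the inputs against the layout makeOrgPackage() expects before calling it. The sketch below is a hypothetical helper, not part of AnnotationForge: check_org_tables() is an invented name; the GID-first, no-duplicated-rows, and GID/GO/EVIDENCE checks follow the makeOrgPackage() help page as best understood here; and the factor-column check is only a guess at a common pitfall (R 3.5 reads strings as factors by default). None of these is confirmed as the cause of the error above.

```r
# Hypothetical pre-flight check for the data.frames passed to makeOrgPackage().
# Returns a character vector of problems; an empty vector means nothing was found.
check_org_tables <- function(..., goTable = NA) {
  tables <- list(...)
  problems <- character(0)

  if (is.null(names(tables)) || any(names(tables) == "")) {
    problems <- c(problems, "every data.frame must be passed as a named argument")
  }

  for (nm in names(tables)) {
    df <- tables[[nm]]
    if (!is.data.frame(df)) {
      problems <- c(problems, paste0(nm, ": not a data.frame"))
      next
    }
    if (ncol(df) < 2 || colnames(df)[1] != "GID") {
      problems <- c(problems, paste0(nm, ": first column must be GID, followed by at least one more column"))
    }
    if (anyDuplicated(df) > 0) {
      problems <- c(problems, paste0(nm, ": contains duplicated rows"))
    }
    # Assumption: factor columns are a likely source of trouble, not a documented rule.
    if (any(vapply(df, is.factor, logical(1)))) {
      problems <- c(problems, paste0(nm, ": has factor columns; convert them with as.character()"))
    }
    if (anyNA(df$GID)) {
      problems <- c(problems, paste0(nm, ": GID contains NA"))
    }
  }

  if (!is.na(goTable)) {
    if (!goTable %in% names(tables)) {
      problems <- c(problems, paste0("goTable '", goTable, "' does not name any supplied data.frame"))
    } else if (!identical(colnames(tables[[goTable]]), c("GID", "GO", "EVIDENCE"))) {
      problems <- c(problems, paste0(goTable, ": columns must be exactly GID, GO, EVIDENCE"))
    }
  }

  problems
}
```

With the objects from this report, the call would be check_org_tables(gene_info = gene_table, chromosome = chr_track, go = GO_anno, goTable = "go"); any strings it returns point at inputs worth cleaning before rerunning makeOrgPackage().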

> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.4

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tidyr_0.8.3            AnnotationForge_1.24.0 clusterProfiler_3.10.1 KOGMWU_1.2             pheatmap_1.0.12        topGO_2.34.0           SparseM_1.77           GO.db_3.7.0           
 [9] AnnotationDbi_1.44.0   IRanges_2.16.0         S4Vectors_0.20.1       Biobase_2.42.0         graph_1.60.0           BiocGenerics_0.28.0   

loaded via a namespace (and not attached):
 [1] nlme_3.1-140        bitops_1.0-6        matrixStats_0.54.0  ggtree_1.14.6       enrichplot_1.2.0    bit64_0.9-7         RColorBrewer_1.1-2  progress_1.2.1      httr_1.4.0         
[10] UpSetR_1.3.3        tools_3.5.3         R6_2.4.0            DBI_1.0.0           lazyeval_0.2.2      colorspace_1.4-1    tidyselect_0.2.5    gridExtra_2.3       prettyunits_1.0.2  
[19] bit_1.1-14          compiler_3.5.3      xml2_1.2.0          scatterpie_0.1.2    triebeard_0.3.0     scales_1.0.0        ggridges_0.5.1      stringr_1.4.0       digest_0.6.18      
[28] DOSE_3.8.2          pkgconfig_2.0.2     rlang_0.3.4         rstudioapi_0.10     RSQLite_2.1.1       gridGraphics_0.3-0  farver_1.1.0        jsonlite_1.6        BiocParallel_1.16.6
[37] GOSemSim_2.8.0      dplyr_0.8.0.1       RCurl_1.95-4.12     magrittr_1.5        ggplotify_0.0.3     Matrix_1.2-17       Rcpp_1.0.1          munsell_0.5.0       ape_5.3            
[46] viridis_0.5.1       stringi_1.4.3       yaml_2.2.0          ggraph_1.0.2        MASS_7.3-51.4       plyr_1.8.4          qvalue_2.14.1       grid_3.5.3          blob_1.1.1         
[55] ggrepel_0.8.1       DO.db_2.9           crayon_1.3.4        lattice_0.20-38     cowplot_0.9.4       splines_3.5.3       hms_0.4.2           pillar_1.4.0        fgsea_1.8.0        
[64] igraph_1.2.4.1      reshape2_1.4.3      fastmatch_1.1-0     XML_3.98-1.19       glue_1.3.1          data.table_1.12.2   BiocManager_1.30.4  treeio_1.6.2        tweenr_1.0.1       
[73] urltools_1.7.3      gtable_0.3.0        purrr_0.3.2         polyclip_1.10-0     assertthat_0.2.1    ggplot2_3.1.1       ggforce_0.2.2       europepmc_0.3       tidytree_0.2.4     
[82] viridisLite_0.3.0   tibble_2.1.1        rvcheck_0.1.3       memoise_1.1.0

huizhen2014 commented 4 years ago

I have encountered the same error, can't figure out why !

abelew commented 4 years ago

I just decided to try to improve my OrgDb object by using the goTable= argument. Leaving it null avoids this error. I am chasing through AnnotationForge to hunt down the offending DBI call; at first glance, I am guessing it is in a combination of AnnotationForge:::.makeNewGOTables() and AnnotationForge:::.addOntologyData(). I haven't traced it fully yet, but if I were to guess, the error is coming from the dbGetQuery() call on the first line of .makeNewGOTables().

dvantwisk commented 4 years ago

We are looking into this issue now.

abelew commented 4 years ago

I decided to poke at this over the weekend. In my scenario, I found a table naming collision, so I renamed my table and that solved it. While tracing this, I replaced most of the dbGetQuery() calls in makeOrgPackage.R/makeOrgPackageFromNCBI.R with dbExecute() and removed some unneeded paste()s in the various SQL strings, so I now get only a couple of warnings when calling makeOrgPackage(). If these changes are of interest, I would be happy to send a PR or diff. The changes are a bit messy at the moment, though: I left all the original statements commented out in place in case my changes were incorrect, and I have not yet deleted them.
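
For readers unfamiliar with the distinction behind that swap, here is a minimal sketch of how DBI separates statements from queries. It assumes the DBI and RSQLite packages are installed and uses an in-memory database; statement_vs_query() is a made-up name, the table layout simply mirrors the GID/GO/EVIDENCE columns from the original report, and none of this is AnnotationForge's own code. The "Don't need to call dbFetch() for statements" warnings in the report appear to be what RSQLite emits when dbGetQuery() is given a statement that returns no rows.

```r
# Sketch only: requires DBI and RSQLite. Not taken from AnnotationForge.
statement_vs_query <- function() {
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  on.exit(DBI::dbDisconnect(con), add = TRUE)

  # Statements (CREATE, INSERT, DROP) return no rows, so use dbExecute();
  # it returns the number of rows affected.
  DBI::dbExecute(con, "CREATE TABLE go (GID TEXT, GO TEXT, EVIDENCE TEXT)")
  inserted <- DBI::dbExecute(
    con,
    "INSERT INTO go VALUES ('ORTHOMCL4510', 'GO:0046983', 'IEA'),
                           ('ORTHOMCL4511', 'GO:0009057', 'IEA')"
  )

  # Queries (SELECT) return rows, so use dbGetQuery().
  counted <- DBI::dbGetQuery(con, "SELECT COUNT(*) AS n FROM go")$n

  list(inserted = inserted, counted = counted)
}
```

Calling statement_vs_query() returns the number of inserted rows and the row count read back, without triggering the dbFetch() warning.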

dvantwisk commented 4 years ago

I'm glad to hear things are working a little better. I've been trying to replicate your problem, but am not able to with the given information. I'd be happy to look further into this if I could get a reproducible example. It looks like this is an issue that several people are having so I would like to make a change to fix this that we can push to the next version of AnnotationForge.

abelew commented 4 years ago

Oh, to replicate just make sure that there is an existing table named 'go' with some data in it.

dvantwisk commented 4 years ago

Just so we have a common example, can you write it in code?

Kayla-Morrell commented 4 years ago

@JackyHess - To reproduce and potentially fix this issue, it would be helpful to know what data you are using. I cannot run the code you provided without the data. If you could provide some insight into the data used, we can work on fixing this issue. Thanks!

abelew commented 4 years ago

TL;DR: I fixed it for myself, and the problem is obscure and minor. The only thing likely to be of use to you is that, while searching for it, I swapped out some deprecated SQLite calls for what I think are the currently recommended ones.

Here is the long version and I think it will show why I didn't just give you the exact data and code in my first query; it is in the middle of a long series of data collection tasks.

My problems cropped up when invoking the make_eupath_orgdb() function, available here:

https://github.com/abelew/EuPathDB/blob/master/R/make_eupath_orgdb.R

The goal is to create orgdb instances from the various webservices under the eupathdb.org umbrella. The error came about because of how I was creating the arguments for makeOrgPackage() (the invocation is at line 413, but the arguments are created at lines 250-263). Line 259 shows the cause of the error: if it says '"goTable" = "go"', then when makeOrgPackage() makes its own go table, there will be two of them, which does not end well. The simple solution for me was to change it (which I did). The slightly more complex solution would be to add some logic to makeOrgPackage() that checks for an existing table named 'go' and renames it, removes it, or warns about it, but that is almost certainly more trouble than it is worth.
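
The rename-and-warn idea could be sketched on the caller's side roughly as follows. This is a hypothetical helper based only on the collision described in this comment: rename_reserved_tables() is an invented name, and treating "go" as a name reserved inside makeOrgPackage() is an assumption, not something the AnnotationForge documentation states.

```r
# Hypothetical helper: rename input tables whose names collide with table names
# assumed to be created internally. 'reserved' and 'suffix' are illustrative defaults.
rename_reserved_tables <- function(tables, reserved = "go", suffix = "_input") {
  stopifnot(is.list(tables), !is.null(names(tables)))
  hits <- names(tables) %in% reserved
  if (any(hits)) {
    new_names <- paste0(names(tables)[hits], suffix)
    warning(
      "Renaming table(s) ", paste(names(tables)[hits], collapse = ", "),
      " to ", paste(new_names, collapse = ", "),
      " to avoid a possible name collision",
      call. = FALSE
    )
    names(tables)[hits] <- new_names
  }
  tables
}
```

After renaming, the goTable argument would have to point at the new name (for example goTable = "go_input"), and the list could be passed on with do.call(). Whether this avoids the error in every setup has not been verified here.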

For a truly reproducible example, one must change line 259 back to '"goTable" = "go"', install the package, and run the following (using my favorite species as an example):

meta <- download_eupath_metadata(webservice="tritrypdb")
lm_entry <- get_eupath_entry(species="Leishmania major", metadata=meta)
test_major <- make_eupath_orgdb(entry=lm_entry)

computbiol commented 4 years ago

I run into the same problem when I run the code.

Kayla-Morrell commented 4 years ago

@computbiol - Would you be able to provide a print out of the error that you get? This will help me better understand where you are having issues.

Also, have you thought of utilizing the Bioconductor org.At.tair.db annotation package? This would bypass the need to create your own.

computbiol commented 4 years ago

> @computbiol - Would you be able to provide a print out of the error that you get? This will help me better understand where you are having issues.
>
> Also, have you thought of utilizing the Bioconductor org.At.tair.db annotation package? This would bypass the need to create your own.

This is the error. I need to annotate a different species (GS115), so I want to create my own package.

Populating genes table:
genes table filled
Populating go table:
go table filled
Populating pub_info table:
pub_info table filled
Populating symbol_info table:
symbol_info table filled
Populating function_info table:
Error: NOT NULL constraint failed: function_info.SHORT_DESCRIPTION
In addition: There were 28 warnings (use warnings() to see them)
Execution halted

Kayla-Morrell commented 4 years ago

@computbiol - This seems to be a completely different issue. The originally reported issue deals with this error:

Error in attributes(.Data) <- c(attributes(.Data), attrib) : 
  'names' attribute [312311] must be the same length as the vector [1]

Your error message is different: it indicates that a NOT NULL constraint failed on the SHORT_DESCRIPTION column of the function_info table (your func_df data frame). I've made a slight modification to your code and was able to build the package with no issues. My changes are to lines 58-61 of the org.At.tair.db.R file; see them below:

# I would edit the DESCRIPTION first
# and be sure to drop the ';' from the DESCRIPTION as well
func_df$DESCRIPTION <- gsub(";\\(source:Araport11\\)","", func_df$DESCRIPTION)

# then do the ifelse() for SHORT_DESCRIPTION
# but instead of NA use the DESCRIPTION
func_df$SHORT_DESCRIPTION <- ifelse(nchar(func_df$SHORT_DESCRIPTION) == 0, 
    func_df$DESCRIPTION, func_df$SHORT_DESCRIPTION)

I assumed that where SHORT_DESCRIPTION is missing, DESCRIPTION should be used instead. I think the function expects no NAs in this column. I could be wrong in making this assumption, but it works nonetheless.
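
A slightly more defensive variant of the same fallback, treating both empty strings and NA as missing, might look like the sketch below. fill_short_description() is a hypothetical helper name; the column names are the ones used above.

```r
# Hypothetical helper: use 'full' wherever 'short' is NA or an empty string.
# If 'full' is also NA at that position, the result stays NA.
fill_short_description <- function(short, full) {
  missing_short <- is.na(short) | !nzchar(short)
  ifelse(missing_short, full, short)
}
```

It would be applied as func_df$SHORT_DESCRIPTION <- fill_short_description(func_df$SHORT_DESCRIPTION, func_df$DESCRIPTION), after which any remaining NA values would indicate rows where DESCRIPTION is missing as well.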

...
Creating package in /Users/ka36530_ca/open_issues/org.At.tair.db/org.Atair10.eg.db
Now deleting temporary database file
[1] "/Users/ka36530_ca/open_issues/org.At.tair.db/org.Atair10.eg.db"
There were 50 or more warnings (use warnings() to see the first 50)
> install.packages("./org.Atair10.eg.db", repos = NULL,
+                  type = "source")
Installing package into ‘/Users/ka36530_ca/R-stuff/bin/R-4-0/4.0-Bioc-3.12/library’
(as ‘lib’ is unspecified)
* installing *source* package ‘org.Atair10.eg.db’ ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (org.Atair10.eg.db)

I hope this helps!