Bioconductor / AnnotationForge

Tools for building SQLite-based annotation data packages
https://bioconductor.org/packages/AnnotationForge

makeOrgPackage: out of memory #36

Closed. yanpd01 closed this issue 1 year ago.

yanpd01 commented 2 years ago

Description

I have a table with 5,375,075 rows and 3 columns. No matter how much memory the server has, makeOrgPackage reports an out-of-memory error. But when I trim the table to 3,500,000 * 3, makeOrgPackage works fine. How can I solve this problem?

Here's some information:

commands

library(AnnotationForge)
load("gene2go.rda")
dim(gene2go)
# full 5,375,075-row table: fails with the out-of-memory error shown below
makeOrgPackage(
    go = gene2go,
    maintainer = "zhangsan <zhangsan@genek.cn>",
    author = "zhangsan",
    outputDir = "./",
    tax_id = 0000,
    genus = "M",
    species = "y10",
    goTable = "go",
    version = "1.0"
)
# first 3,500,000 rows of the same table: completes without error
makeOrgPackage(
    go = gene2go[1:3500000, ],
    maintainer = "zhangsan <zhangsan@genek.cn>",
    author = "zhangsan",
    outputDir = "./",
    tax_id = 0000,
    genus = "M",
    species = "y11",
    goTable = "go",
    version = "1.0"
)

Error information:

....
'select()' returned many:1 mapping between keys and columns
Dropping GO IDs that are too new for the current GO.db
Populating go table:
go table filled
Populating go_bp table:
go_bp table filled
Populating go_cc table:
go_cc table filled
Populating go_mf table:
go_mf table filled
Error: out of memory

system information:

OS: CentOS Linux release 7.5.1804 (Core)
platform: x86_64-pc-linux-gnu
CPU: Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz
CPU(s): 192   ( 4(Sockets) * 24(Cores) * 2(Threads) )
Total online memory: 1.5T
R: R version 4.2.0 (2022-04-22)
AnnotationForge: 1.38.0

files

gene2go.rda.zip

vjcitn commented 1 year ago

tagging @Kayla-Morrell @jmacdon

jmacdon commented 1 year ago

@vjcitn, @yanpd01, @Kayla-Morrell This seems like an edge case. The problem occurs when the go_all tables are constructed, which map each 'gene' to its direct GO terms as well as all of their ancestors. The source data has over 5M 'genes', which don't appear to actually be genes, as they have GIDs like 'Cluster123456'. Are there organisms with over 5M genes? It's possible, I suppose. Anyway.

Generating the go_all table by necessity ends up making a much larger data.frame than we started with (which was simply the mapping of genes to their direct terms), because we map each gene to the direct term plus all of its ancestor terms. On average this is roughly a 15X increase in the number of rows. We start with just over 5M rows in the gene2go object, and in the end we have three data.frames of 16M - 45M rows each (these go into the SQLite db as the go_bp_all, go_mf_all, and go_cc_all tables), and then we rbind those three data.frames to make the go_all table.
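As a rough sketch (not the actual AnnotationForge code, and only for the BP branch), the expansion looks roughly like this: each direct term is replaced by itself plus everything in GO.db's GOBPANCESTOR map, which is what multiplies the row count.

library(GO.db)

# toy gene-to-direct-term mapping; the real gene2go object has ~5.4M rows
gene2go <- data.frame(
    GID      = c("Cluster1", "Cluster2"),
    GO       = c("GO:0006915", "GO:0008152"),
    EVIDENCE = "IEA"
)

anc <- as.list(GOBPANCESTOR)  # GO ID -> all ancestor GO IDs

# expand each row to the direct term plus all of its ancestors
expanded <- do.call(rbind, lapply(seq_len(nrow(gene2go)), function(i) {
    terms <- setdiff(c(gene2go$GO[i], anc[[gene2go$GO[i]]]), "all")
    data.frame(GID = gene2go$GID[i], GO = terms, EVIDENCE = gene2go$EVIDENCE[i])
}))

nrow(gene2go)   # 2 rows in
nrow(expanded)  # many more rows out; ~15X on average for the real data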

There are also some large list objects made along the way, so it is not surprising that R runs out of memory. There is inevitably a great deal of copying of objects, and just holding the three go_xx_all data.frames plus the go_all data.frame (over 160M rows in total) will be problematic.

This could probably be fixed by attaching the GO.sqlite DB to the newly generated DB and building the tables directly with SQL queries, rather than pulling the data into R and then dumping it back into the new DB. But that would mean writing and debugging new code for what appears to be an edge case (this hasn't, IIRC, ever come up before).
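To make the idea concrete, here is a minimal sketch of that SQL-side approach for the BP branch. This is not existing AnnotationForge code: the output db path is hypothetical, and the table and column names (go_bp, go.go_term, go.go_bp_offspring, _id, _offspring_id, evidence) are assumptions about the org-package and GO.db schemas that would need to be checked against the real databases.

library(RSQLite)

org_db    <- dbConnect(SQLite(), "org.My10.eg.sqlite")               # hypothetical new org db
go_sqlite <- system.file("extdata", "GO.sqlite", package = "GO.db")  # sqlite file shipped with GO.db

# make the GO.db tables visible to the new db under the 'go' prefix
dbExecute(org_db, sprintf("ATTACH DATABASE '%s' AS go", go_sqlite))

# build go_bp_all inside SQLite: each gene's direct BP terms plus every
# ancestor of those terms, so the expanded rows never exist as an R object
dbExecute(org_db, "
    CREATE TABLE go_bp_all AS
    SELECT _id, go_id, evidence FROM go_bp
    UNION
    SELECT bp._id, anc.go_id, bp.evidence
    FROM go_bp AS bp
    JOIN go.go_term         AS t   ON t.go_id = bp.go_id
    JOIN go.go_bp_offspring AS off ON off._offspring_id = t._id
    JOIN go.go_term         AS anc ON anc._id = off._id
")

dbDisconnect(org_db)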

vjcitn commented 1 year ago

Thanks for all these details, Jim. Sounds like a case for a community-developed makeBigOrgPackage that addresses these issues for organisms with millions of genes. We will review pull requests as they emerge. OK to close?

jmacdon commented 1 year ago

OK to close.

hpages commented 1 year ago

Isn't that a good use case for data.table?
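For example (just a sketch, not a tested patch): with data.table, each distinct term could be expanded once and then joined back to the genes, and rbindlist() plus the join avoid most of the intermediate copies that repeated base rbind() calls make.

library(data.table)
library(GO.db)

# toy direct mapping; the real table has millions of rows
gene2go <- data.table(
    GID      = c("Cluster1", "Cluster2"),
    GO       = c("GO:0006915", "GO:0008152"),
    EVIDENCE = "IEA"
)

anc <- as.list(GOBPANCESTOR)  # GO ID -> all ancestor GO IDs

# expand each *distinct* term once to itself plus its ancestors ...
term_map <- rbindlist(lapply(unique(gene2go$GO), function(g)
    data.table(GO = g, ALL_GO = setdiff(c(g, anc[[g]]), "all"))))

# ... then join back to the genes in one step
go_bp_all <- term_map[gene2go, on = "GO", allow.cartesian = TRUE]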