Closed yanpd01 closed 1 year ago
tagging @Kayla-Morrell @jmacdon
@vjcitn, @yanpd01, @Kayla-Morrell This seems like an edge case. The problem occurs when the go_all tables are constructed, which map each 'gene' to its direct GO terms as well as all ancestors. The source data has over 5M 'genes', which don't appear to be genes, as they have GIDs like 'Cluster123456'. Are there organisms with over 5M genes? It's possible, I suppose. Anyway.
Generating the go_all table by necessity ends up making a much larger data.frame than we started with (which was simply the mapping of genes to their direct term), because we are mapping to the direct term plus all the ancestor terms. On average this is approximately a 15X increase in the number of rows. We start with just over 5M rows in the gene2go object, and in the end we have three data.frames, each with 16M - 45M rows (these go into the SQLite db as the go_bp_all, go_mf_all, and go_cc_all tables), and then we rbind those three data.frames to make the go_all table.
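To make the row growth concrete, here is a minimal sketch of the direct-term-plus-ancestors expansion on toy data. It is written in Python for brevity and is not the package's actual code; the term IDs, the `PARENTS` hierarchy, and the helper names `ancestors` and `expand` are all hypothetical, and the evidence column is left out.

```python
# Toy GO hierarchy: term -> direct parents (hypothetical IDs, not real GO terms).
PARENTS = {
    "GO:D": ["GO:B", "GO:C"],
    "GO:B": ["GO:A"],
    "GO:C": ["GO:A"],
    "GO:A": [],
}


def ancestors(term):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def expand(gene2go):
    """Turn (gene, direct term) rows into rows for the direct term plus all ancestors."""
    rows = set()
    for gene, term in gene2go:
        rows.add((gene, term))
        for anc in ancestors(term):
            rows.add((gene, anc))
    return sorted(rows)


gene2go = [("Cluster1", "GO:D"), ("Cluster2", "GO:B")]
go_all = expand(gene2go)
print(len(gene2go), "direct rows ->", len(go_all), "rows with ancestors")
# prints: 2 direct rows -> 6 rows with ancestors
```

Even in this four-term toy hierarchy two input rows become six, and real GO terms sit much deeper in the graph, which is where the roughly 15X growth comes from.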
There are also some large list objects made along the way, so it is not surprising that R runs out of memory. There has to be tons of copying of objects, and just holding the three go_xx_all data.frames plus the go_all data.frame (over 160M rows, total) will be problematic.
This could probably be fixed by attaching the GO.sqlite DB to the newly generated DB and generating the tables directly using SQL queries instead of pulling the data into R and then dumping back into the new DB. But that would necessitate writing and debugging new code for what appears to be an edge case (this hasn't IIRC ever come up before).
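A rough sketch of what the attach-and-query idea could look like, using Python's built-in `sqlite3` module on toy data so it runs anywhere. This is only an illustration under stated assumptions: the file names, the `go_ancestor` table, and the `gid`/`term`/`evidence` columns are hypothetical stand-ins, not the real GO.sqlite or org package schema.

```python
import os
import sqlite3
import tempfile

tmp = tempfile.mkdtemp()
go_path = os.path.join(tmp, "toy_go.sqlite")    # stand-in for GO.sqlite
org_path = os.path.join(tmp, "toy_org.sqlite")  # stand-in for the new org DB

# Toy ancestor table: one row per (term, ancestor) pair. Hypothetical schema.
go = sqlite3.connect(go_path)
go.execute("CREATE TABLE go_ancestor (term TEXT, ancestor TEXT)")
go.executemany(
    "INSERT INTO go_ancestor VALUES (?, ?)",
    [("GO:D", "GO:B"), ("GO:D", "GO:C"), ("GO:D", "GO:A"),
     ("GO:B", "GO:A"), ("GO:C", "GO:A")],
)
go.commit()
go.close()

# Toy gene-to-direct-term table in the new database. Hypothetical schema.
org = sqlite3.connect(org_path)
org.execute("CREATE TABLE go (gid TEXT, term TEXT, evidence TEXT)")
org.executemany(
    "INSERT INTO go VALUES (?, ?, ?)",
    [("Cluster1", "GO:D", "IEA"), ("Cluster2", "GO:B", "IEA")],
)
org.commit()

# Attach the GO database and let SQLite build go_all itself, so the
# expanded rows are never materialised in the calling process.
org.execute("ATTACH DATABASE ? AS godb", (go_path,))
org.execute("""
    CREATE TABLE go_all AS
    SELECT gid, term, evidence FROM go
    UNION
    SELECT g.gid, a.ancestor, g.evidence
    FROM go AS g
    JOIN godb.go_ancestor AS a ON a.term = g.term
""")
org.commit()

n_rows = org.execute("SELECT COUNT(*) FROM go_all").fetchone()[0]
print(n_rows)  # prints 6 for the toy data above
org.close()
```

The point of the design is that the join and the write both happen inside SQLite, so memory use in the calling process stays flat no matter how many rows go_all ends up with. The same statements could in principle be sent from R through a database connection, but that is untested here.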
Thanks for all these details Jim. Sounds like a case for a community-developed makeBigOrgPackage that addresses these issues for organisms with millions of genes. We will review pull requests as they emerge. OK to close?
OK to close.
Isn't that a good use case for data.table?
Description
I have a table with 5,375,075 rows and 3 columns. No matter how big the server memory is, makeOrgPackage reports an error: out of memory. But when I crop the table to 3,500,000 rows by 3 columns, makeOrgPackage works fine. How can I solve this problem? Here's some information:
commands
Error information:
system information:
files
gene2go.rda.zip