colearendt / xlsx

An R package to interact with Excel files using the Apache POI java library
https://colearendt.github.io/xlsx/
85 stars 32 forks source link

Infinite Memory Bug #101

Open pbarill opened 6 years ago

pbarill commented 6 years ago

There's a bug that limits the use of this package, unless you have infinite memory. When a task is performed, it takes some memory (duh), but memory consumption is strictly non-decreasing. Successfully completed that task? Try repeating it. Even if I allow more memory, at some point it becomes unmanageable.

I know this issue has been raised before elsewhere (Google Code, stackoverflow, etc.), but is there any hope for a solution in the future? I found the same problem with XLConnect. Question: Is it common to everything relying on rJava?

Try this:

library(xlsx)

mock <- data.frame(matrix(0, 500,500))
filename <- "lametest.xlsx"
if (file.exists(filename))
  file.remove(filename)

for (i in 1:10) {
    if (file.exists(filename)) {
            wb <- loadWorkbook(filename)
    } else {
            wb  <- createWorkbook()
    }
  thatSheet <- createSheet(wb, as.character(i))
  addDataFrame(mock, thatSheet)
  saveWorkbook(wb, filename)
  message(i)
}

At first I tried to create all sheets and then to dump the whole workbook into the file with a unique saveWorkbook` call. No way this would work.

Ok, I'll try to split that in pieces as in the above. Result: Can't make it past iteration 4 without running out of memory! I have set options(java.parameters = "-Xmx2500m"), and given that the output file should be less than 15 megabytes, this is quite disturbing.

What about rJava::.jcall("java/lang/System", method = "gc")? No effect.

After each iteration, memory usage only keeps growing. But why? It seems that whatever is created in the JVM stays there until I terminate R.

That's unfortunate as this package is well designed and superbly documented. The idea of relying on solid libraries out there is very compelling, as opposed to reinvent the wheel. But then I'm not sure that provided wheel can be used.

colearendt commented 6 years ago

That seems horrible! Thanks for the report and the reproducible example! I'll see if I have a chance to reproduce in my environment this week or next.

pbarill commented 6 years ago

Thanks for the reply and looking into this!

A few words on my environment:

R 3.4.1 rJava 0.9.10 xlsx 0.6.1 Java: openjdk version 1.8.0_171 OpenJDK Runtime Environment (IcedTea 3.8.0) (Gentoo icedtea-3.8.0) OS: Gentoo x86_64, Linux kernel 4.4.128

I also gave it a quick try on Windows (R 3.5.1, Java(TM) SE Runtime Environment (build 1.8.0_172-b11)). Not anyhow better (main difference was that it was hard to kill R when it became out of control). So that might not be a Gentoo specific issue. I might try with other distros if you (happily) can't replicate it.

colearendt commented 5 years ago

This is definitely reproducible. I think this is hitting R's copy-on-modify semantics along with Java objects not getting cleaned up in rJava / the JVM. It will take a good bit more digging to figure out if / where this could be improved. We are using a very old version of the Apache POI Java library, so it very well could be that upgrading the library could reap some performance gains / some of these object persistence leaks.

If you or someone else finds this as a blocker, the openxlsx package uses C++, and is much quicker for this type of simple task. It finishes almost instantly:

library(openxlsx)
mock <- data.frame(matrix(0, 500,500))
filename <- tempfile()
wb <- openxlsx::createWorkbook()
li <- as.list(1:10)
lapply(li, function(x){
  message(x)
  sheet <- openxlsx::addWorksheet(wb, as.character(x))
  openxlsx::writeData(wb, sheet, mock)
  }
)
openxlsx::saveWorkbook(wb, filename)