colearendt / xlsx

An R package to interact with Excel files using the Apache POI java library
https://colearendt.github.io/xlsx/
86 stars 32 forks source link

Obscure error message from unclear error #203

Open e-clin opened 1 year ago

e-clin commented 1 year ago

I tried to use the purrr and xlsx packages together to read multiple password-protected .xlsx files in a folder into a nested data frame. Each file has multiple sheets. I plan to:

  1. List and store the file names in a "file" column.
  2. Use purrr::map_chr() and xlsx::getSheets() to get the names of the sheets in each file in a "sheets" column.
  3. Use a nested purrr::map() and purrr::map_dfr() to apply xlsx::read.xlsx() to every cell in the "file" column and every cell in the "sheets" column to read all sheets in a file and row-bind them into a data frame into a "data" column.

My code is as follows:

df <- tibble(
  file = list.files(path_to_files),
  sheets = map(
    file,
    \(file) paste0(path_to_files, file) %>%
      xlsx::loadWorkbook() %>%
      xlsx::getSheets() # %>% 
  #    names()
  )#,
  #  Code below has not been run
  #  data = map(
  #    file,
  #    \(file) map_dfr(
  #      sheets,
  #      \(sheet) xlsx::read.xlsx(
  #         paste0(path_to_files, file),
  #         sheetName = sheet
  #      )
  #    )
  #  )
)

Somehow this code only ran once. When I tried to add the names() line to extract the list names (the sheet names) from the Java objects in the "sheets" column, I encountered the following error:

Error in ._jobjRef_dollar(x[["jobj"]], name) : no field, method or inner class called 'use_cli_format'

The Traceback seemed to suggest that there was some overhead limit exceedance error:

stop(structure(list(message = "java.lang.OutOfMemoryError: GC overhead limit exceeded", call = .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, .jcast(if (inherits(o, "jobjRef") || inherits(o, "jarrayRef")) o else cl, "java/lang/Object"), .jnew("java/lang/String", ...

Even if I removed the newly added names() line didn't help. I had to restart the R session to be able to rerun the code.

I also had this problem before with xlsx::read.xlsx(), when I was experimenting with my code. Once the above error occurred, even previously working code ran into this problem. Again, I had to restart the R session to be able to rerun the code.

Was it just me doing something wrong?

colearendt commented 2 months ago

Yeah, this sounds like an issue with RAM.

I thought we had some docs on this, but apparently not. You have to modify the setup for the JVM before it is started (i.e. before you load any packages, because many packages will start the JVM).

options(java.parameters = "-Xmx8000m")

https://stackoverflow.com/a/44143679/6570011