TNC-NMFO / NWLAND

carbon accounting model

plot_caland() — out of memory error when loading CALAND() outputs #71

Closed: aj1s closed this issue 3 years ago

aj1s commented 3 years ago

plot_caland.r was run at commit 45f9a2611e265b0308ff17d614ffd3d40ccecf26. The following error was returned whether the script was run from RStudio or from the command line:

Error: OutOfMemoryError (Java): GC overhead limit exceeded

aj1s commented 3 years ago

Attempting to modify plot_caland.r to remove the Java dependency by replacing {XLConnect} with {readxl} (and, if needed, {writexl}), neither of which has external dependencies.

Note that if the {readxl} functions prove too difficult to splice into the for loops of plot_caland.r (the package is intended to be used with base::lapply() rather than explicit loops), other potential options include {openxlsx} and {rio}. See: https://stackoverflow.com/questions/12945687/read-all-worksheets-in-an-excel-workbook-into-an-r-list-with-data-frames

The following lists (1) each line in plot_caland.r that calls an {XLConnect} function (all occur within for loops) and (2) the closest matching {readxl} call as a replacement.

# Line 333: Load an Excel workbook
# XLConnect::loadWorkbook() first, then the closest {readxl} equivalent. {readxl} has no
# workbook object: read_xls(path, sheet = NULL, ...) takes the file path (a string) directly,
# so scen_wrkbk is effectively unused in the replacement version.
scen_wrkbk = loadWorkbook(data_file)
scen_wrkbk = read_xls(data_file)

# Line 336: List all sheet names in an Excel workbook
# XLConnect::getSheets() first, then readxl::excel_sheets().
scen_sheets = getSheets(scen_wrkbk)
scen_sheets <- excel_sheets(data_file)

# Line 343: Read data from a sheet in an Excel workbook (to populate 'scen_df_list')
# XLConnect::readWorksheet() first, then readxl::read_xls().
scen_df_list[[i]] <- readWorksheet(scen_wrkbk, i, startRow = 1)
scen_df_list[[i]] <- read_xls(data_file, sheet = i)

# Lines 570 AND 821: Read data from a sheet in an Excel workbook
# XLConnect::readWorksheet() first, then readxl::read_xls().
temp_df <- readWorksheet(scen_wrkbk, lind, startRow = 1)
temp_df <- read_xls(data_file, sheet = lind)

# Lines 1648 AND 1775: Read data from a sheet in an Excel workbook
# XLConnect::readWorksheet() first, then readxl::read_xls().
BC_df <- readWorksheet(scen_wrkbk, i+2, startRow = 1)
BC_df <- read_xls(data_file, sheet = i + 2)

(The first argument of readxl::read_xls() and readxl::excel_sheets() is the path to the xls/xlsx file, given as a string. Inside the for loops, then, the {readxl} functions are pointed directly at data_file, side-stepping the scen_wrkbk workbook object entirely.)
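Put together, a minimal sketch of the {readxl}-based loading step in isolation (assuming data_file holds the path string to one scenario workbook, as in plot_caland.r):

# Minimal sketch of the {readxl}-based loading step; data_file is the path
# to one scenario workbook, as in plot_caland.r.
library(readxl)
scen_sheets <- excel_sheets(data_file)               # all sheet names, in order
scen_df_list <- vector("list", length(scen_sheets))
for (i in seq_along(scen_sheets)) {
  # sheet can be given by position (i) or by name (scen_sheets[i])
  scen_df_list[[i]] <- read_xls(data_file, sheet = i)
}
names(scen_df_list) <- scen_sheets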

The nested for loops that read in the data (lines 230–1891) were extracted from the {XLConnect} and {readxl} versions of the script and then "de-looped", i.e., reduced to two standalone scripts so that incremental results from the two packages could be compared for a single iteration. For this comparison, the historical and alternative scenario outputs from CALAND.r were modified so that each xls contained only two sheets. The run script:

# this enables Java to use up to 64 GB of memory for reading and writing Excel files
options(java.parameters = "-d64 -Xms8g -Xmx64g")  # originally set to "-Xmx8g"

# Install and load {readxl} and {writexl}
install.packages(c("readxl", "writexl"))
library(readxl)
library(writexl)

# load the plot_caland() function (readxl version) into memory
source("plot_caland_readxl.r")

plot_caland(
    scen_fnames = c("Proto_Hist_TwoSheets.xls",
                    "Proto_Alt_TwoSheets.xls"),
    scen_snames = c("Baseline", "A"),
    data_dir = "./outputs/2021_05_14_carbon_input_nwland_testB3_devAll_noOcean",
    reg = c("All_region"),
    lt = c("All_land", "Barren", "Cultivated", "Desert", "Developed_all", "Forest", "Grassland", "Ice", "Meadow", "Riparian", "Savanna", "Shrubland", "Sparse", "Water", "Wetland", "Woodland"),
    own = c("All_own"),
    figdir = "figures"
)

Within this single iteration, both versions return identical data frames for scen_df_list[[i]], temp_df, and BC_df.
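For the record, a sketch of how that comparison can be made in base R, assuming the two de-looped scripts save their results under hypothetical suffixed names (_xlc for {XLConnect}, _rxl for {readxl}):

# Hypothetical object names; the de-looped scripts define the actual ones
all.equal(scen_df_list_xlc[[1]], scen_df_list_rxl[[1]])
all.equal(temp_df_xlc, temp_df_rxl)
all.equal(BC_df_xlc, BC_df_rxl)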

Returning to plot_caland_readxl.r (the full script relying on {readxl}): its run script was modified to assign the original test scenario workbooks to scen_fnames, since all of the worksheets are required for the plotting that begins on line 1895:

scen_fnames = c("NWL_Proto_v1_Hist_protoA_hist_output_mean_BC1_NR120.xls", 
                "NWL_Proto_v1_Alt_A_protoA_hist_output_mean_BC1_NR120.xls")

After a fresh reboot, this run script was run from the command line, which returned:

Error: filepath: C:\Wk_R\USCA_NCS\NWLAND\outputs\2021_05_14_carbon_input_nwland_testB3_devAll_noOcean\NWL_Proto_v1_Hist_protoA_hist_output_mean_BC1_NR120.xls
libxls error: Unable to allocate memory

As the thread at https://github.com/tidyverse/readxl/issues/598 shows, many different problems can produce this error.

Given the persistent memory issues, we may need to consider the lift involved in adapting plot_caland_readxl.r to read in individual csvs rather than workbooks.
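For scoping purposes, the reading side of a csv-based approach could look roughly like the sketch below. It assumes a modified CALAND.r that writes one csv per output sheet, named after the sheet, into a per-scenario directory scen_dir (none of which exists yet):

# Hypothetical: one csv per output sheet in scen_dir, written by a modified CALAND.r
csv_files <- list.files(scen_dir, pattern = "\\.csv$", full.names = TRUE)
scen_df_list <- lapply(csv_files, read.csv, stringsAsFactors = FALSE)
names(scen_df_list) <- tools::file_path_sans_ext(basename(csv_files))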

aj1s commented 3 years ago

The test xls workbooks created by CALAND.r are excessively large, at 1.153 GB each. (Each scenario workbook has 214 sheets, each with 40 rows and 7,640 columns.)

Opening these in Excel and saving to xlsx format brings the file size down to ~248 MB, while saving back to xls format yields ~504 MB. Beyond the format and the means of saving, @tchapman100 speculated that the unexpectedly large CALAND.r exports may reflect many empty cells being saved as part of the data.
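One way to test that speculation, sketched under the assumption that scen_df_list holds the sheets of one workbook as read earlier: drop all-NA columns, re-export, and compare file sizes.

# Sketch only: strip columns that are entirely NA before re-exporting,
# to see whether empty cells account for the inflated size
drop_empty_cols <- function(df) df[, colSums(!is.na(df)) > 0, drop = FALSE]
trimmed_list <- lapply(scen_df_list, drop_empty_cols)
writexl::write_xlsx(trimmed_list, "trimmed_test.xlsx")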

After the original xls exports were re-saved as xlsx, the two run scripts ('2021_06_18_testingPlotCALAND.r' and '2021_06_18_testingPlotCALAND_readxl') were modified so that the re-saved xlsx versions of the historical and alternative scenario outputs were assigned to scen_fnames.

The first (sourcing plot_caland.r, which relies on {XLConnect}) was then run both from the command line and in RStudio; both returned: Error: OutOfMemoryError (Java): GC overhead limit exceeded

The second (sourcing plot_caland_readxl.r, which relies on {readxl}) was then run both from the command line and in RStudio; both returned:

Error in `$<-.data.frame`(`*tmp*`, "Diff", value = c(0, 0, 0, 0, 0, 0,  : 
                                                       replacement has 35 rows, data has 1
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Called from: `$<-.data.frame`(`*tmp*`, "Diff", value = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))

In RStudio, traceback() indicates the error is thrown at line 1879:

`$<-.data.frame`(`*tmp*`, "Diff", value = c(0, 0, 0, 0, 0, 0, ... at plot_caland_readxl.r:1879
`$<-`(`*tmp*`, "Diff", value = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... at plot_caland_readxl.r:1879
plot_caland(scen_fnames = c("NWL_Proto_v1_Hist_ProtoA_hist_output_mean_BC1_NR120.xlsx", ... at plot_caland_readxl.r:1879

Lines 1878 and 1879:

# calculate the difference from the baseline scenario
temp_df$Diff = temp_df$Value - out_cum_ghg_df_list[[oind]][out_cum_ghg_df_list[[oind]]$Scenario == scen_lnames[1] &
sbassett commented 3 years ago

Running the unmodified plot_caland.r on COGEOPROC (256 GB RAM) produces a different error:

Start plot_caland() at Thu Jul 01 22:09:08 2021 
Error in grDevices::pdf(file = filename, ..., version = version) : 
  cannot open file './outputs/2021_05_14_carbon_input_nwland_testB3_devAll_noOcean/figures/All_region/All_land/All_own/All_region_All_land_All_own_NWL_Proto_v1_Hist_protoA_hist_output_mean_BC1_NR120_ghg_ann_comp_output.pdf'
In addition: There were 50 or more warnings (use warnings() to see the first 50)
> xlcMemoryReport()
Amount of free memory in the Java Virtual Machine (JVM):  8710.223 MB
> xlcFreeMemory()
> xlcMemoryReport()
Amount of free memory in the Java Virtual Machine (JVM):  9681.831 MB

This may be evidence that the memory error is not encountered on COGEOPROC. The new error is likely the result of the full file path being 264 characters, while the maximum allowed is 260.

I did not produce a second CALAND run with scen_file_arg = "NWL_Proto_v1_Alt_A_protoA_hist.xls", so I will do that and try again. File path names may be too long; I will record that as a new issue if it persists once two runs are available.
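A quick way to confirm the path-length explanation, assuming data_dir holds the outputs directory used in the run script above (the pdf name is taken from the error message):

# Sketch: count the characters in the absolute path of the figure that failed to open
fig_dir  <- file.path(normalizePath(data_dir), "figures", "All_region", "All_land", "All_own")
pdf_name <- "All_region_All_land_All_own_NWL_Proto_v1_Hist_protoA_hist_output_mean_BC1_NR120_ghg_ann_comp_output.pdf"
nchar(file.path(fig_dir, pdf_name))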

FYI from second CALAND Run:

Finished CALAND at Fri Jul 02 00:22:45 2021 
There were 27 warnings (use warnings() to see them)
> xlcMemoryReport()
Amount of free memory in the Java Virtual Machine (JVM):  9289.08 MB
> xlcFreeMemory()
> xlcMemoryReport()
Amount of free memory in the Java Virtual Machine (JVM):  20278.28 MB
aj1s commented 3 years ago

Regarding the prior, separate strategy of replacing {XLConnect} (and running the code on our own workstations)...

It appears that extensive modification of CALAND.r and plot_caland.r would be required (i.e., not just a few localized changes) to respectively create and ingest csvs.

Incidentally, each of the 214 sheets in a scenario workbook can be readily converted into its own data frame using {readxl} and base::lapply():

scen_sheets <- excel_sheets(data_file)
num_scen_sheets = length(scen_sheets)
scen_df_list <- lapply(scen_sheets, function(x) read_excel(data_file, sheet = x))
names(scen_df_list) <- scen_sheets
list2env(scen_df_list, envir = .GlobalEnv)   # note: the named list of data frames, not the vector of sheet names

However, moving from these 214 dfs through the remaining nested for loops in plot_caland.r would appear to require a considerable rewrite beyond available time and conversancy.

Returning to replacing {XLConnect} in a more stepwise fashion (i.e., at/near lines 333, 336, 343, 570, 821, 1648, and 1775, as noted in the second comment above): an attempt was made using {openxlsx} rather than {readxl}, but doing so resulted in the same error at line 1879:

Error in `$<-.data.frame`(`*tmp*`, "Diff", value = c(0, 0, 0, 0, 0, 0,  : 
                                                       replacement has 35 rows, data has 1

Looking more closely at the cause of this error: beginning at line 295, the empty lists created and named for the data categories (stocks, densities, cumulative flux, cumulative ghg, areas, annual flux, and annual ghg) are not being populated by the loops over the worksheets (lines 400 through 1925).
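That would explain the dimensions reported in the error: if the relevant list element is left at (or subset down to) a single row, assigning the 35-element Diff vector must fail. A minimal reproduction, illustrative only and not taken from plot_caland.r:

# A 1-row data frame cannot take a length-35 replacement column
df <- data.frame(Value = 0)   # stands in for the unpopulated subset
df$Diff <- rep(0, 35)
# Error in `$<-.data.frame`(`*tmp*`, "Diff", ...) :
#   replacement has 35 rows, data has 1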

sbassett commented 3 years ago

plot_caland() appears to be working on COGEOPROC.


sbassett commented 3 years ago

Code seems to be running fine on COGEOPROC! File path names need to be shorter than in the current GitHub repo (see #72) but no memory issues.

Start plot_caland() at Mon Jul 05 08:49:52 2021 
Finish plot_caland() at Mon Jul 05 10:35:22 2021 
There were 50 or more warnings (use warnings() to see the first 50)
> xlcMemoryReport()
Amount of free memory in the Java Virtual Machine (JVM):  3228.484 MB
> xlcFreeMemory()
> xlcMemoryReport()
Amount of free memory in the Java Virtual Machine (JVM):  19926.21 MB
sbassett commented 3 years ago

It may be worth experimenting a bit because it's unlikely to be a coincidence that the JVM maxed out at ~20GB when I specified 200GB.

See Tables 5-2 and 5-3 here: https://docs.oracle.com/cd/E15523_01/web.1111/e13814/jvm_tuning.htm#PERFM161. These describe different virtual machines, but the syntax for setting initial and maximum memory is consistent with what we use to get XLConnect to work.

It appears that we're specifying the max memory with the correct units ("g" or "G" for GB).
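One thing to keep in mind when experimenting: the JVM heap settings are read when rJava initializes, so the options() call has to run before XLConnect (which loads rJava) is attached in a fresh session. A sketch of the intended 200 GB configuration:

# Set JVM options first, then attach XLConnect, then verify what the JVM actually got
options(java.parameters = c("-Xms8g", "-Xmx200g"))
library(XLConnect)
xlcMemoryReport()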