Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Unable to allocate TMP for items in parallel batch counting #5169

Closed matthewgson closed 3 years ago

matthewgson commented 3 years ago

I encountered an issue similar to #4295 but it seems slightly different.

I'm working with a data.table of 890 million rows and 114 columns.

When I group by the hour and minute of the datetime column:

intraday <- dt[, .(
      Nobs = .N,
      col1 = mean(col1, na.rm = TRUE),
      col2 = mean(col2, na.rm = TRUE)
    ), keyby = .(hour(datetime), minute(datetime))]

The following error occurs:

Detected that j uses these columns: qtys_all,vols_all,qtys_f,vols_f,qtys_bd,vols_bd,qtys_mm,vols_mm,qtys_cu,vols_cu,qtys_pc,vols_pc
Finding groups using forderv ... forder.c received 890185979 rows and 2 columns
Error in forderv(byval, sort = keyby, retGrp = TRUE) :
  Unable to allocate TMP for my_n=890185979 items in parallel batch counting

I have run this operation successfully before; the only thing I added this time was the .N part. It worked once I removed that part:

intraday <- dt[, .(
      col1 = mean(col1, na.rm = TRUE),
      col2 = mean(col2, na.rm = TRUE)
    ), keyby = .(hour(datetime), minute(datetime))]

Detected that j uses these columns: qtys_all,vols_all,qtys_f,vols_f,qtys_bd,vols_bd,qtys_mm,vols_mm,qtys_cu,vols_cu,qtys_pc,vols_pc
Finding groups using forderv ... forder.c received 890185979 rows and 2 columns
3.230s elapsed (22.3s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'list(mean(qtys_all, na.rm = T), se_mean(qtys_all), mean(vols_all, na.rm = T), se_mean(vols_all), mean(qtys_f, na.rm = T), se_mean(qtys_f), mean(vols_f, na.rm = T), se_mean(vols_f), mean(qtys_bd, na.rm = T), '
GForce is on, left j unchanged
Old mean optimization changed j from 'list(mean(qtys_all, na.rm = T), se_mean(qtys_all), mean(vols_all,     na.rm = T), se_mean(vols_all), mean(qtys_f, na.rm = T), se_mean(qtys_f),     mean(vols_f, na.rm = T), se_mean(vols_f), mean(qtys_bd, na.rm = T),     se_mean(qtys_bd), mean(vols_bd, na.rm = T), se_mean(vols_bd),     mean(qtys_mm, na.rm = T), se_mean(qtys_mm), mean(vols_mm,         na.rm = T), se_mean(vols_mm), mean(qtys_cu, na.rm = T),     se_mean(qtys_cu), mean(vols_cu, na.rm = T), se_mean(vols_cu),     mean(qtys_pc, na.rm = T), se_mean(qtys_pc), mean(vols_pc,         na.rm = T), se_mean(vols_pc))' to 'list(.External(Cfastmean, qtys_all, T), se_mean(qtys_all), .External(Cfastmean, vols_all, T), se_mean(vols_all), .External(Cfastmean, qtys_f, T), se_mean(qtys_f), .External(Cfastmean, vols_f, T), se_mean(vols_f), '
Making each group and running j (GForce FALSE) ...

  collecting discontiguous groups took 1571.924s for 48 groups
  eval(j) took 279.962s for 48 calls
00:04:59 elapsed (00:20:27 cpu)
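
For anyone who wants to watch the same grouping steps on a machine with less memory, here is a minimal sketch on toy data. The table, its size, and the column values are assumptions made up for illustration (only the column names datetime, col1 and col2 follow the snippet above), and it is not expected to reproduce the allocation error, which was reported at 890 million rows.

```r
library(data.table)

# Toy stand-in for the real table: 1e5 rows spread evenly over two hours.
# All values here are invented for illustration.
n <- 1e5
dt <- data.table(
  datetime = as.POSIXct("2021-01-01 09:00:00", tz = "UTC") + rep_len(0:7199, n),
  col1 = rnorm(n),
  col2 = rnorm(n)
)

# Same shape of query as in the report, including .N.
# verbose = TRUE prints the forderv and optimization messages for this call only.
intraday <- dt[, .(
      Nobs = .N,
      col1 = mean(col1, na.rm = TRUE),
      col2 = mean(col2, na.rm = TRUE)
    ), keyby = .(hour(datetime), minute(datetime)), verbose = TRUE]

print(head(intraday))
```

The result has one row per hour and minute pair, and the Nobs column sums to the number of input rows.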

My server has 1TB of memory, so I don't believe this is a memory issue, though I still need to check that thoroughly.

Here's sessionInfo()

> sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] matrixStats_0.59.0 tictoc_1.0.1       forcats_0.5.1      stringr_1.4.0
 [5] dplyr_1.0.7        purrr_0.3.4        readr_2.0.1        tidyr_1.1.3
 [9] tibble_3.1.4       ggplot2_3.3.5      tidyverse_1.3.1    fst_0.9.4
[13] data.table_1.14.0

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.1 haven_2.4.3      colorspace_2.0-2 vctrs_0.3.8
 [5] generics_0.1.0   utf8_1.2.2       rlang_0.4.11     pillar_1.6.2
 [9] glue_1.4.2       withr_2.4.2      DBI_1.1.1        dbplyr_2.1.1
[13] modelr_0.1.8     readxl_1.3.1     lifecycle_1.0.0  munsell_0.5.0
[17] gtable_0.3.0     cellranger_1.1.0 rvest_1.0.1      tzdb_0.1.2
[21] parallel_4.0.5   fansi_0.5.0      broom_0.7.9      Rcpp_1.0.7
[25] scales_1.1.1     backports_1.2.1  jsonlite_1.7.2   fs_1.5.0
[29] hms_1.1.0        stringi_1.7.4    grid_4.0.5       cli_3.0.1
[33] tools_4.0.5      magrittr_2.0.1   crayon_1.4.1     pkgconfig_2.0.3
[37] ellipsis_0.3.2   xml2_1.3.2       reprex_2.0.1     lubridate_1.7.10
[41] assertthat_0.2.1 httr_1.4.2       rstudioapi_0.13  R6_2.5.1
[45] compiler_4.0.5

MichaelChirico commented 3 years ago

Can you check if maybe it's related to https://github.com/Rdatatable/data.table/issues/5077? Updating to the latest dev version would solve the issue if so.
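
A minimal sketch of one way to do that update. The repository URL is an assumption based on the project's installation notes and should be checked against the current instructions before use; the 1.14.1 threshold is simply the development version mentioned in this thread.

```r
# Check the installed version first: the error above was reported on 1.14.0.
installed <- packageVersion("data.table")
needs_update <- installed < "1.14.1"

if (needs_update) {
  # Assumption: this is the project's development-build repository.
  install.packages("data.table", repos = "https://Rdatatable.gitlab.io/data.table")
}

print(installed)
```

Restart R after installing so the new version is the one that gets loaded.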

matthewgson commented 3 years ago

Definitely, I'll update and run the code once again. I'll see if 1.14.1 fixes this issue.

matthewgson commented 3 years ago

@MichaelChirico You're right, it works on version 1.14.1. Thanks!

MichaelChirico commented 3 years ago

awesome, glad to hear it!