Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

fread interferes with calling forecast::tsoutliers under parallel::mclapply #5500

Closed vdhanda closed 1 year ago

vdhanda commented 1 year ago

I'm using the forecast package for time series analysis, calling its functions inside mclapply. When fread is called with nThread = 1, everything works as expected. However, if fread is called with nThread > 1, the mclapply call hangs. Note that once the call has hung and been interrupted with Ctrl-C, rerunning the code hangs even on the first call, the one with nThread = 1.

The data file is available here: https://drive.google.com/file/d/1ilzZ0UKXc25K1MBQpNgqRMAjHRJ-LZaK/view?usp=sharing

library(data.table)
library(parallel)
library(forecast)

testDTfread = function(num_threads = 2) {
  pd = fread(file = "./datatable_fread_testdata.csv", stringsAsFactors = FALSE, nThread = num_threads)
  # Process each x_1 group in a forked worker
  xx = mclapply(unique(pd$x_1), function(i) {
    x = na.omit(pd[x_1 == i, .(x_3, x_4)])
    # Monthly series starting at the group's first observation date
    x = ts(data = x$x_4, frequency = 12, start = c(year(first(x$x_3)), month(first(x$x_3))))
    tsoutliers(x)
  }, mc.allow.recursive = FALSE)
  return(xx)
}

message("Running with nThread = 1")
testDTfread(1)
message("Finished run with nThread = 1")
message("Running with nThread = 2")
testDTfread(2)
message("Finished run with nThread = 2")

Output of sessionInfo()

R version 4.2.1 (2022-06-23)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Rocky Linux 8.6 (Green Obsidian)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblaso-r0.3.15.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] forecast_8.18     data.table_1.14.4

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9        urca_1.3-3        cellranger_1.1.0  pillar_1.8.1      compiler_4.2.1    tseries_0.10-52   tools_4.2.1       xts_0.12.2       
 [9] nlme_3.1-160      lifecycle_1.0.3   tibble_3.1.8      gtable_0.3.1      lattice_0.20-45   pkgconfig_2.0.3   rlang_1.0.6       DBI_1.1.3        
[17] cli_3.4.1         rstudioapi_0.14   curl_4.3.3        dplyr_1.0.10      generics_0.1.3    vctrs_0.5.0       lmtest_0.9-40     grid_4.2.1       
[25] nnet_7.3-18       tidyselect_1.2.0  glue_1.6.2        R6_2.5.1          fansi_1.0.3       readxl_1.4.1      ggplot2_3.3.6     TTR_0.24.3       
[33] magrittr_2.0.3    scales_1.2.1      assertthat_0.2.1  quantmod_0.4.20   timeDate_4021.106 colorspace_2.0-3  fracdiff_1.5-1    quadprog_1.5-8   
[41] utf8_1.2.2        munsell_0.5.0     zoo_1.8-11

ColeMiller1 commented 1 year ago

Thanks for the report! Could you create a self-contained minimal reproducible example?

Looking at your function, I'm not sure fread is the problem. Does as.data.table(read.csv(...)) cause similar issues? It's possible that it's getting hung up in the na.omit part because data.table is trying to add an index.
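A minimal sketch of that comparison, assuming a small generated CSV in place of the reporter's file (the temp path and the column values are made up for illustration). It reads the same file once with fread and once with read.csv, then checks that both give the same columns, so any difference in what happens later can be attributed to the reader rather than to the data:

```r
library(data.table)

# Hypothetical stand-in for the reporter's file: a small CSV in a temp location.
tmp = tempfile(fileext = ".csv")
set.seed(1)
write.csv(data.frame(x_1 = sample(5L, 100, replace = TRUE),
                     x_4 = round(rnorm(100), 3)),
          tmp, row.names = FALSE)

pd_fread = fread(file = tmp, nThread = 2)   # data.table's multi-threaded reader
pd_base  = as.data.table(read.csv(tmp))     # base R reader, converted afterwards

# Compare the two results column by column.
same_names = identical(names(pd_fread), names(pd_base))
same_x_1   = identical(pd_fread$x_1, pd_base$x_1)
same_x_4   = isTRUE(all.equal(pd_fread$x_4, pd_base$x_4))
c(same_names, same_x_1, same_x_4)
```

If the hang only follows the fread path, that points at the reader (or its threads); if it follows both, the cause is more likely elsewhere.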

As far as I can tell, a more idiomatic way of doing this without using mclapply could be:

dt = fread(file)
dt[!is.na(x_1),
   tsoutliers(ts(data = x_4,
                 frequency = 12,
                 start = c(year(first(x_3)),
                           month(first(x_3))))
              ), 
   by = x_1]
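To check that the grouped form computes the same thing as a loop over groups, here is a small sketch. It assumes generated data and uses count_extreme, a hypothetical stand-in for tsoutliers, so that it runs without the forecast package:

```r
library(data.table)

set.seed(7)
dt = data.table(x_1 = sample(5L, 1000, replace = TRUE),
                x_4 = rnorm(1000))

# Hypothetical stand-in for tsoutliers(): count the points lying more than
# 2 standard deviations from the group mean.
count_extreme = function(v) sum(abs(v - mean(v)) > 2 * sd(v))

# Grouped evaluation in a single data.table call.
by_group = dt[, .(n_extreme = count_extreme(x_4)), keyby = x_1]

# The same computation as an explicit loop over the groups.
by_loop = sapply(sort(unique(dt$x_1)),
                 function(i) count_extreme(dt[x_1 == i, x_4]))

identical(by_group$n_extreme, by_loop)
```

Whether the grouped call is also faster than mclapply depends on how expensive the per-group function is, so it is worth timing both on the real data.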

vdhanda commented 1 year ago

Hi. The report has the sample data as an attachment and also the R example code. Sorry, the formatting got tweaked. Is there a better way to provide an example?

Also, note the fread does not hang! It returns. The mclapply hangs.

Since this is a minimal example, it does not include a lot of other processing that takes place in the mclapply.

ColeMiller1 commented 1 year ago

Self-contained would mean that the data would be reproducible within the session without outside sources. For example:

dt = data.table(x_1 = sample(5, 1e5, replace = TRUE),
                x_3 = sample(seq(as.Date('2000/01/01'), as.Date('2022/01/01'), by = "day"), 1e5, replace = TRUE),
                x_4 = rnorm(1e5))
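Putting that together with the original report, a self-contained reproduction attempt could look roughly like the sketch below. The temp file, the run_once helper, and the per-group body are assumptions made for illustration; in particular, length(x$x_4) is only a placeholder so the sketch runs without forecast, and it would need to be replaced by the ts(...) and tsoutliers(...) calls from the report to exercise the reported behaviour:

```r
library(data.table)
library(parallel)

set.seed(42)
n = 1e4
dt = data.table(x_1 = sample(5L, n, replace = TRUE),
                x_3 = sample(seq(as.Date("2000-01-01"), as.Date("2022-01-01"), by = "day"),
                             n, replace = TRUE),
                x_4 = rnorm(n))

# Round-trip through a file so that fread() and its nThread argument are exercised.
csv_path = tempfile(fileext = ".csv")
fwrite(dt, csv_path)

# run_once is a hypothetical helper; mclapply cannot fork on Windows, so use 1 core there.
run_once = function(num_threads,
                    cores = if (.Platform$OS.type == "windows") 1L else 2L) {
  pd = fread(file = csv_path, nThread = num_threads)
  mclapply(sort(unique(pd$x_1)), function(i) {
    x = na.omit(pd[x_1 == i, .(x_3, x_4)])
    # Placeholder for the real work; swap in tsoutliers(ts(...)) to mirror the report.
    length(x$x_4)
  }, mc.cores = cores)
}

res1 = run_once(1)
res2 = run_once(2)
identical(res1, res2)
```

If both calls return, the session is not affected by the hang for this placeholder workload; a hang on the second call would reproduce the report.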

> Also, note the fread does not hang! It returns. The mclapply hangs.

The thread is titled "fread interferes with calling forecast::tsoutliers under parallel::mclapply". The title is very suggestive that fread() is the problem :). In addition to the as.data.table(read.csv(...)) from above, can you try this?

options(datatable.use.index = FALSE)
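For context, here is a sketch of what that option relates to. A subset such as dt[x_1 == 2L] can create an index on the table as a side effect, which indices() reports. The example below uses generated data and, as an extra precaution that was not part of the suggestion above, also switches off datatable.auto.index in the second half:

```r
library(data.table)

# Start from known settings and remember the previous ones.
old = options(datatable.auto.index = TRUE, datatable.use.index = TRUE)

set.seed(3)
dt = data.table(x_1 = sample(5L, 1000, replace = TRUE), x_4 = rnorm(1000))
invisible(dt[x_1 == 2L])      # an == subset may add an index on x_1
idx_default = indices(dt)

# Disable index use (and, as an assumption of this sketch, auto-creation too).
options(datatable.auto.index = FALSE, datatable.use.index = FALSE)
dt2 = data.table(x_1 = sample(5L, 1000, replace = TRUE), x_4 = rnorm(1000))
invisible(dt2[x_1 == 2L])
idx_disabled = indices(dt2)

options(old)                  # restore the previous settings

print(idx_default)
print(idx_disabled)
```

If the hang disappears with indices disabled, that would support the idea that index creation inside the forked workers is involved.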

And also, did you try my suggestion from before? Parallel computing can be faster than a single thread, but that is not guaranteed.

vdhanda commented 1 year ago

I see what you mean about reproducing the data in code. I opted to share the actual data that was causing the problem.

And yes, in this case, it appears that fread is causing the issue as the code in mclapply only hangs when the nThread argument to fread is greater than 1. Additionally, there is no problem when read.csv is used instead of fread.

vdhanda commented 1 year ago

This appears to be an issue with R 4.2.1 and data.table. Under R version 4.2.0, the problem does not happen.

jangorecki commented 1 year ago

The parallel package doesn't seem to have changed much: https://github.com/wch/r-source/commits/trunk/src/library/parallel/src/fork.c If you are able to reproduce the problem using locally generated data, please update your first post.

jangorecki commented 1 year ago

I tried to reproduce your problem (on Ubuntu and the data.table master branch) but couldn't. R 4.2.0, 4.2.1, and 4.2.2 all worked: none of them hung, and each printed "Finished run with nThread = 2" at the end.

Is anyone else able to reproduce the problem? If not, then it is likely a problem with your environment.

Did you build R from source? If not, could you try building 4.2.1 and see if it makes any difference?

wget https://cloud.r-project.org/src/base/R-4/R-4.2.1.tar.gz
tar -xf R-4.2.1.tar.gz
cd R-4.2.1
./configure --without-recommended-packages
make
./bin/R

@vdhanda could you also try the devel version? 1.14.4 was recently published to CRAN but is far behind the master branch.

vdhanda commented 1 year ago

I'm not sure I fully understand what happened here, but the issue was unique to my environment. In case it helps someone else, here are some details.

I usually install/update R from the EPEL repo. This issue appeared with R 4.2.1. I've successfully used data.table for many years. It's unparalleled in speed and capacity to handle large amounts of data.

@jangorecki thanks for trying to reproduce the problem. It gave me the confidence to try a different installation.

To test whether it was an issue with my installation, I installed R 4.2.1 from RStudio (https://docs.rstudio.com/resources/install-r/), and the issue disappeared. I then upgraded to R 4.2.2 from RStudio and everything continued to work fine.

If someone with a deeper understanding of RPM and repos can find the time, it would be helpful to understand the difference between the two distributions. It took me several days to identify the problem and find a solution.
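One possible starting point for such a comparison is to run the same small snapshot under each installation and diff the output. This is only a sketch: the env_info list and its field names are made up here, and nothing in it establishes what actually differed between the two builds.

```r
library(data.table)

# Collect a few facts about the running R installation for side-by-side comparison.
env_info = list(
  r_version  = R.version.string,
  r_home     = R.home(),
  platform   = R.version$platform,
  blas       = unname(extSoftVersion()["BLAS"]),  # BLAS library in use, if reported
  lapack     = La_library(),                      # LAPACK library in use
  dt_version = as.character(packageVersion("data.table")),
  dt_threads = getDTthreads(),                    # threads data.table will use
  cores      = parallel::detectCores()
)
str(env_info)
```

The sessionInfo() output in the first post already lists the BLAS/LAPACK library for the EPEL build, so the same fields from the RStudio build would be the natural thing to compare against.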

Thank you for the help.

jangorecki commented 1 year ago

In situations like this, I would say it is always good to start with the most canonical way to set up an environment: no EPEL, RPM, or RStudio, only r-project.org. Run the code with no IDE or GUI, just R in a terminal.