Closed vdhanda closed 1 year ago
Thanks for the report! Could you create a self-contained minimal reproducible example?
Looking at your function, I'm not sure fread is the problem. Does as.data.table(read.csv(...)) cause similar issues? It's possible it's getting hung up in the na.omit part because data.table is trying to add an index.
As far as I can tell, a more idiomatic way of doing this without using mclapply could be:
dt = fread(file)
dt[!is.na(x_1),
   tsoutliers(ts(data = x_4,
                 frequency = 12,
                 start = c(year(first(x_3)),
                           month(first(x_3))))
   ),
   by = x_1]
Hi. The report has the sample data as an attachment and also the R example code. Sorry, the formatting got tweaked. Is there a better way to provide an example?
Also, note the fread does not hang! It returns. The mclapply hangs.
Since this is a minimal example, it does not include a lot of other processing that takes place in the mclapply.
Self-contained would mean that the data would be reproducible within the session without outside sources. For example:
dt = data.table(x_1 = sample(5, 1e5, replace = TRUE),
                x_3 = sample(seq(as.Date('2000/01/01'), as.Date('2022/01/01'), by = "day"), 1e5, replace = TRUE),
                x_4 = rnorm(1e5))
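Building on a generated table like this one, a self-contained reproduction of the reported pattern might look like the sketch below. It writes the data to a temporary file, reads it back with fread at a given nThread, and then calls forecast::tsoutliers inside mclapply. The helper name run_once, the temporary CSV, the smaller row count, the monthly frequency, and mc.cores = 2 are all assumptions made for illustration; the original script may differ. mclapply relies on forking, so this is intended for Linux or macOS.

```r
library(data.table)
library(parallel)

set.seed(1)
n <- 1e4  # smaller than the 1e5 above, to keep the run short (assumption)
dt <- data.table(
  x_1 = sample(5, n, replace = TRUE),
  x_3 = sample(seq(as.Date("2000-01-01"), as.Date("2022-01-01"), by = "day"),
               n, replace = TRUE),
  x_4 = rnorm(n)
)

# Round-trip through a file so that fread() is actually exercised.
csv <- tempfile(fileext = ".csv")
fwrite(dt, csv)

# Hypothetical helper: read with a given thread count, then fork workers.
run_once <- function(n_thread, cores = 2L) {
  d <- fread(csv, nThread = n_thread)
  groups <- split(d, by = "x_1")
  res <- mclapply(groups, function(g) {
    # Skip the forecast call if the package is not installed.
    if (!requireNamespace("forecast", quietly = TRUE)) return(NULL)
    forecast::tsoutliers(ts(g$x_4, frequency = 12))
  }, mc.cores = cores)
  cat("Finished run with nThread =", n_thread, "\n")
  res
}

for (k in 1:2) run_once(k)
```

If the loop prints a "Finished run" line for both thread counts, the hang is not reproduced in that environment.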
Also, note the fread does not hang! It returns. The mclapply hangs.
The thread is titled "fread interferes with calling forecast::tsoutliers under parallel::mclapply". The title strongly suggests that fread() is the problem :). In addition to the as.data.table(read.csv(...)) from above, can you try this?
options(datatable.use.index = FALSE)
And also, did you try my suggestion from before? While parallel computing can be faster than single thread, it's not always guaranteed to be faster.
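To check whether forking pays off for a given workload, one option is to time the serial and parallel versions side by side. The sketch below uses only base R and the parallel package; slow_square is a hypothetical stand-in for the real per-group work, and the choice of 2 cores is an assumption.

```r
library(parallel)

# Hypothetical stand-in for an expensive per-item computation.
slow_square <- function(i) {
  Sys.sleep(0.01)
  i^2
}
inputs <- 1:20

t_serial <- system.time(serial <- lapply(inputs, slow_square))[["elapsed"]]

# mclapply cannot fork on Windows, so fall back to a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
t_forked <- system.time(
  forked <- mclapply(inputs, slow_square, mc.cores = cores)
)[["elapsed"]]

# Both approaches should produce the same values.
stopifnot(isTRUE(all.equal(unlist(serial), unlist(forked))))
cat(sprintf("lapply: %.2fs, mclapply (%d cores): %.2fs\n",
            t_serial, cores, t_forked))
```

For very cheap per-item work, the cost of forking and collecting results can outweigh the gain, which is why measuring on the real data is worthwhile.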
I see what you mean about reproducing the data in code. I opted to share the actual data that was causing the problem.
And yes, in this case, it appears that fread is causing the issue as the code in mclapply only hangs when the nThread argument to fread is greater than 1. Additionally, there is no problem when read.csv is used instead of fread.
This appears to be an issue with R 4.2.1 and data.table. Under R version 4.2.0, the problem does not happen.
The parallel package doesn't seem to have changed much: https://github.com/wch/r-source/commits/trunk/src/library/parallel/src/fork.c If you are able to reproduce the problem using locally generated data, then please update your first post.
I tried to reproduce your problem (on Ubuntu and DT master branch) but couldn't.
R 4.2.0, 4.2.1, 4.2.2: all worked, none hung, and each printed "Finished run with nThread = 2" at the end.
Is anyone else able to reproduce the problem? If not, then it is likely a problem with your environment.
Did you build R from source? If not, could you try building 4.2.1 and see if it makes any difference?
wget https://cloud.r-project.org/src/base/R-4/R-4.2.1.tar.gz
tar -xf R-4.2.1.tar.gz
cd R-4.2.1
./configure --without-recommended-packages
make
./bin/R
@vdhanda could you also try the development version? 1.14.4 was recently published to CRAN but is far behind the master branch.
I'm not sure I fully understand what happened here, but the issue was unique to my environment. In case it helps someone else, here are some details.
I usually install/update R from the EPEL repo. This issue appeared with R 4.2.1. I've successfully used data.table for many years. It's unparalleled in speed and capacity to handle large amounts of data.
@jangorecki thanks for trying to reproduce the problem. It gave me the confidence to try a different installation.
To test whether it was an issue with my installation, I installed R 4.2.1 from RStudio (https://docs.rstudio.com/resources/install-r/), and the issue disappeared. I upgraded to R 4.2.2 from RStudio and everything continued to work fine.
If someone with a deeper understanding of RPM and repos can find the time, it would be helpful to understand the difference between the two distributions. It took me several days to identify the problem and find a solution.
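One way to start comparing two R installations is to print the same build details from each and diff the output. The sketch below is only a starting point and does not claim to identify the actual cause here; the list name install_info and the selection of fields are assumptions.

```r
# Collect basic facts about the running R installation.
install_info <- list(
  r_version = R.version.string,
  r_home    = R.home(),
  platform  = R.version$platform,
  blas      = extSoftVersion()[["BLAS"]],  # path to the BLAS library in use
  lapack    = La_library(),                # path to the LAPACK library in use
  lib_paths = .libPaths()
)
str(install_info)

# If data.table is installed, also report its version and threading setup.
if (requireNamespace("data.table", quietly = TRUE)) {
  cat("data.table version:",
      as.character(packageVersion("data.table")), "\n")
  data.table::getDTthreads(verbose = TRUE)
}
```

Running this once under the EPEL build and once under the other build, then comparing the two outputs, could show whether they link against different BLAS/LAPACK libraries or report different OpenMP settings.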
Thank you for the help.
In situations like this, I would say it is always good to start with the most canonical way to set up an environment. So no EPEL, RPM, or RStudio; only r-project.org. Run the code with no IDE or GUI: just a terminal and R.
I'm using the forecast package for some time series analysis. The forecast functions are called within mclapply. When fread is called with nThread = 1, everything works as expected. However, if fread is called with nThread > 1, the mclapply hangs. Note that once the call hangs and execution is interrupted with Ctrl-C, rerunning the code hangs even on the first call, which corresponds to nThread = 1.
The data file is available here: https://drive.google.com/file/d/1ilzZ0UKXc25K1MBQpNgqRMAjHRJ-LZaK/view?usp=sharing