fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0

Number of threads decreased to 1 after re-entering RStudio Server session #112

Closed renkun-ken closed 6 years ago

renkun-ken commented 6 years ago

I'm using the latest development version of fst and I find it quite mysterious that, after re-entering my RStudio session, the number of threads reported by threads_fst() drops from 40 to 1.

Steps to reproduce:

  1. Start a new session in RStudio
  2. Run fst::threads_fst() which, on my server, returns 40
  3. Refresh the webpage of RStudio, leading to re-entering the session
  4. Run fst::threads_fst() again and the number of threads becomes 1

My session info:

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.4.3 parallel_3.4.3 tools_3.4.3    yaml_2.1.15    Rcpp_0.12.14   fst_0.7.3     

I'm using RStudio Server 1.1.383.

MarcusKlik commented 6 years ago

Hi @renkun-ken, thanks for reporting that!

If I understand correctly you have a web interface to RStudio Server and the actual R session is running remotely on the server.

So what exactly happens when you refresh the web page: does the server start a completely new R session and kill the previous one?

The only way that the number of cores would be set to 1 would be if fst can't detect OpenMP when re-entering. That would be strange but can be tested with:

fst:::hasopenmp()  # TRUE if OpenMP detected
#> [1] TRUE

Would you be so kind as to test that? The other reason for the number of threads to be set to 1 is that fst thinks it's in a forked session. The logic used there is comparable to that used in the data.table package. Would it be possible to test whether data.table has the same problem, using:

data.table::getDTthreads()
#> [1] 8

Thanks!
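For readers unfamiliar with what a "forked session" means here, the following is a minimal base R sketch showing that a fork produces a child process with its own process ID. It does not reproduce the detection logic inside fst or data.table, which is not shown in this thread; it only illustrates the kind of event those packages react to. The variable names are illustrative.

```r
# Illustrative only: a fork creates a child process with a different PID.
# parallel::mcparallel() forks on Unix-alikes; forking is not available on Windows.
parent_pid <- Sys.getpid()

if (.Platform$OS.type == "unix") {
  job <- parallel::mcparallel(Sys.getpid())     # expression runs in a forked child
  child_pid <- parallel::mccollect(job)[[1]]    # collect the child's result
} else {
  child_pid <- NA_integer_                      # no fork() on this platform
}

# TRUE when the expression really ran in a separate (forked) process
ran_in_fork <- !is.na(child_pid) && child_pid != parent_pid
cat("parent:", parent_pid, " child:", child_pid, " forked:", ran_in_fork, "\n")
```

Note that parallel::mclapply() with mc.cores = 1 does not fork at all, which is why the sketch uses mcparallel() and mccollect() instead.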

renkun-ken commented 6 years ago

I ran some tests with both fst::threads_fst() and data.table::getDTthreads(), and it seems that the R session re-entered through RStudio Server may be a forked one. Here's my test code:

while (TRUE) {
  cat("[", format(Sys.time()), "] fst::threads_fst() = ", fst::threads_fst(), 
    ", data.table::getDTthreads() = ", data.table::getDTthreads(), "\n", sep = "")
  Sys.sleep(1)
}

At 21:58:00 I closed the webpage. A while later I re-entered the session and saw the following log:


[2017-12-08 21:58:00] fst::threads_fst() = 40, data.table::getDTthreads() = 40
[2017-12-08 21:58:01] fst::threads_fst() = 40, data.table::getDTthreads() = 40
[2017-12-08 21:58:02] fst::threads_fst() = 40, data.table::getDTthreads() = 40
[2017-12-08 21:58:03] fst::threads_fst() = 40, data.table::getDTthreads() = 40
[2017-12-08 21:58:04] fst::threads_fst() = 40, data.table::getDTthreads() = 40
[2017-12-08 21:58:05] fst::threads_fst() = 40, data.table::getDTthreads() = 40
[2017-12-08 21:58:06] fst::threads_fst() = 40, data.table::getDTthreads() = 40
[2017-12-08 21:58:07] fst::threads_fst() = 40, data.table::getDTthreads() = 40
[2017-12-08 21:58:08] fst::threads_fst() = 40, data.table::getDTthreads() = 40
[2017-12-08 21:58:09] fst::threads_fst() = 40, data.table::getDTthreads() = 40
[2017-12-08 21:58:10] fst::threads_fst() = 40, data.table::getDTthreads() = 40
[2017-12-08 21:58:11] fst::threads_fst() = 40, data.table::getDTthreads() = 40
[2017-12-08 21:58:12] fst::threads_fst() = 40, data.table::getDTthreads() = 40
[2017-12-08 21:58:13] fst::threads_fst() = 40, data.table::getDTthreads() = 40
[2017-12-08 21:58:14] fst::threads_fst() = 40, data.table::getDTthreads() = 40
[2017-12-08 21:58:15] fst::threads_fst() = 40, data.table::getDTthreads() = 40
[2017-12-08 21:58:16] fst::threads_fst() = 40, data.table::getDTthreads() = 40
[2017-12-08 21:58:17] fst::threads_fst() = 40, data.table::getDTthreads() = 40
[2017-12-08 21:58:18] fst::threads_fst() = 40, data.table::getDTthreads() = 40
[2017-12-08 21:58:19] fst::threads_fst() = 40, data.table::getDTthreads() = 40
[2017-12-08 21:58:20] fst::threads_fst() = 40, data.table::getDTthreads() = 40
[2017-12-08 21:58:21] fst::threads_fst() = 40, data.table::getDTthreads() = 40
[2017-12-08 21:58:22] fst::threads_fst() = 40, data.table::getDTthreads() = 40
[2017-12-08 21:58:23] fst::threads_fst() = 1, data.table::getDTthreads() = 1
[2017-12-08 21:58:24] fst::threads_fst() = 1, data.table::getDTthreads() = 1
[2017-12-08 21:58:25] fst::threads_fst() = 1, data.table::getDTthreads() = 1
[2017-12-08 21:58:26] fst::threads_fst() = 1, data.table::getDTthreads() = 1
[2017-12-08 21:58:27] fst::threads_fst() = 1, data.table::getDTthreads() = 1
[2017-12-08 21:58:28] fst::threads_fst() = 1, data.table::getDTthreads() = 1
[2017-12-08 21:58:29] fst::threads_fst() = 1, data.table::getDTthreads() = 1
[2017-12-08 21:58:30] fst::threads_fst() = 1, data.table::getDTthreads() = 1
[2017-12-08 21:58:31] fst::threads_fst() = 1, data.table::getDTthreads() = 1

It's quite clear that the R session was not suspended. However, the moment I re-entered the session at 21:58:23, I may have entered a forked session, which caused the number of threads to drop to 1.

I'm not sure why it behaves this way. Maybe it's not an issue with fst or data.table, but this behavior surely makes it less predictable to use RStudio Server with fork-detecting packages. I'll consider raising issues on both data.table and RStudio.
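A variation on the polling loop above that records only the moments at which the reported thread count changes could look like the sketch below. The helper watch_threads() and its parameters are hypothetical and not part of fst or data.table. The getter is passed in as a function, so the example can be run with a simulated getter that mimics the drop from 40 to 1.

```r
# Hypothetical helper (not part of fst or data.table): poll a thread-count
# getter and keep only the observations where the reported value changes.
watch_threads <- function(get_threads, n_polls = 10, interval = 1) {
  times  <- character(0)
  counts <- integer(0)
  last   <- NA_integer_
  for (i in seq_len(n_polls)) {
    current <- as.integer(get_threads())
    if (!identical(current, last)) {
      times  <- c(times, format(Sys.time()))   # timestamp of the change
      counts <- c(counts, current)
      last   <- current
    }
    if (i < n_polls) Sys.sleep(interval)
  }
  data.frame(time = times, threads = counts, stringsAsFactors = FALSE)
}

# Simulated getter: reports 40 for three polls, then 1 (as in the log above).
fake_getter <- local({
  i <- 0
  function() {
    i <<- i + 1
    if (i <= 3) 40L else 1L
  }
})

changes <- watch_threads(fake_getter, n_polls = 6, interval = 0)
print(changes$threads)
```

In a real session the getter would be fst::threads_fst or data.table::getDTthreads, for example watch_threads(fst::threads_fst, n_polls = 60).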

MarcusKlik commented 6 years ago

Hi @renkun-ken, that's a smart way of testing that, nice work!

In data.table's code, there is an explanation of why OpenMP code should not switch back to multi-threaded mode after parallel's fork has completed (doing so causes problems with the Intel compiler), so it is left to the user to switch to more threads again. I followed that advice for fst, so we can't really determine from your experiment whether the fork was very brief (perhaps only to facilitate re-entering) or persists after re-entering.

I could add some code to check that, or make it the user's choice to switch back to multi-threaded mode after the fork has ended, say:

fst::threads_fst(8, reset_after_fork = TRUE)
#> [1] 8

That would be an option at the user's own risk, however :-)

renkun-ken commented 6 years ago

Thanks for pointing to data.table's code and for the clarification. I'd prefer not to make it more complex. For now, I'll call threads_fst() before calling fst functions whenever I want multi-threading.
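The workaround described here, re-asserting the thread count right before each fst call, could be wrapped in a small helper along these lines. The function with_fst_threads() is hypothetical and not part of fst. It assumes only that fst::threads_fst(n) sets the thread count, as used elsewhere in this thread. The setter is a parameter so the sketch can be exercised without fst installed.

```r
# Hypothetical wrapper (not part of fst): set the thread count, then run a call.
# The default `set_threads = fst::threads_fst` is only evaluated if it is used.
with_fst_threads <- function(n, expr, set_threads = fst::threads_fst) {
  set_threads(n)  # re-assert the desired thread count first
  expr            # lazily evaluated here, i.e. after the threads are set
}

# Demonstration with a stand-in setter that records what it was asked to set.
requested <- numeric(0)
record_threads <- function(n) requested <<- c(requested, n)

result <- with_fst_threads(4, length(requested), set_threads = record_threads)
print(result)     # the expression ran after the setter was called
print(requested)
```

With fst installed, a call might look like with_fst_threads(40, fst::write_fst(df, "myfile.fst")), where df is whatever data frame is being written.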

renkun-ken commented 6 years ago

After some intensive use, I'd prefer adding a threads= argument to both read_fst and write_fst, because it's too easy for the thread count to fall back to 1 when using RStudio Server or calling any mclapply. @MarcusKlik what do you think?

MarcusKlik commented 6 years ago

Hi @renkun-ken, thanks, yes, that would be better than setting the thread count with fst::threads_fst every time before you call fst::write_fst, especially because fst also switches back to single-threaded mode after some other code or package produces a fork (the user might not even notice, as with the RStudio Server setup).

Judging from the data.table issues, we have to switch back to prevent problems in some cases. Perhaps a dual option would be most useful, so when the user does:

# set number of threads to 10
fst::write_fst(dt, "myfile.fst", threads = 10)

that number of threads is used regardless of any other setting. And with:

fst::write_fst(dt, "myfile.fst")

the default thread behavior is used. That default can be set with:

fst::threads_fst(8, single_threaded_on_fork = TRUE, reset_after_fork = FALSE)

That specifies the threading during and after a fork. Would that be a good option?

Thanks

renkun-ken commented 6 years ago

@MarcusKlik, it is definitely a good option. Thanks!

MarcusKlik commented 6 years ago

Hi @renkun-ken, with the latest dev version, the default behavior of fst after a fork can now be set with parameter reset_after_fork in threads_fst(). When reset_after_fork = TRUE, the number of threads will be restored to the number of active threads before the fork.

On the data.table repository, some problems have been reported with the Intel compiler when threads are restored after a fork. For those cases, reset_after_fork = FALSE can be used or the fst_restore_after_fork option can be set to FALSE.
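The effect of reset_after_fork described above can be summarized with a toy model. This is not fst's implementation; it only encodes the behavior stated in this thread: single-threaded while forked, and afterwards either the previous thread count (TRUE) or still 1 (FALSE). The function simulate_fork() is hypothetical.

```r
# Toy model (not fst's implementation) of the thread count around a fork.
simulate_fork <- function(threads_before, reset_after_fork) {
  during <- 1L  # single-threaded while the fork is active
  after  <- if (reset_after_fork) threads_before else 1L
  c(before = threads_before, during = during, after = after)
}

restored <- simulate_fork(40L, reset_after_fork = TRUE)
kept_low <- simulate_fork(40L, reset_after_fork = FALSE)
print(restored)
print(kept_low)
```

In the FALSE case, the thread count stays at 1 until the user raises it again by calling threads_fst() with the desired number.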

I'm very interested to see if this solves your issues with RStudio Server as well!

Thanks

MarcusKlik commented 6 years ago

Hi @renkun-ken, I believe we can close this issue: the default behavior of fst is now to restore the number of threads to the original setting after a fork has ended.

Please let me know if re-entering an RStudio session still disables multi-threading and I'll re-open.

Thanks for testing and submitting the issue to RStudio!