HenrikBengtsson / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org

`summarize_size_of_globals` can hang or be very slow with large objects #660

Closed: paciorek closed this issue 1 year ago

paciorek commented 1 year ago

This surfaced with use of future_sapply but the issue appears to be in the behavior of creating messages in future:::summarize_size_of_globals, so I am reporting here.

The following either hangs or runs very slowly:

library(future.apply)
nCores <- 3
plan(multicore, workers = nCores)   # same behavior for `multisession`
options(future.globals.maxSize = 1e9)
x <- rnorm(5e7)    # 400 MB object
future_sapply(seq_len(100), function(i, y) mean(y), x)

On a small-memory machine, I see memory increasing over time and I get an OOM. On a large-memory machine, it finishes eventually (it takes about 30x as long), but surprisingly I don't see the increasing memory use. I'm not sure why, but I don't think it's important to the main point here.

The root cause appears to be the creation of msg in line 438 of globals.R. Evaluating sQuote(hexpr(exprOrg)) converts a very large call object (one that contains the values in x) to a string. Side note: msg is not even used unless the debug flag is set.
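To illustrate why converting such a call to a string is expensive, here is a minimal sketch in base R. It does not use future's internals. The function f and the way the call is built with as.call() are assumptions for illustration only; the point is that a call carrying the values of x deparses to text proportional to the size of x, whereas a call that refers to x by name does not.

```r
# A small stand-in for the 400 MB vector in the report
x <- rnorm(1e4)

# Hypothetical function, for illustration only
f <- function(y) mean(y)

# Build a call that embeds the *values* of x rather than the symbol 'x'
big_call <- as.call(list(quote(f), x))

# Deparsing has to write out every one of the 1e4 numbers
n_chars <- sum(nchar(deparse(big_call)))

# A call that refers to x by name stays tiny regardless of the size of x
small_call <- quote(f(x))
n_chars_small <- sum(nchar(deparse(small_call)))

cat("embedded values:", n_chars, "characters\n")
cat("symbol only:    ", n_chars_small, "characters\n")
```

With 5e7 elements instead of 1e4, the deparsed string grows by the same factor, which would be consistent with the slowdown and memory growth described above.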

This behavior seems undesirable, although arguably it is not strictly a bug.

Side note: Using x as an explicit global variable does not trigger the problem, i.e., this behaves fine:

future_sapply(seq_len(100), function(i) mean(x))

Here's the session info. I'll note that with older versions of future (e.g. 1.16.0), the issue does not arise because the messaging is handled somewhat differently.

> sessionInfo()
R version 4.2.2 Patched (2022-11-10 r83330)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] future.apply_1.10.0 future_1.29.0       SCF_4.1.0          

loaded via a namespace (and not attached):
[1] compiler_4.2.2    parallelly_1.32.1 tools_4.2.2       parallel_4.2.2   
[5] listenv_0.8.0     codetools_0.2-18  digest_0.6.30     globals_0.16.1   

>  future::futureSessionInfo()
*** Package versions
future 1.29.0, parallelly 1.32.1, parallel 4.2.2, globals 0.16.1, listenv 0.8.0

*** Allocations
availableCores():
        system cgroups.cpuset          nproc 
             4              4              4 
availableWorkers():
$system
[1] "localhost" "localhost" "localhost" "localhost"

*** Settings
- future.plan=<not set>
- future.fork.multithreading.enable=<not set>
- future.globals.maxSize=1e+09
- future.globals.onReference=<not set>
- future.resolve.recursive=<not set>
- future.rng.onMisuse=<not set>
- future.wait.timeout=<not set>
- future.wait.interval=<not set>
- future.wait.alpha=<not set>
- future.startup.script=<not set>

*** Backends
Number of workers: 3
List of future strategies:
1. multicore:
   - args: function (..., workers = 3, envir = parent.frame())
   - tweaked: TRUE
   - call: plan(multicore, workers = nCores)

*** Basic tests
Main R session details:
     pid     r sysname           release
1 238338 4.2.2   Linux 5.15.0-52-generic
                                              version nodename machine   login
1 #58~20.04.1-Ubuntu SMP Thu Oct 13 13:09:46 UTC 2022  host001  x86_64 user001
     user effective_user
1 user001        user001
Worker R session details:
  worker    pid     r sysname           release
1      1 239193 4.2.2   Linux 5.15.0-52-generic
2      2 239194 4.2.2   Linux 5.15.0-52-generic
3      3 239195 4.2.2   Linux 5.15.0-52-generic
                                              version nodename machine   login
1 #58~20.04.1-Ubuntu SMP Thu Oct 13 13:09:46 UTC 2022  host001  x86_64 user001
2 #58~20.04.1-Ubuntu SMP Thu Oct 13 13:09:46 UTC 2022  host001  x86_64 user001
3 #58~20.04.1-Ubuntu SMP Thu Oct 13 13:09:46 UTC 2022  host001  x86_64 user001
     user effective_user
1 user001        user001
2 user001        user001
3 user001        user001
Number of unique worker PIDs: 3 (as expected)

EDIT: I've added a link to the code. /HB 2023-01-21

HenrikBengtsson commented 1 year ago

Thanks for reporting and for the detailed troubleshooting. Yes, I can see how this can happen. I've now updated the internal hexpr() to limit the deparsing to the first 100 lines, cf. commit daa63f26. That should avoid this problem.
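As a rough sketch of what limiting the deparsing to a fixed number of lines can look like, the following uses the nlines argument of base R's deparse(). The helper hexpr_limited() is hypothetical and is not the code from the commit; it only shows the general idea of capping the output before it is pasted into a message.

```r
# Hypothetical helper; NOT the actual implementation of future's internal hexpr()
hexpr_limited <- function(expr, max_lines = 100L) {
  # 'nlines' caps the number of lines that deparse() returns
  txt <- deparse(expr, nlines = max_lines)
  paste(txt, collapse = "\n")
}

# A call that embeds a moderately large vector by value
x <- rnorm(1e4)
big_call <- as.call(list(quote(mean), x))

# Full deparse versus capped deparse
full_chars <- sum(nchar(deparse(big_call)))
out <- hexpr_limited(big_call)
n_lines <- length(strsplit(out, "\n", fixed = TRUE)[[1]])

cat("full deparse:  ", full_chars, "characters\n")
cat("capped deparse:", nchar(out), "characters in", n_lines, "lines\n")
```

The capped string is bounded by max_lines times the line width, so its size no longer depends on how large the embedded object is.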

HenrikBengtsson commented 1 year ago

future 1.31.0, which fixes this, is now on CRAN.