HenrikBengtsson / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org

plan(multisession/cluster) hangs indefinitely #728

Closed tmspvn closed 2 days ago

tmspvn commented 1 week ago

Calling plan() with multisession or cluster hangs without returning an error. Sequential works. This started happening after I updated my packages, so I did a fresh install of R, RStudio and all the packages, but that didn't solve the problem. I'm using Ubuntu Jammy.

A reproducible example using R code.

library(future)
library(doFuture)
plan(cluster) # or multisession
results <- foreach(i = 1:10, .combine = c) %dofuture% { i^2}
results

Session information

R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=fr_CH.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=fr_CH.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=fr_CH.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=fr_CH.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Zurich
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] doFuture_1.0.1 foreach_1.5.2  future_1.33.2 

loaded via a namespace (and not attached):
 [1] compiler_4.4.1      parallelly_1.37.1   parallel_4.4.1      tools_4.4.1         rstudioapi_0.16.0   future.apply_1.11.2 listenv_0.9.1      
 [8] codetools_0.2-20    iterators_1.0.14    digest_0.6.35       globals_0.16.3    
…

##########################################################################################
*** Package versions
future 1.33.2, parallelly 1.37.1, parallel 4.4.1, globals 0.16.3, listenv 0.9.1

*** Allocations
availableCores():
system  nproc 
    20     20 
availableWorkers():
$nproc
 [1] "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost"
[13] "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost"

$system
 [1] "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost"
[13] "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost"

*** Settings
- future.plan=<not set>
- future.fork.multithreading.enable=<not set>
- future.globals.maxSize=<not set>
- future.globals.onReference=<not set>
- future.resolve.recursive=<not set>
- future.rng.onMisuse=<not set>
- future.wait.timeout=<not set>
- future.wait.interval=<not set>
- future.wait.alpha=<not set>
- future.startup.script=<not set>

*** Backends
Number of workers: 1
List of future strategies:
1. sequential:
   - args: function (..., envir = parent.frame(), workers = "<NULL>")
   - tweaked: FALSE
   - call: plan(sequential)

*** Basic tests
Main R session details:
     pid     r sysname           release                                     version nodename machine   login    user effective_user
1 123813 4.4.1   Linux 5.15.0-89-generic #99-Ubuntu SMP Mon Oct 30 20:42:41 UTC 2023  host001  x86_64 user001 user001        user001
Worker R session details:
  worker    pid     r sysname           release                                     version nodename machine   login    user effective_user
1      1 123813 4.4.1   Linux 5.15.0-89-generic #99-Ubuntu SMP Mon Oct 30 20:42:41 UTC 2023  host001  x86_64 user001 user001        user001
Number of unique worker PIDs: 1 (as expected)

…

HenrikBengtsson commented 1 week ago

This is weird. I don't think I've heard of such a problem ever before.

Some quick troubleshooting suggestions:

  1. Does cl <- parallel::makeCluster(2); print(cl) work?
  2. Does cl <- parallelly::makeClusterPSOCK(2); print(cl) work?

The latter is what's used under the hood by multisession and cluster.

Also, make sure to try outside of RStudio, i.e. by running R from the terminal. That could help narrow in on the problem.

FWIW, I'm also on Ubuntu 22.04 running R 4.4.1 with the same locale, and this works just fine for me.

tmspvn commented 1 week ago

Hi,

thanks for the fast answer.

I tried both calls, running them outside RStudio, and they hang as well. The second call returned this after a while:

Error in parallelly::makeClusterPSOCK(2) : 
  Cluster setup failed (connectTimeout=125.0 seconds). 2 of 2 workers failed to connect.

Edit: reinstalling parallelly doesn't help.

HenrikBengtsson commented 1 week ago

I suggest you focus on

cl <- parallel::makeCluster(2)
print(cl)

in a vanilla R session. This is a problem unrelated to any R packages you've installed and there's nothing that the futureverse can fix.

I recommend reinstalling R, trying another R version, etc. Then reach out to the R-help mailing list for help.

tmspvn commented 1 week ago

Hi,

Indeed, it's unrelated. I wasn't able to fix it, but I figured out that when I disable the wired network connection (Ethernet), R is able to connect to the workers and everything works fine. When the connection is active, it fails.

Literally, plan(cluster) returns the instant I disable the connection.

Do you have any clue how to solve it or where to look? I will post on R-help as soon as my email gets approved.

HenrikBengtsson commented 1 week ago

What does:

> cl <- parallelly::makeClusterPSOCK(1, dryrun = TRUE)

output with and without the Ethernet cable connected?

tmspvn commented 1 week ago

R --vanilla, connection OFF:

'/usr/lib/R/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'options(socketOptions = "no-delay")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=localhost PORT=11922 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential

R --vanilla, connection ON:

'/usr/lib/R/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'options(socketOptions = "no-delay")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=localhost PORT=11424 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential

Only the port differs, but the port changes every time I run the code anyway.

HenrikBengtsson commented 1 week ago

Good. That tells me the worker launch command is the same in both cases. Next, try:

cl <- parallelly::makeClusterPSOCK(1, master = "127.0.0.1")
print(cl)

With some luck, that one won't stall.
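
To compare the two hostnames without waiting for the default two-minute connection timeout, a small helper along the following lines may be useful. This is only a sketch: try_local_worker() is a hypothetical name introduced here, the 10-second timeout is an arbitrary choice, and it assumes the parallelly package is installed.

```r
# Hypothetical helper (not part of parallelly): try to launch a single local
# PSOCK worker and report whether it managed to connect back.
try_local_worker <- function(master = "localhost", timeout = 10) {
  tryCatch({
    # connectTimeout (in seconds) limits how long the setup waits for the worker
    cl <- parallelly::makeClusterPSOCK(1, master = master,
                                       connectTimeout = timeout)
    parallel::stopCluster(cl)
    TRUE
  }, error = function(e) {
    message("Worker failed to connect: ", conditionMessage(e))
    FALSE
  })
}
```

Calling try_local_worker("localhost") and then try_local_worker("127.0.0.1") should show whether only the hostname lookup is at fault.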

tmspvn commented 1 week ago

It returns! Thanks a lot!

Is there a way to prevent this? Should I pass master = "127.0.0.1" to plan?

HenrikBengtsson commented 1 week ago

Great. That suggests that your Linux system resolves the hostname localhost differently depending on the network state. What does:

$ grep -vE "^#" /etc/resolv.conf 
$ nslookup localhost

output with and without the Ethernet cable?

... should I pass master = "127.0.0.1" to plan?

I don't think that'll work, but I think the following will do the trick:

options(parallelly.localhost.hostname = "127.0.0.1")

However, that will only solve it for this case; you'll run into other problems like not being able to launch things like Shiny because your Ubuntu setup is somehow messed up.
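
As a rough illustration of that workaround, the option can be set before the plan is created. This is a sketch, not a verified fix for this system: the worker count of 2 is arbitrary, and the requireNamespace() guard is only there so the snippet does nothing where future is not installed.

```r
# Workaround sketch: ask parallelly to use the loopback IP address instead of
# the hostname "localhost" when it launches local workers.
options(parallelly.localhost.hostname = "127.0.0.1")

if (requireNamespace("future", quietly = TRUE)) {
  library(future)
  plan(multisession, workers = 2)  # arbitrary worker count for the example
  f <- future(Sys.getpid())        # evaluated in a background R session
  print(value(f))                  # PID of the worker that ran the future
  plan(sequential)                 # shut the workers down again
}
```

To apply the option in every session, the options() call can go in ~/.Rprofile. It only works around the hostname lookup; it does not repair the system configuration.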

tmspvn commented 1 week ago

With the Ethernet ON (I need to censor the address because I am on a sensitive network, which could be the cause of this):

$ grep -vE "^#" /etc/resolv.conf 
nameserver 127.0.0.53
options edns0 trust-ad
search intranet.xxxx
$ nslookup localhost
Server:     127.0.0.53
Address:    127.0.0.53#53

Non-authoritative answer:
Name:   localhost.intranet.xxxx
Address: xx.xxx.xxx.xxx

With the Ethernet OFF:

$ grep -vE "^#" /etc/resolv.conf 

nameserver 127.0.0.53
options edns0 trust-ad
search .
$ nslookup localhost
Server:     127.0.0.53
Address:    127.0.0.53#53

Name:   localhost
Address: 127.0.0.1
Name:   localhost
Address: ::1

HenrikBengtsson commented 1 week ago

What does:

$ cat /etc/hosts

show in the two cases?

tmspvn commented 1 week ago

in both cases it returns nothing

HenrikBengtsson commented 1 week ago

in both cases it returns nothing

Ah. At a minimum, I'd expect it to have the following near the top:

127.0.0.1   localhost

That tells the computer that the hostname localhost should map to the IP address 127.0.0.1, which is a well-defined standard. Without it, the lookup falls back to the search rule in /etc/resolv.conf, which is what appends intranet.xxxx, which is not found (at all).

Do you have admin rights? If so, call:

$ printf "127.0.0.1\tlocalhost\n" | sudo tee -a /etc/hosts

If you have a local sysadm, you should bring this up with them, because it looks like a misconfiguration.
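
The same check can be done from within R. The sketch below uses only base R; has_localhost_entry() is a hypothetical helper written for this example, and it only looks for the IPv4 mapping discussed above.

```r
# Hypothetical helper: does a hosts file (given as a character vector of
# lines) map the name "localhost" to 127.0.0.1?
has_localhost_entry <- function(lines) {
  lines <- sub("#.*$", "", lines)                     # strip comments
  fields <- strsplit(trimws(lines), "[[:space:]]+")   # split into IP + names
  any(vapply(fields, function(x) {
    length(x) >= 2 && x[1] == "127.0.0.1" && "localhost" %in% x[-1]
  }, logical(1)))
}

# Example with in-memory lines instead of the real file
has_localhost_entry(c("# static table", "127.0.0.1\tlocalhost"))
```

On the affected machine, has_localhost_entry(readLines("/etc/hosts")) would presumably return FALSE until the entry is added. On Unix-alikes, utils::nsl("localhost") can additionally show what the name currently resolves to.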

tmspvn commented 2 days ago

If you have a local sysadm, you should bring this up with them, because it looks like a misconfiguration.

This is the case for me. I will report it to them, thanks a lot for the help!