HenrikBengtsson / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org

cannot wait for child xxx as it does not exist #218

Closed brainprint closed 5 years ago

brainprint commented 6 years ago

Hi,

First of all, thank you for the package.

After updating to R 3.5.0, the same code, on the same machine (Mac, macOS Sierra), with no other changes, started producing warnings like: cannot wait for child 15641 as it does not exist. The code is a Monte Carlo simulation in which I calculate the net present value and internal rate of return for each cash flow.

The number of warnings is close to 50, whether I run 100 or 10k simulations.

Best Regards, Rogério Normand.

PS: As the original code contains classified information, I altered the values/metrics/results, but the logic remains intact. Please, keep to code confidential.

HenrikBengtsson commented 6 years ago

Please, keep to code confidential.

FYI, this is a public GitHub forum; I've removed your attached R code for you.

brainprint commented 6 years ago

Thank you. I hope you can reproduce the issue. When I removed the %<-% assignments, the code ran without warnings.

HenrikBengtsson commented 6 years ago

Hi. Your code is very long; you need to come up with a much smaller (= minimal) reproducible example and explain what troubleshooting you've attempted to narrow this issue down. You're the first one to report this type of issue.

brainprint commented 6 years ago

Hi. I was afraid to change the code and lose the issue. I have tried with simpler code, but I was unable to reproduce it.

As I remove some parts with %<-% from the original code, the number of warnings decreases.

Warning messages:
1: In selectChildren(pids[!fin], -1) :
  cannot wait for child 1724 as it does not exist
2: In selectChildren(pids[!fin], -1) :
  cannot wait for child 1724 as it does not exist

Please, don't worry about my case, because it seems to be running correctly despite the warnings.

I just reported it because the only change was the new R version, as of yesterday.

Thank you for your support.

HenrikBengtsson commented 6 years ago

I see. Thanks for clarifying that those warnings seem harmless. I'll keep the issue open for a while in case other macOS users start to see these as well.

rps13 commented 6 years ago

I get the same warnings on macOS 10.13.4 and R 3.5.0 using the following test code:

library(future)
plan(multiprocess)

testList <- vector(mode = "list", length = 10)
for (i in seq_along(testList)) {
  testList[[i]] <- future({i * 4})
}

testList <- resolve(testList)
testList <- values(testList)

Once the loop completes there are 50 warnings. I also get the same warnings when using resolve and values in this case. However, when using plan(multisession) there are no errors so there may be something related to forking and multicore. Despite the warnings, the result is the same as when running the same loop single-threaded.

HenrikBengtsson commented 6 years ago

I don't have access to macOS, so I need your help to troubleshoot. From the warning details by @brainprint, the warning appears to come from the parallel package, so I believe this is independent of the future package. If you run the following in a fresh R session:

jobs <- lapply(1:10, FUN = parallel::mcparallel)
values <- parallel::mccollect(jobs)
unlist(values)

I'd expect that you'd also get those warnings - is that the case?

HenrikBengtsson commented 6 years ago

Quick comment: the warnings on "cannot wait for child %d as it does not exist" were indeed only introduced in R (>= 3.5.0), cf. https://github.com/wch/r-source/commit/eb468006b82d96917db88e2310286b54a27b47b7#diff-227a0fc52be87760fb0ed6bdc16527f4R781

rps13 commented 6 years ago

Interestingly the little test you posted above produces no errors or warnings when run (see attached output from R CMD BATCH). future_test.txt

EDIT: Including output here /HB:

R version 3.5.0 (2018-04-23) -- "Joy in Playing"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin17.5.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> jobs <- lapply(1:10, FUN = parallel::mcparallel)
> values <- parallel::mccollect(jobs)
> unlist(values)
3805 3806 3807 3808 3809 3810 3811 3812 3813 3814 
   1    2    3    4    5    6    7    8    9   10 
> warnings()
> 
> proc.time()
   user  system elapsed 
  0.322   0.073   0.340 

HenrikBengtsson commented 6 years ago

Thanks. It could be that it needs to be hit harder with more tasks, or it could be something I do in the future package. I'll add it to the list of things to investigate.

brainprint commented 6 years ago

Same here, no warnings.

12474 12475 12476 12477 12478 12479 12480 12481 12482 12483 
    1     2     3     4     5     6     7     8     9    10 

brainprint commented 6 years ago

And from @rps13's example, the summary is:

summary(warnings())
Summary of (a total of 50) warning messages:
1x : In selectChildren(job, timeout = timeout) : cannot wait for child 12555 as it does not exist
2x : In selectChildren(job, timeout = timeout) : cannot wait for child 12554 as it does not exist
3x : In selectChildren(job, timeout = timeout) : cannot wait for child 12553 as it does not exist
3x : In selectChildren(pids[!fin], -1) : cannot wait for child 12555 as it does not exist
3x : In selectChildren(pids[!fin], -1) : cannot wait for child 12554 as it does not exist
3x : In selectChildren(pids[!fin], -1) : cannot wait for child 12553 as it does not exist
2x : In selectChildren(job, timeout = timeout) : cannot wait for child 12556 as it does not exist
1x : In selectChildren(pids[!fin], -1) : cannot wait for child 12556 as it does not exist
2x : In selectChildren(job, timeout = timeout) : cannot wait for child 12557 as it does not exist
4x : In selectChildren(pids[!fin], -1) : cannot wait for child 12557 as it does not exist
6x : In selectChildren(job, timeout = timeout) : cannot wait for child 12558 as it does not exist
6x : In selectChildren(pids[!fin], -1) : cannot wait for child 12558 as it does not exist
6x : In selectChildren(job, timeout = timeout) : cannot wait for child 12559 as it does not exist
4x : In selectChildren(pids[!fin], -1) : cannot wait for child 12559 as it does not exist
3x : In selectChildren(job, timeout = timeout) : cannot wait for child 12560 as it does not exist
1x : In selectChildren(pids[!fin], -1) : cannot wait for child 12560 as it does not exist

HenrikBengtsson commented 6 years ago

Thxs.

Ah... not just macOS; by coincidence I just stumbled upon this on a Linux cluster. Here's a minimal example that I can work with:

> library("future")
> plan(multicore, workers = 2L)
> fs <- lapply(1:2L, FUN = future)
> values(fs)
[[1]]
[1] 1

[[2]]
[1] 2

Warning messages:
1: In selectChildren(job, timeout = timeout) :   #<== produced by future::resolved.MulticoreFuture()
  cannot wait for child 362508 as it does not exist
2: In selectChildren(job, timeout = timeout) :   #<== produced by future::resolved.MulticoreFuture()
  cannot wait for child 362506 as it does not exist
3: In selectChildren(pids[!fin], -1) :   #<== produced by parallel::mccollect()
  cannot wait for child 362508 as it does not exist
4: In selectChildren(pids[!fin], -1) :   #<== produced by parallel::mccollect()
  cannot wait for child 362508 as it does not exist

This shouldn't happen, so I'll flag this as a bug (which has probably been there before but only reveals itself in R (>= 3.5.0)).

brainprint commented 6 years ago

@HenrikBengtsson using your example

Warning messages:
1: In selectChildren(job, timeout = timeout) :
  cannot wait for child 13206 as it does not exist
2: In selectChildren(job, timeout = timeout) :
  cannot wait for child 13205 as it does not exist
3: In selectChildren(pids[!fin], -1) :
  cannot wait for child 13206 as it does not exist
4: In selectChildren(pids[!fin], -1) :
  cannot wait for child 13206 as it does not exist

And I agree with your assessment ("been there before"). Question: is the origin R 3.5.0 or the future package?

HenrikBengtsson commented 6 years ago

I'm leaning toward 'future' now - the simplest explanation would be that the future framework polls the workers one time too many, even after the results have already been collected and the forked child process is gone. Just a guess for now - I'll try to find time to investigate and fix (or report upstream to R core if that's where the error is). As you've observed, these warnings are harmless - inspecting the R core code confirms that.
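The "one poll too many" guess above can be illustrated with a minimal sketch in plain R. It is only an illustration of the idea, not how the future package is implemented: make_collector() and fake_poll() are hypothetical names, and the fake worker stands in for a forked child so the example runs on any platform.

```r
# Hypothetical helper (not part of future or parallel): wraps a polling
# function so that it is never called again once a result has been obtained.
make_collector <- function(poll) {
  done <- FALSE
  result <- NULL
  function() {
    if (!done) {
      res <- poll()
      if (!is.null(res)) {  # non-NULL means the worker has delivered its value
        result <<- res
        done <<- TRUE
      }
    }
    result
  }
}

# Stand-in for a worker that delivers its value on the second poll and
# counts how often it has been polled.
n_polls <- 0L
fake_poll <- function() {
  n_polls <<- n_polls + 1L
  if (n_polls >= 2L) 42 else NULL
}

collect <- make_collector(fake_poll)
collect()  # not ready yet
collect()  # value collected
collect()  # served from the cache; fake_poll() is not called again
```

With a real forked job, the polling function could be something like function() parallel::mccollect(job, wait = FALSE), which returns NULL while no result is available; that wiring is an assumption and is not shown in this thread.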

rps13 commented 6 years ago

Thanks for looking into this. Judging by the comments in the source of parallel you linked to above, this may not remain a warning forever.

HenrikBengtsson commented 6 years ago

I'm suppressing these warnings for now since they are quite annoying, while acknowledging that the long-term solution is to fully understand what's going on so it can be fixed. I'm going to do a quick future 1.8.1 release, so the long-term fix will come in a later release.
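For readers who want to silence only this particular warning in their own code, here is a hedged sketch. It is not the code used in future 1.8.1; muffle_matching_warnings() is a hypothetical helper, and the pattern assumes the English wording of the message quoted in this thread.

```r
# Hypothetical helper: muffles only warnings whose message matches `pattern`
# and lets every other warning through.
muffle_matching_warnings <- function(expr, pattern) {
  withCallingHandlers(
    expr,
    warning = function(w) {
      if (grepl(pattern, conditionMessage(w))) {
        invokeRestart("muffleWarning")
      }
    }
  )
}

# Assumed English wording of the warning reported in this thread
pattern <- "cannot wait for child [0-9]+ as it does not exist"

# Simulated warning, so the example runs without forking
res <- muffle_matching_warnings({
  warning("cannot wait for child 1724 as it does not exist")
  "done"
}, pattern)
```

Compared with a blanket suppressWarnings(), this keeps unrelated warnings visible, at the cost of depending on the exact message text.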

nextpagesoft commented 6 years ago

I know it is a closed issue, but since I was investigating it and have some findings I would like to share them. Please check the following snippet:

# Define job factory
jobFactory <- function() {
  parallel::mcparallel({
    Sys.getpid()
  })
}

# Example 1: trigger warning

job1 <- jobFactory()
parallel::mccollect(job1, wait = FALSE)
# No warnings

job2 <- jobFactory()
parallel::mccollect(job2, wait = FALSE)
# Warning message:
#   In selectChildren(jobs, timeout) :
#   cannot wait for child [pid of job1] as it does not exist

# Restart R session
rstudioapi::restartSession()

# Example 2: no warning, manual kill of processes
job1 <- jobFactory()
parallel::mccollect(job1, wait = FALSE)
parallel:::rmChild(job1)
# No warnings

job2 <- jobFactory()
parallel::mccollect(job2, wait = FALSE)
parallel:::rmChild(job2)
# No warnings

# Restart R session
rstudioapi::restartSession()

# Example 3: no warnings, call mccollect twice

job1 <- jobFactory()
job2 <- jobFactory()

parallel::mccollect(wait = FALSE)
# $`23428`
# [1] 23428
# 
# $`23427`
# [1] 23427
parallel::mccollect(wait = FALSE)
# $`23428`
# NULL
# 
# $`23427`
# NULL

I think the warning in the title is triggered by parallel:::selectChildren called by mccollect. When mccollect is called as non-blocking (wait = FALSE), forked processes are killed only on the second call. I've run this on Ubuntu 18.04, R 3.5.1.

HenrikBengtsson commented 6 years ago

Thanks for this. I'm on a phone now so haven't tried to reproduce, but these are useful findings. So, it looks independent of the future package and specific to R and the parallel package. We should report upstream to get this fixed.

Importantly, can you reproduce this outside of RStudio in a fresh R terminal session?

If so, would you mind reporting this to the R-devel mailing list? Then the R core devels will see it.

HenrikBengtsson commented 6 years ago

FYI, I can reproduce this in a pure R session on Linux;

job1 <- parallel::mcparallel(Sys.getpid())
parallel::mccollect(job1, wait = FALSE)
### $`16223`
### [1] 16223

job2 <- parallel::mcparallel(Sys.getpid())

parallel::mccollect(job2, wait = FALSE)
### Warning in selectChildren(jobs, timeout) :
###   cannot wait for child 16223 as it does not exist
### $`16247`
### [1] 16247

And now, in front of a real screen (I was on my phone before), I see that the purpose of your comment might have been to suggest that we fix this in the future package by also calling parallel:::rmChild(). I can confirm that I also see this:

job1 <- parallel::mcparallel(Sys.getpid())
parallel::mccollect(job1, wait = FALSE)
### $`16441`
### [1] 16441

parallel:::rmChild(job1)
### [1] FALSE

job2 <- parallel::mcparallel(Sys.getpid())
parallel::mccollect(job2, wait = FALSE)
### $`16444`
### [1] 16444
parallel:::rmChild(job2)
### [1] TRUE

I'll try to add this ...

HenrikBengtsson commented 6 years ago

The following - "Fix uninitialized variable in a cleanup mark (parallel/fork)" - was just committed to R-devel /src/library/parallel/src/fork.c:

index 3fe779474d..d2c6788b0f 100644
--- a/src/library/parallel/src/fork.c
+++ b/src/library/parallel/src/fork.c
[...]
@@ -288,6 +288,8 @@ SEXP mc_prepare_cleanup()
     ci->waitedfor = 1;
     ci->detached = 1;
     ci->pid = -1; /* a cleanup mark */
+    ci->pfd = -1;
+    ci->sifd = -1; /* set fds to -1 to simplify close */
     ci->ppid = getpid();
     ci->next = children;
     children = ci;

Not sure, but it could be related to this issue.

HenrikBengtsson commented 5 years ago

UPDATE: It looks like the underlying issue has been fixed in R devel rev75467 - "Fix mc_select_children warning about non-existent children to wait for".

The problem is still there in R 3.5.1 patched:

$ R
R version 3.5.1 Patched (2018-10-20 r75479) -- "Feather Spray"
[...]
> job1 <- parallel::mcparallel(Sys.getpid())
> parallel::mccollect(job1, wait = FALSE)
$`287758`
[1] 287758

> job2 <- parallel::mcparallel(Sys.getpid())
> parallel::mccollect(job2, wait = FALSE)
$`288075`
[1] 288075

Warning message:
In selectChildren(jobs, timeout) :
  cannot wait for child 287758 as it does not exist

but is indeed fixed in R devel:

$ R
R Under development (unstable) (2018-10-21 r75476) -- "Unsuffered Consequences"
[...]

> job1 <- parallel::mcparallel(Sys.getpid())
> parallel::mccollect(job1, wait = FALSE)
$`289242`
[1] 289242

> job2 <- parallel::mcparallel(Sys.getpid()) 
> parallel::mccollect(job2, wait = FALSE)
NULL
## wait a bit longer ...
> parallel::mccollect(job2, wait = FALSE)
$`328590`
[1] 328590

It's only if we call it again after already having collected the value that we get the warning:

> parallel::mccollect(job2, wait = FALSE)
NULL
Warning message:
In selectChildren(jobs, timeout) :
  cannot wait for child 328590 as it does not exist

HenrikBengtsson commented 5 years ago

I can also confirm that future 1.8.0, the last version before the package started suppressing those warnings manually, produces the warnings when running in R 3.5.1 patched (and before):

> library(future); plan(multicore, workers = 2L); fs <- lapply(1:2, FUN = future); values(fs)
[[1]]
[1] 1

[[2]]
[1] 2

Warning messages:
1: In selectChildren(job, timeout = timeout) :
  cannot wait for child 375577 as it does not exist
2: In selectChildren(job, timeout = timeout) :
  cannot wait for child 375576 as it does not exist
3: In selectChildren(pids[!fin], -1) :
  cannot wait for child 375577 as it does not exist
4: In selectChildren(pids[!fin], -1) :
  cannot wait for child 375577 as it does not exist

but not when running R-devel ("3.6.0"), e.g.

> library(future); plan(multicore, workers = 2L); fs <- lapply(1:2, FUN = future); values(fs)
[[1]]
[1] 1

[[2]]
[1] 2

From this I conclude that, when running R (>= 3.6.0), we can drop the suppressWarnings() that was introduced in future 1.8.1.

HenrikBengtsson commented 5 years ago

This has now also been fixed in R 3.5.1 patched, which means the warnings will not appear in R 3.5.2 (if that is ever released). I can confirm that I don't see those warnings using R version 3.5.1 Patched (2018-11-06 r75555) and future 1.8.0.

I've updated the develop code to suppress warnings only when running R 3.5.0 and R 3.5.1. I ignore older versions of R 3.5.1 patched, so running the develop version of future there will produce those warnings.
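A rough sketch of what version-conditional suppression could look like follows. This is an assumption about the general approach, not the actual code in the develop branch; affected_r_version() and maybe_suppress() are hypothetical names. It only compares version numbers, so it cannot tell a 3.5.1 release from a 3.5.1 patched build.

```r
# Hypothetical predicate: TRUE only for the R releases reported in this thread
affected_r_version <- function(v = getRversion()) {
  v >= "3.5.0" && v <= "3.5.1"
}

# Hypothetical wrapper: suppress warnings only on affected R versions.
# `expr` is evaluated lazily, so it runs inside suppressWarnings() when needed.
maybe_suppress <- function(expr, v = getRversion()) {
  if (affected_r_version(v)) suppressWarnings(expr) else expr
}

# Usage with a simulated warning; on unaffected versions the warning is shown
value <- maybe_suppress({
  warning("cannot wait for child 1724 as it does not exist")
  1
}, v = package_version("3.5.1"))
```

Note that suppressWarnings() hides every warning raised by the expression, so in practice it should wrap as little code as possible.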