JuliaLang / Distributed.jl

Create and control multiple Julia processes remotely for distributed computing. Ships as a Julia stdlib.
https://docs.julialang.org/en/v1/stdlib/Distributed/
MIT License
23 stars, 9 forks

intermittent warnings of forcibly interrupting busy workers #70

Open trathi05 opened 3 years ago

trathi05 commented 3 years ago

My code adds processes with addprocs and then parallelizes work with pmap. It sometimes terminates with the following warning. The warning doesn't affect the output of the code in any way, but it shows up at the end, especially with scripts that run for a significant amount of time (over an hour at least).

┌ Warning: Forcibly interrupting busy workers
│   exception = rmprocs: pids [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24] not terminated after 5.0 seconds.
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1234
┌ Warning: rmprocs: process 1 not removed
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1030

I am not sure whether this is a machine-related issue or has something to do with the Distributed package.

tlarock commented 2 years ago

I am getting similar behavior to @trathi05 with some code using Distributed. I call Julia with julia -p 4 --project=. my_script.jl. Some functions run independently in parallel using pmap with 4 workers, and everything behaves as I expect (including the correct outputs). But when the script terminates, I get the following warnings:

┌ Warning: Forcibly interrupting busy workers
│   exception = rmprocs: pids [3] not terminated after 5.0 seconds.
└ @ Distributed ~/julia/usr/share/julia/stdlib/v1.9/Distributed/src/cluster.jl:1253
┌ Warning: rmprocs: process 1 not removed
└ @ Distributed ~/julia/usr/share/julia/stdlib/v1.9/Distributed/src/cluster.jl:1049

One guess (without real evidence) is that once all of my tasks have been assigned, at least one of the workers that is no longer needed is still active for some reason while the other workers finish their tasks, and so it needs to be forcibly terminated because it is "hanging" (for lack of a more precise term/understanding of what might be happening).

Another guess is that process 1 refers to the host process, and for some reason it is not shutting down properly. I'm not sure if that is even possible, since I assume the host process is the one sending the warnings. Potentially relevant here could be that I am calling pmap from inside a function that is defined in the script.

Since it doesn't seem to be causing a problem with my code execution or performance, it is not a big deal at all. However, it is a concerning-looking warning nonetheless, especially if others use my code down the line.

I am running Julia 1.9.0-DEV (2022-05-04, Commit 862018b20d) on macOS Monterey with Apple Silicon.
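
For anyone trying to reproduce or narrow this down, the pattern described in both reports (add workers, call pmap from inside a function defined in the script, then let the script exit) can be sketched roughly as below. The workload slow_square and the wrapper run_all are hypothetical names, and the explicit rmprocs call with a longer waitfor than the 5.0 seconds shown in the warning is only an assumption about what might be worth trying. It is not a confirmed fix.

```julia
using Distributed

# Start 4 local workers (roughly what `julia -p 4` does); skip if some already exist.
nprocs() == 1 && addprocs(4)

# Hypothetical workload, defined on every process so pmap can call it remotely.
@everywhere slow_square(x) = (sleep(0.1); x^2)

# pmap is called from inside a function defined in the script, as in the second report.
function run_all(inputs)
    return pmap(slow_square, inputs)
end

results = run_all(1:8)

# Assumption: remove the workers explicitly before the script ends, allowing more
# time than the 5.0 seconds mentioned in the warning. Untested as a remedy.
pids = workers()
rmprocs(pids; waitfor=60.0)
```

If the warning still appears with an explicit rmprocs like this, reporting which worker pids remain would help narrow down where the shutdown stalls.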