JuliaParallel / Dagger.jl

A framework for out-of-core and parallel execution

Adding processes and using eager API produces warnings about workers dying #536

Closed m-fila closed 3 weeks ago

m-fila commented 3 weeks ago

Adding extra processes and scheduling with the eager API seems to produce an error and warnings about rescheduling due to workers dying. For example, this snippet taken from the README:

using Distributed; addprocs() # Add one Julia worker per CPU core
using Dagger

# This runs first:
a = Dagger.@spawn rand(100, 100)

# These run in parallel:
b = Dagger.@spawn sum(a)
c = Dagger.@spawn prod(a)

# Finally, this runs:
wait(Dagger.@spawn println("b: ", b, ", c: ", c))

Gives the following error:

      From worker 2:    b: 5061.860461804876, c: 0.0
┌ Warning: Worker 2 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Error: Error assigning workers
│   exception =
│    ProcessExitedException(2)
│    Stacktrace:
│     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
│       @ Distributed ~/.julia/juliaup/julia-1.10.4+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:1093
│     [2] worker_from_id
│       @ ~/.julia/juliaup/julia-1.10.4+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:1090 [inlined]
│     [3] remote_do
│       @ ~/.julia/juliaup/julia-1.10.4+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:557 [inlined]
│     [4] cleanup_proc(state::Dagger.Sch.ComputeState, p::OSProc, log_sink::TimespanLogging.NoOpLog)
│       @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:408
│     [5] monitor_procs_changed!(ctx::Context, state::Dagger.Sch.ComputeState)
│       @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:890
│     [6] (::Dagger.Sch.var"#100#102"{Context, Dagger.Sch.ComputeState})()
│       @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:508
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:510
┌ Warning: Worker 3 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 12 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 15 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 13 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 17 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 14 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 8 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 11 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 4 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 5 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 10 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 6 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 7 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 16 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 9 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545

The error is sometimes omitted, but the warnings about workers dying are still present. If the lazy API is used, there are no warnings or errors. The warnings seem to be harmless, since they appear only while finishing the job.


Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 5700G with Radeon Graphics
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)

Dagger: 0.18.11

I couldn't find any duplicates.

JamesWrigley commented 3 weeks ago

Could you try on master? I believe this was fixed in #532.

m-fila commented 3 weeks ago

Thank you. I tried master: the error is gone, but the warnings are still there.

JamesWrigley commented 3 weeks ago

Yeah, I think the warnings will have to stay, unless we bring back Dagger.cleanup() for users to explicitly clean things up. They can be safely ignored though, so I'll close this.

jpsamaroo commented 3 weeks ago

If those warnings are happening during a clean Julia shutdown, then we need to improve our fault tolerance logic to properly detect a clean shutdown and thus not emit these warnings, since they're quite scary to see. @m-fila can you confirm that these occur during a Julia exit?
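One way to picture "detecting a clean shutdown" is a flag set from an atexit hook that the warning path checks before logging. The sketch below is purely illustrative and uses only Base Julia; EXITING and report_worker_exit are hypothetical names, not Dagger internals, and nothing here is claimed to match how Dagger actually handles this.

```julia
# Hypothetical sketch: none of these names are Dagger's actual internals.

# Flag that records whether Julia has started a normal shutdown
const EXITING = Ref(false)

# atexit hooks run when the Julia process exits normally
atexit(() -> (EXITING[] = true))

# Returns true if a warning was emitted, false if it was suppressed
function report_worker_exit(id::Integer)
    EXITING[] && return false   # clean shutdown in progress: stay quiet
    @warn "Worker $id died, rescheduling work"
    return true
end
```

With a guard like this, a worker that disappears mid-run still produces the warning, while workers torn down as part of Julia exiting do not.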

m-fila commented 3 weeks ago

Yes, I confirm

jpsamaroo commented 3 weeks ago

Ok, then re-opening this issue since we need to properly silence these warnings.

jpsamaroo commented 3 weeks ago

@m-fila can you please validate that https://github.com/JuliaParallel/Dagger.jl/pull/537 makes the warnings go away for you? It works for me locally.

m-fila commented 3 weeks ago

Yes, they are gone with #537. Thanks!

The warnings still appear, though, if the workers are removed with workers() |> rmprocs.

jpsamaroo commented 3 weeks ago

Yeah, that's a separate issue, because in this case Dagger has no idea that it was intentional for the workers to exit (Distributed.jl doesn't communicate this distinction to Dagger). You would need to call Dagger.rmprocs!(Dagger.Sch.eager_context(), workers()) before calling rmprocs to allow Dagger time to properly clean up the workers.
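Put together, the ordering described above might look like the following. This is a minimal sketch that assumes Dagger is installed and the eager scheduler has already run some work; the worker count and the task are arbitrary choices for the example.

```julia
using Distributed
addprocs(2)   # two local workers; the count is arbitrary for this example
using Dagger

# Run some eager work so the eager scheduler is started and aware of the workers
wait(Dagger.@spawn sum(rand(10)))

ws = workers()

# First tell Dagger that these workers are leaving on purpose ...
Dagger.rmprocs!(Dagger.Sch.eager_context(), ws)

# ... then actually remove them from the Distributed cluster
rmprocs(ws)
```

The order matters here: calling rmprocs first would leave Dagger to discover the missing workers on its own, which is the situation that triggers the "died, rescheduling work" warnings.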