JuliaLang / Distributed.jl

Create and control multiple Julia processes remotely for distributed computing. Ships as a Julia stdlib.
https://docs.julialang.org/en/v1/stdlib/Distributed/
MIT License

SIGTERM test leaks stderr interrupt trace #15

Closed. PallHaraldsson closed this issue 7 months ago

PallHaraldsson commented 9 months ago

This is almost certainly unrelated to my error, but I noticed the following trace, and there is apparently no worker 28:

https://buildkite.com/julialang/julia-master/builds/31038#018c515e-2fcc-4fd3-bc6b-59debe1c2e34

      From worker 28:   [20881] signal 15: Terminated
      From worker 28:   in expression starting at none:0
      From worker 28:   epoll_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
      From worker 28:   uv__io_poll at /workspace/srcdir/libuv/src/unix/epoll.c:236
      From worker 28:   uv_run at /workspace/srcdir/libuv/src/unix/core.c:400
      From worker 28:   ijl_task_get_next at /cache/build/tester-amdci5-10/julialang/julia-master/src/partr.c:477
      From worker 28:   poptask at ./task.jl:989
      From worker 28:   wait at ./task.jl:998
      From worker 28:   task_done_hook at ./task.jl:678
      From worker 28:   jfptr_task_done_hook_58822.1 at /cache/build/tester-amdci5-9/julialang/julia-master/julia-e52146150b/lib/julia/sys.so (unknown line)
      From worker 28:   _jl_invoke at /cache/build/tester-amdci5-10/julialang/julia-master/src/gf.c:2906 [inlined]
      From worker 28:   ijl_apply_generic at /cache/build/tester-amdci5-10/julialang/julia-master/src/gf.c:3088
      From worker 28:   jl_apply at /cache/build/tester-amdci5-10/julialang/julia-master/src/julia.h:2139 [inlined]
      From worker 28:   jl_finish_task at /cache/build/tester-amdci5-10/julialang/julia-master/src/task.c:327
      From worker 28:   start_task at /cache/build/tester-amdci5-10/julialang/julia-master/src/task.c:1317
      From worker 28:   unknown function (ip: (nil))
      From worker 28:   Allocations: 3532691 (Pool: 3532545; Big: 146); GC: 5
┌ Warning: rmprocs: process 1 not removed
└ @ Distributed /cache/build/tester-amdci5-9/julialang/julia-master/julia-e52146150b/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:1049

I searched the log for "(28)" and, just in case, through every occurrence of "28", so it seems strange. If this is a known problem, or can be ignored because it is only a warning, feel free to close.

FYI, I also see (likely not a problem): ambiguous (11) | started at 2023-12-10T01:59:11.093 [..] From worker 11: Skipping Base.cwstring

IanButterworth commented 7 months ago

Actually, this was https://github.com/JuliaLang/Distributed.jl/pull/93


I can't understand why we're seeing

Warning: rmprocs: process 1 not removed

because it's coming from

https://github.com/JuliaLang/Distributed.jl/blob/8c033056f0be197060dad7ae39d4a2f7e2d5404f/src/cluster.jl#L1299

https://github.com/JuliaLang/Distributed.jl/blob/8c033056f0be197060dad7ae39d4a2f7e2d5404f/src/cluster.jl#L1246-L1251

https://github.com/JuliaLang/Distributed.jl/blob/8c033056f0be197060dad7ae39d4a2f7e2d5404f/src/cluster.jl#L982-L989

https://github.com/JuliaLang/Distributed.jl/blob/8c033056f0be197060dad7ae39d4a2f7e2d5404f/src/cluster.jl#L1043-L1049

And I don't understand how it's possible for nprocs() > 1 to hold while workers() contains 1.
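For reference, a minimal sketch of how nprocs() and workers() relate. It assumes a Julia session started without `-p`, so only the master process exists at the start. As far as I can tell from current Distributed releases, workers() only returns 1 when the master is the sole process, which is why seeing 1 in it together with nprocs() > 1 looks impossible.

```julia
using Distributed

# Assumption: fresh session without `-p`, so process 1 (the master) is alone.
@assert nprocs() == 1
@assert workers() == [1]        # with no workers, workers() falls back to the master

# After adding a worker, process 1 is no longer listed by workers().
pids = addprocs(1)
@assert nprocs() == 2
@assert !(1 in workers())

# Clean up; rmprocs waits for the worker to exit by default.
rmprocs(pids)
@assert workers() == [1]
```

So a call like rmprocs(workers()) should only ever pass 1 when no workers are left, in which case a guard on nprocs() > 1 would not be entered.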

IanButterworth commented 7 months ago

I don't think this is happening because of https://github.com/JuliaLang/Distributed.jl/blob/8c033056f0be197060dad7ae39d4a2f7e2d5404f/src/cluster.jl#L1299, because we don't see this log: https://github.com/JuliaLang/Distributed.jl/blob/2b23ae478f07e4b347306a351bc6ea7d58789919/src/cluster.jl#L1253

IanButterworth commented 7 months ago

I believe this is coming from this test, which swallows the log: https://github.com/JuliaLang/Distributed.jl/blob/2b23ae478f07e4b347306a351bc6ea7d58789919/test/distributed_exec.jl#L1919-L1925
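A minimal way to reproduce the warning in isolation and capture it in a test rather than letting it reach stderr. This is a sketch, not the linked test: it assumes the warning comes from calling rmprocs(1) directly on the master, in a session with no workers.

```julia
using Distributed
using Test

# Assumption: run on the master (process 1) of a session with no workers.
# rmprocs(1) never removes the master; it only logs the warning seen in the CI output.
# @test_logs collects that record and checks it, so it is not printed to stderr.
@test_logs (:warn, "rmprocs: process 1 not removed") rmprocs(1)

# The master is still there afterwards.
@assert myid() == 1
@assert nprocs() == 1
```

If the same call runs outside such a capturing block, the warning goes to the default logger and shows up in the build output as in the report above.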