JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Assertion failure in Scheduler code #55235

Closed: d-netto closed this 3 weeks ago

d-netto commented 1 month ago

See https://buildkite.com/julialang/julia-master/builds/38431#0190e57f-77e8-461e-afd1-be9abc0297f8:

[556] signal 6 (-6): Aborted
in expression starting at none:1
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7f3c58f4f40e)
__assert_fail at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
ijl_task_get_next at /cache/build/builder-amdci5-0/julialang/julia-master/src/scheduler.c:452
poptask at ./task.jl:1168
wait at ./task.jl:1177
uv_write at ./stream.jl:1073
unsafe_write at ./stream.jl:1146
write at ./strings/io.jl:248 [inlined]
print at ./strings/io.jl:250
unknown function (ip: 0x7f3bbbc89126)
_jl_invoke at /cache/build/builder-amdci5-0/julialang/julia-master/src/gf.c:3177 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-0/julialang/julia-master/src/gf.c:3354
showerror at ./errorshow.jl:152
unknown function (ip: 0x7f3bbbc890b6)
_jl_invoke at /cache/build/builder-amdci5-0/julialang/julia-master/src/gf.c:3177 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-0/julialang/julia-master/src/gf.c:3354
_atexit at ./initdefs.jl:467
jfptr__atexit_67251.1 at /cache/build/tester-amdci4-10/julialang/julia-master/julia-d00e19822c/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci5-0/julialang/julia-master/src/gf.c:3177 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-0/julialang/julia-master/src/gf.c:3354
jl_apply at /cache/build/builder-amdci5-0/julialang/julia-master/src/julia.h:2183 [inlined]
ijl_atexit_hook at /cache/build/builder-amdci5-0/julialang/julia-master/src/init.c:267
jl_exit_thread0_cb at /cache/build/builder-amdci5-0/julialang/julia-master/src/signals-unix.c:508
Allocations: 1 (Pool: 1; Big: 0); GC: 0

This happened in https://github.com/JuliaLang/julia/pull/55233, which is basically an NFC (no functional change) and doesn't change anything in the scheduler, so I think it's unlikely to be related to the PR.

d-netto commented 1 month ago

This test runs inside rr, so there might be a trace uploaded somewhere?

CC: @DilumAluthge who might know.

vtjnash commented 1 month ago

You missed the assertion text in your copy. It is this:

julia: /cache/build/builder-amdci5-0/julialang/julia-master/src/scheduler.c:452: ijl_task_get_next: Assertion `__extension__ ({ __auto_type __atomic_load_ptr = (&ptls->sleep_check_state); __typeof__ (*__atomic_load_ptr) __atomic_load_tmp; __atomic_load (__atomic_load_ptr, &__atomic_load_tmp, (memory_order_relaxed)); __atomic_load_tmp; }) == not_sleeping' failed.

When a signal causes a thread to resume, we also need to force it back into the not_sleeping state and increment nrunning. This is similar to #54721, but it also needs to happen when the signal response is to terminate the process directly (such as in jl_task_frame_noreturn), not just when it throws an InterruptException. I am not entirely certain that we can keep the nrunning counter accurate in this case, but it probably shouldn't matter: we should be tearing down the process fairly aggressively rather than waiting for nrunning to go to zero (though someone could trick it by calling wait() from their atexit hook such that it cannot exit).
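A minimal sketch of the invariant described above, in C11. This is not the actual runtime code: every `sketch_`-prefixed name, the pared-down struct, and the numeric state values are assumptions made for illustration, and only the identifiers sleep_check_state, not_sleeping and nrunning come from the thread. It shows a signal-driven wake path that restores not_sleeping and bumps the running count exactly once, so that a later check like the one at scheduler.c:452 would pass.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins for the per-thread sleep states. */
enum { not_sleeping = 0, sleeping = 1 };

/* Hypothetical, pared-down per-thread state; the real ptls struct is much larger. */
typedef struct {
    _Atomic int8_t sleep_check_state;
} sketch_ptls_t;

/* Hypothetical global count of threads that are not asleep. */
_Atomic int sketch_nrunning = 0;

/* A thread parks itself: mark it sleeping and drop the running count. */
void sketch_go_to_sleep(sketch_ptls_t *ptls)
{
    atomic_store_explicit(&ptls->sleep_check_state, sleeping, memory_order_relaxed);
    atomic_fetch_sub_explicit(&sketch_nrunning, 1, memory_order_relaxed);
}

/* Intended to run on every signal-driven resume path, including the one that
 * goes straight to process teardown. The compare-and-swap makes it idempotent:
 * the count is only incremented if this call actually performed the wake.
 * Returns 1 if the thread was asleep and has been woken, 0 otherwise. */
int sketch_wake_from_signal(sketch_ptls_t *ptls)
{
    int8_t expected = sleeping;
    if (atomic_compare_exchange_strong(&ptls->sleep_check_state, &expected, not_sleeping)) {
        atomic_fetch_add(&sketch_nrunning, 1);
        return 1;
    }
    return 0;
}

/* Mirrors the shape of the failed assertion: a relaxed load compared to not_sleeping. */
int sketch_can_get_next(sketch_ptls_t *ptls)
{
    return atomic_load_explicit(&ptls->sleep_check_state, memory_order_relaxed) == not_sleeping;
}
```

In this sketch, calling `sketch_wake_from_signal` before any code that may re-enter the scheduler (for example an atexit hook that ends up in `wait`) keeps the state consistent. Whether the real counter can be kept accurate on the teardown path is the open question raised above.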

d-netto commented 1 month ago

Ah, OK. Thanks for the clarification.

Suspect it's fine to close then?

DilumAluthge commented 1 month ago

> This test runs inside rr, so there might be a trace uploaded somewhere?

Yeah, if you follow the link to Buildkite, you can click on the "Artifacts" tab, and then you can download the rr trace.

It might be split across multiple parts that you need to combine back together.

giordano commented 1 month ago

This error has been happening at a high rate lately.