Open chkabir opened 3 weeks ago
Hi, thanks for the report. It looks like you have cut out the most important part of the stack trace though (you only sent the lines starting at stack frame #97) :) Could you please include the whole stack trace? Thanks!
Sorry about that. This is the whole stack trace:
thread 'main' panicked at crates/tako/src/internal/server/worker.rs:126:9:
assertion failed: self.sn_tasks.remove(&task.id)
stack backtrace:
0: 0x557345f0abf9 - std::backtrace_rs::backtrace::libunwind::trace::hbee8a7973eeb6c93
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/../../backtrace/src/backtrace/libunwind.rs:104:5
1: 0x557345f0abf9 - std::backtrace_rs::backtrace::trace_unsynchronized::hc8ac75eea3aa6899
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: 0x557345f0abf9 - std::sys_common::backtrace::_print_fmt::hc7f3e3b5298b1083
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:68:5
3: 0x557345f0abf9 -
HyperQueue version: v0.19.0
You can also re-run HyperQueue server (and its workers) with the RUST_LOG=hq=debug,tako=debug
environment variable, and attach the logs to the issue, to provide us more information.
Oops, that looks like some race condition, we will take a look.
If you can reproduce the error, could you please run the server with the following environment variable: RUST_LOG=hq=debug,tako=debug hq server start
and then send us the full debug log if it crashes again? It would help us debug it.
It would also be great to know how you create workers (manually/autoalloc?) and what hq submit
commands you are using.
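As a side note for anyone following the thread, the failing assertion `self.sn_tasks.remove(&task.id)` can be illustrated with a minimal sketch. It assumes `sn_tasks` behaves like a standard `HashSet` of task ids, which this thread does not confirm. The `Worker` struct, the `u64` id type, and the `insert_task` and `try_remove_task` helpers are hypothetical names made up for illustration, not the actual code in crates/tako/src/internal/server/worker.rs.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for the worker state; the real struct is not
// shown in this thread.
struct Worker {
    // Assumption: ids of tasks currently assigned to this worker.
    sn_tasks: HashSet<u64>,
}

impl Worker {
    fn new() -> Self {
        Worker { sn_tasks: HashSet::new() }
    }

    fn insert_task(&mut self, task_id: u64) {
        self.sn_tasks.insert(task_id);
    }

    // Returns true only if the id was present and has now been removed.
    // The panic message suggests a call like this is wrapped in `assert!`.
    fn try_remove_task(&mut self, task_id: u64) -> bool {
        self.sn_tasks.remove(&task_id)
    }
}

fn main() {
    let mut worker = Worker::new();
    worker.insert_task(42);

    // The first removal finds the id and succeeds.
    println!("first removal: {}", worker.try_remove_task(42));
    // A second removal of the same id returns false; asserting on this
    // result would panic with "assertion failed".
    println!("second removal: {}", worker.try_remove_task(42));
}
```

Under these assumptions, the panic would mean the server tried to remove a task id that was no longer (or never) in the worker's set, which would be consistent with two code paths racing to finish the same task.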
@chkabir Were you able to reproduce the issue and/or run HQ with more logging? :)
Hi,
I was running an HQ server on the Oven node at metacentrum.cz. The Oven node is supposed to be explicitly designed to let processes run for a long time, even past their walltime. However, in the last few runs the HQ server keeps crashing. Below I attach the relevant lines from the log file:
97: 0x557345bba500 - main
98: 0x154fcfff624a - libc_start_call_main
at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
99: 0x154fcfff6305 - libc_start_main_impl
at ./csu/../csu/libc-start.c:360:3
100: 0x557345ad4049 -
101: 0x0 -
Oops, HyperQueue has crashed. This is a bug, sorry for that.
If you would be so kind, please report this issue at the HQ issue tracker: https://github.com/It4innovations/hyperqueue/issues/new?title=HQ%20crashes
Please include the above error (starting from "thread ... panicked ...") and the stack backtrace in the issue contents, along with the following information:
HyperQueue version: v0.19.0
You can also re-run HyperQueue server (and its workers) with the
RUST_LOG=hq=debug,tako=debug
environment variable, and attach the logs to the issue, to provide us more information.
Can you kindly look into this error?