It4innovations / hyperqueue

Scheduler for sub-node tasks for HPC systems with batch scheduling
https://it4innovations.github.io/hyperqueue
MIT License

hq crashes when server running at metacentrum front-end node and workers running at metacentrum compute nodes and on-prem cluster #731

Closed jose-d closed 1 month ago

jose-d commented 1 month ago

I'm trying to submit into hq from a metacentrum front-end node. When there are workers running at both my on-prem cluster and metacentrum, I get a crash.

(BOOKWORM)jose@tarkil:/home/jose/projects/2024_07_22__distributed_hyperqueue$ hq submit --array 1-30 --stdout=%{JOB_ID}_%{TASK_ID}-%{INSTANCE_ID}.o  --stderr=%{JOB_ID}_%{TASK_ID}-%{INSTANCE_ID}.e --log=log.log hostname
Job submitted successfully, job ID: 14
(BOOKWORM)jose@tarkil:/home/jose/projects/2024_07_22__distributed_hyperqueue$ hq job info 14
+----------------------+----------------------------------------------------------------------------+
| ID                   | 14                                                                         |
| Name                 | hostname                                                                   |
| State                | [########################################]                                 |
|                      | FAILED (8)                                                                 |
|                      | FINISHED (22)                                                              |
| Tasks                | 30; Ids: 1-30                                                              |
| Workers              | nympha18.meta.zcu.cz                                                       |
| Resources            | cpus: 1 compact                                                            |
| Priority             | 0                                                                          |
| Command              | hostname                                                                   |
| Stdout               | 14_%{TASK_ID}-%{INSTANCE_ID}.o                                             |
| Stderr               | 14_%{TASK_ID}-%{INSTANCE_ID}.e                                             |
| Environment          |                                                                            |
| Working directory    | /auto/vestec1-elixir/home/jose/projects/2024_07_22__distributed_hyperqueue |
| Task time limit      | None                                                                       |
| Crash limit          | 5                                                                          |
| Submission date      | 2024-07-22 14:45:52 UTC                                                    |
| Submission directory | /auto/vestec1-elixir/home/jose/projects/2024_07_22__distributed_hyperqueue |
| Makespan             | 684ms                                                                      |
+----------------------+----------------------------------------------------------------------------+
thread 'main' panicked at crates/hyperqueue/src/client/output/cli.rs:1274:9:
assertion failed: !ids.is_empty()
stack backtrace:
   0:     0x5602f42cdbf9 - std::backtrace_rs::backtrace::libunwind::trace::hbee8a7973eeb6c93
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/../../backtrace/src/backtrace/libunwind.rs:104:5
   1:     0x5602f42cdbf9 - std::backtrace_rs::backtrace::trace_unsynchronized::hc8ac75eea3aa6899
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x5602f42cdbf9 - std::sys_common::backtrace::_print_fmt::hc7f3e3b5298b1083
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:68:5
   3:     0x5602f42cdbf9 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hbb235daedd7c6190
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x5602f4018b60 - core::fmt::rt::Argument::fmt::h76c38a80d925a410
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/fmt/rt.rs:142:9
   5:     0x5602f4018b60 - core::fmt::write::h3ed6aeaa977c8e45
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/fmt/mod.rs:1120:17
   6:     0x5602f429687e - std::io::Write::write_fmt::h78b18af5775fedb5
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/io/mod.rs:1810:15
   7:     0x5602f42cfc2e - std::sys_common::backtrace::_print::h5d645a07e0fcfdbb
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x5602f42cfc2e - std::sys_common::backtrace::print::h85035a511aafe7a8
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x5602f42cf4d7 - std::panicking::default_hook::{{closure}}::hcce8cea212785a25
  10:     0x5602f42cf0bf - std::panicking::default_hook::hf5fcb0f213fe709a
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:292:9
  11:     0x5602f3f7ceeb - call<(&core::panic::panic_info::PanicInfo), (dyn core::ops::function::Fn<(&core::panic::panic_info::PanicInfo), Output=()> + core::marker::Send + core::marker::Sync), alloc::alloc::Global>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/boxed.rs:2029:9
  12:     0x5602f3f7ceeb - {closure#0}
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/bin/hq.rs:360:9
  13:     0x5602f42d021a - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::hbc5ccf4eb663e1e5
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/boxed.rs:2029:9
  14:     0x5602f42d021a - std::panicking::rust_panic_with_hook::h095fccf1dc9379ee
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:783:13
  15:     0x5602f42cff68 - std::panicking::begin_panic_handler::{{closure}}::h032ba12139b353db
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:649:13
  16:     0x5602f42cfef6 - std::sys_common::backtrace::__rust_end_short_backtrace::h9259bc2ff8fd0f76
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:171:18
  17:     0x5602f42cfeef - rust_begin_unwind
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5
  18:     0x5602f3e57074 - core::panicking::panic_fmt::h784f20a50eaab275
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:72:14
  19:     0x5602f3e57242 - core::panicking::panic::hb837a5ebbbe5b188
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:144:5
  20:     0x5602f40dccc0 - format_workers
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/client/output/cli.rs:1274:9
  21:     0x5602f40dc73c - {closure#0}
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/client/output/cli.rs:198:25
  22:     0x5602f40dc73c - call_mut<(&hyperqueue::server::job::JobTaskInfo), hyperqueue::client::output::cli::{impl#0}::print_task_summary::{closure_env#0}>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/ops/function.rs:294:13
  23:     0x5602f40dc73c - find_map<hyperqueue::server::job::JobTaskInfo, alloc::vec::Vec<cli_table::cell::CellStruct, alloc::alloc::Global>, &mut hyperqueue::client::output::cli::{impl#0}::print_task_summary::{closure_env#0}>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/slice/iter/macros.rs:319:38
  24:     0x5602f40dc73c - next<alloc::vec::Vec<cli_table::cell::CellStruct, alloc::alloc::Global>, core::slice::iter::Iter<hyperqueue::server::job::JobTaskInfo>, hyperqueue::client::output::cli::{impl#0}::print_task_summary::{closure_env#0}>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/iter/adapters/filter_map.rs:63:9
  25:     0x5602f40dc73c - next<core::iter::adapters::filter_map::FilterMap<core::slice::iter::Iter<hyperqueue::server::job::JobTaskInfo>, hyperqueue::client::output::cli::{impl#0}::print_task_summary::{closure_env#0}>>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/iter/adapters/take.rs:41:13
  26:     0x5602f40dc1b0 - from_iter<alloc::vec::Vec<cli_table::cell::CellStruct, alloc::alloc::Global>, core::iter::adapters::take::Take<core::iter::adapters::filter_map::FilterMap<core::slice::iter::Iter<hyperqueue::server::job::JobTaskInfo>, hyperqueue::client::output::cli::{impl#0}::print_task_summary::{closure_env#0}>>>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/vec/spec_from_iter_nested.rs:26:32
  27:     0x5602f40dc1b0 - from_iter<alloc::vec::Vec<cli_table::cell::CellStruct, alloc::alloc::Global>, core::iter::adapters::take::Take<core::iter::adapters::filter_map::FilterMap<core::slice::iter::Iter<hyperqueue::server::job::JobTaskInfo>, hyperqueue::client::output::cli::{impl#0}::print_task_summary::{closure_env#0}>>>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/vec/spec_from_iter.rs:33:9
  28:     0x5602f40dc1b0 - from_iter<alloc::vec::Vec<cli_table::cell::CellStruct, alloc::alloc::Global>, core::iter::adapters::take::Take<core::iter::adapters::filter_map::FilterMap<core::slice::iter::Iter<hyperqueue::server::job::JobTaskInfo>, hyperqueue::client::output::cli::{impl#0}::print_task_summary::{closure_env#0}>>>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/vec/mod.rs:2791:9
  29:     0x5602f40dc1b0 - collect<core::iter::adapters::take::Take<core::iter::adapters::filter_map::FilterMap<core::slice::iter::Iter<hyperqueue::server::job::JobTaskInfo>, hyperqueue::client::output::cli::{impl#0}::print_task_summary::{closure_env#0}>>, alloc::vec::Vec<alloc::vec::Vec<cli_table::cell::CellStruct, alloc::alloc::Global>, alloc::alloc::Global>>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/iter/traits/iterator.rs:2054:9
  30:     0x5602f40dc1b0 - print_task_summary
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/client/output/cli.rs:205:14
  31:     0x5602f40d91b0 - print_job_detail
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/client/output/cli.rs:556:13
  32:     0x5602f3f7ae02 - {async_fn#0}
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/client/commands/job.rs:176:5
  33:     0x5602f3f7ae02 - {async_fn#0}
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/bin/hq.rs:106:63
  34:     0x5602f3f7ae02 - {async_block#0}
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/bin/hq.rs:416:52
  35:     0x5602f3f6283d - poll<&mut hq::main::{async_block_env#0}>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/future/future.rs:124:9
  36:     0x5602f3f6283d - {closure#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/current_thread/mod.rs:659:57
  37:     0x5602f3f6283d - with_budget<core::task::poll::Poll<core::result::Result<(), hyperqueue::common::error::HqError>>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/coop.rs:107:5
  38:     0x5602f3f6283d - budget<core::task::poll::Poll<core::result::Result<(), hyperqueue::common::error::HqError>>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/coop.rs:73:5
  39:     0x5602f3f6283d - {closure#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/current_thread/mod.rs:659:25
  40:     0x5602f3f6283d - enter<core::task::poll::Poll<core::result::Result<(), hyperqueue::common::error::HqError>>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/current_thread/mod.rs:404:19
  41:     0x5602f3f6283d - {closure#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/current_thread/mod.rs:658:36
  42:     0x5602f3f6283d - {closure#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/current_thread/mod.rs:737:68
  43:     0x5602f3f6283d - set<tokio::runtime::scheduler::Context, tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>)>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/context/scoped.rs:40:9
  44:     0x5602f3f6283d - {closure#0}<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/context.rs:176:26
  45:     0x5602f3f6283d - try_with<tokio::runtime::context::Context, tokio::runtime::context::set_scheduler::{closure_env#0}<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>)>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/thread/local.rs:270:16
  46:     0x5602f3f6283d - with<tokio::runtime::context::Context, tokio::runtime::context::set_scheduler::{closure_env#0}<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>)>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/thread/local.rs:246:9
  47:     0x5602f3f6283d - set_scheduler<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/context.rs:176:17
  48:     0x5602f3f6283d - enter<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/current_thread/mod.rs:737:27
  49:     0x5602f3f6283d - block_on<core::pin::Pin<&mut hq::main::{async_block_env#0}>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/current_thread/mod.rs:646:19
  50:     0x5602f3f6283d - {closure#0}<hq::main::{async_block_env#0}>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/current_thread/mod.rs:175:28
  51:     0x5602f3f6283d - enter_runtime<tokio::runtime::scheduler::current_thread::{impl#0}::block_on::{closure_env#0}<hq::main::{async_block_env#0}>, core::result::Result<(), hyperqueue::common::error::HqError>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/context/runtime.rs:65:16
  52:     0x5602f3f6283d - block_on<hq::main::{async_block_env#0}>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/current_thread/mod.rs:167:9
  53:     0x5602f3f6283d - block_on<hq::main::{async_block_env#0}>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/runtime.rs:348:47
  54:     0x5602f3f6283d - main
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/bin/hq.rs:456:5
  55:     0x5602f3eee203 - call_once<fn() -> core::result::Result<(), hyperqueue::common::error::HqError>, ()>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/ops/function.rs:250:5
  56:     0x5602f3eee203 - __rust_begin_short_backtrace<fn() -> core::result::Result<(), hyperqueue::common::error::HqError>, core::result::Result<(), hyperqueue::common::error::HqError>>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:155:18
  57:     0x5602f3f7d500 - main
  58:     0x7f04dbced24a - __libc_start_call_main
                               at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
  59:     0x7f04dbced305 - __libc_start_main_impl
                               at ./csu/../csu/libc-start.c:360:3
  60:     0x5602f3e97049 - <unknown>
  61:                0x0 - <unknown>
Oops, HyperQueue has crashed. This is a bug, sorry for that.
If you would be so kind, please report this issue at the HQ issue tracker: https://github.com/It4innovations/hyperqueue/issues/new?title=HQ%20crashes
Please include the above error (starting from "thread ... panicked ...") and the stack backtrace in the issue contents, along with the following information:

HyperQueue version: v0.19.0

You can also re-run HyperQueue server (and its workers) with the `RUST_LOG=hq=debug,tako=debug`
environment variable, and attach the logs to the issue, to provide us more information.

Aborted (core dumped)
(BOOKWORM)jose@tarkil:/home/jose/projects/2024_07_22__distributed_hyperqueue$

note:

I must admit that Metacentrum uses quite an interesting setup, where $HOME differs depending on which sub-cluster one is logged in to or where the job is running, so I suspect the crash could be caused by this, but I'm not sure.

(BOOKWORM)jose@tarkil:/home/jose/projects/2024_07_22__distributed_hyperqueue$ echo $HOME
/storage/praha1/home/jose
(BOOKWORM)jose@tarkil:/home/jose/projects/2024_07_22__distributed_hyperqueue$ pwd
/home/jose/projects/2024_07_22__distributed_hyperqueue
(BOOKWORM)jose@tarkil:/home/jose/projects/2024_07_22__distributed_hyperqueue$

and

$ ll /storage/praha1/home
lrwxrwxrwx 1 root root 25 úno  3  2021 /storage/praha1/home -> /auto/vestec1-elixir/home/
$ 

On my on-prem system, there is no /storage or /auto directory tree; my home there is /home/jose.

Kobzol commented 1 month ago

Hi, thanks for reporting the issue! HQ sees that there is a task in a failed state that does not have any workers attached. This can happen, although it's quite rare, so the CLI printing code did not consider this possibility and instead used an explicit assert that it cannot happen.

https://github.com/It4innovations/hyperqueue/pull/732 should fix this.
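
To illustrate the failure mode described above, here is a minimal sketch, not the actual HyperQueue code and not the contents of the linked PR. The names `WorkerId`, `format_workers_strict`, `format_workers_lenient` and `join_ids` are hypothetical; only the `!ids.is_empty()` assertion is taken from the panic message. The strict variant panics on a task with no workers, while the lenient variant renders an empty cell instead.

```rust
/// Hypothetical stand-in for a worker identifier (assumption, not HQ's real type).
type WorkerId = u32;

/// Joins worker ids into a comma-separated string for a table cell.
fn join_ids(ids: &[WorkerId]) -> String {
    ids.iter()
        .map(|id| id.to_string())
        .collect::<Vec<_>>()
        .join(", ")
}

/// Strict variant: assumes every task has at least one worker attached,
/// so it panics with "assertion failed: !ids.is_empty()" on an empty slice.
fn format_workers_strict(ids: &[WorkerId]) -> String {
    assert!(!ids.is_empty());
    join_ids(ids)
}

/// Lenient variant: a task without workers yields an empty cell
/// instead of aborting the whole CLI command.
fn format_workers_lenient(ids: &[WorkerId]) -> String {
    if ids.is_empty() {
        return String::new();
    }
    join_ids(ids)
}
```

With this sketch, `format_workers_lenient(&[])` returns an empty string, whereas `format_workers_strict(&[])` panics in the same way as the `hq job info` output shown in the report.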

Kobzol commented 1 month ago

The issue should be fixed now. You can try it in the next nightly release starting from tomorrow, and it will of course be available from HQ 0.20 onwards.

jose-d commented 1 month ago

hi, with nightly-2024-07-21-0f100dfb6fca2daf70dd90fd5b9e2cb3306a72a7 I'm getting:

(BOOKWORM)jose@tarkil:~$ ./tools/hq job info 1 &> hq_job_info_1.txt
Aborted (core dumped)
(BOOKWORM)jose@tarkil:~$
(BOOKWORM)jose@tarkil:~$ cat hq_job_info_1.txt
+----------------------+--------------------------------------------------------+
| ID                   | 1                                                      |
| Name                 | sleep                                                  |
| State                | FAILED                                                 |
| Tasks                | 1; Ids: 0                                              |
| Workers              |                                                        |
| Resources            | cpus: 1 compact                                        |
| Priority             | 0                                                      |
| Command              | sleep                                                  |
|                      | 1                                                      |
| Stdout               | /auto/vestec1-elixir/home/jose/job-1/%{TASK_ID}.stdout |
| Stderr               | /auto/vestec1-elixir/home/jose/job-1/%{TASK_ID}.stderr |
| Environment          |                                                        |
| Working directory    | /auto/vestec1-elixir/home/jose                         |
| Task time limit      | None                                                   |
| Crash limit          | 5                                                      |
| Submission date      | 2024-07-23 09:10:35 UTC                                |
| Submission directory | /auto/vestec1-elixir/home/jose                         |
| Makespan             | 1ms                                                    |
+----------------------+--------------------------------------------------------+
thread 'main' panicked at crates/hyperqueue/src/client/output/cli.rs:1274:9:
assertion failed: !ids.is_empty()
stack backtrace:
   0:     0x558721dfc109 - std::backtrace_rs::backtrace::libunwind::trace::hbee8a7973eeb6c93
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/../../backtrace/src/backtrace/libunwind.rs:104:5
   1:     0x558721dfc109 - std::backtrace_rs::backtrace::trace_unsynchronized::hc8ac75eea3aa6899
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x558721dfc109 - std::sys_common::backtrace::_print_fmt::hc7f3e3b5298b1083
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:68:5
   3:     0x558721dfc109 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hbb235daedd7c6190
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x558721b45d20 - core::fmt::rt::Argument::fmt::h76c38a80d925a410
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/fmt/rt.rs:142:9
   5:     0x558721b45d20 - core::fmt::write::h3ed6aeaa977c8e45
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/fmt/mod.rs:1120:17
   6:     0x558721dc4d8e - std::io::Write::write_fmt::h78b18af5775fedb5
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/io/mod.rs:1810:15
   7:     0x558721dfe13e - std::sys_common::backtrace::_print::h5d645a07e0fcfdbb
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x558721dfe13e - std::sys_common::backtrace::print::h85035a511aafe7a8
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x558721dfd9e7 - std::panicking::default_hook::{{closure}}::hcce8cea212785a25
  10:     0x558721dfd5cf - std::panicking::default_hook::hf5fcb0f213fe709a
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:292:9
  11:     0x558721aadabb - call<(&core::panic::panic_info::PanicInfo), (dyn core::ops::function::Fn<(&core::panic::panic_info::PanicInfo), Output=()> + core::marker::Send + core::marker::Sync), alloc::alloc::Global>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/boxed.rs:2029:9
  12:     0x558721aadabb - {closure#0}
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/bin/hq.rs:360:9
  13:     0x558721dfe72a - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::hbc5ccf4eb663e1e5
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/boxed.rs:2029:9
  14:     0x558721dfe72a - std::panicking::rust_panic_with_hook::h095fccf1dc9379ee
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:783:13
  15:     0x558721dfe478 - std::panicking::begin_panic_handler::{{closure}}::h032ba12139b353db
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:649:13
  16:     0x558721dfe406 - std::sys_common::backtrace::__rust_end_short_backtrace::h9259bc2ff8fd0f76
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:171:18
  17:     0x558721dfe3ff - rust_begin_unwind
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5
  18:     0x558721988444 - core::panicking::panic_fmt::h784f20a50eaab275
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:72:14
  19:     0x558721988612 - core::panicking::panic::hb837a5ebbbe5b188
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:144:5
  20:     0x558721c0c830 - format_workers
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/client/output/cli.rs:1274:9
  21:     0x558721c0c2a4 - {closure#0}
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/client/output/cli.rs:198:25
  22:     0x558721c0c2a4 - call_mut<(&hyperqueue::server::job::JobTaskInfo), hyperqueue::client::output::cli::{impl#0}::print_task_summary::{closure_env#0}>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/ops/function.rs:294:13
  23:     0x558721c0c2a4 - find_map<hyperqueue::server::job::JobTaskInfo, alloc::vec::Vec<cli_table::cell::CellStruct, alloc::alloc::Global>, &mut hyperqueue::client::output::cli::{impl#0}::print_task_summary::{closure_env#0}>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/slice/iter/macros.rs:319:38
  24:     0x558721c0c2a4 - next<alloc::vec::Vec<cli_table::cell::CellStruct, alloc::alloc::Global>, core::slice::iter::Iter<hyperqueue::server::job::JobTaskInfo>, hyperqueue::client::output::cli::{impl#0}::print_task_summary::{closure_env#0}>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/iter/adapters/filter_map.rs:63:9
  25:     0x558721c0c2a4 - next<core::iter::adapters::filter_map::FilterMap<core::slice::iter::Iter<hyperqueue::server::job::JobTaskInfo>, hyperqueue::client::output::cli::{impl#0}::print_task_summary::{closure_env#0}>>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/iter/adapters/take.rs:41:13
  26:     0x558721c0bd0c - from_iter<alloc::vec::Vec<cli_table::cell::CellStruct, alloc::alloc::Global>, core::iter::adapters::take::Take<core::iter::adapters::filter_map::FilterMap<core::slice::iter::Iter<hyperqueue::server::job::JobTaskInfo>, hyperqueue::client::output::cli::{impl#0}::print_task_summary::{closure_env#0}>>>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/vec/spec_from_iter_nested.rs:26:32
  27:     0x558721c0bd0c - from_iter<alloc::vec::Vec<cli_table::cell::CellStruct, alloc::alloc::Global>, core::iter::adapters::take::Take<core::iter::adapters::filter_map::FilterMap<core::slice::iter::Iter<hyperqueue::server::job::JobTaskInfo>, hyperqueue::client::output::cli::{impl#0}::print_task_summary::{closure_env#0}>>>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/vec/spec_from_iter.rs:33:9
  28:     0x558721c0bd0c - from_iter<alloc::vec::Vec<cli_table::cell::CellStruct, alloc::alloc::Global>, core::iter::adapters::take::Take<core::iter::adapters::filter_map::FilterMap<core::slice::iter::Iter<hyperqueue::server::job::JobTaskInfo>, hyperqueue::client::output::cli::{impl#0}::print_task_summary::{closure_env#0}>>>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/vec/mod.rs:2791:9
  29:     0x558721c0bd0c - collect<core::iter::adapters::take::Take<core::iter::adapters::filter_map::FilterMap<core::slice::iter::Iter<hyperqueue::server::job::JobTaskInfo>, hyperqueue::client::output::cli::{impl#0}::print_task_summary::{closure_env#0}>>, alloc::vec::Vec<alloc::vec::Vec<cli_table::cell::CellStruct, alloc::alloc::Global>, alloc::alloc::Global>>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/iter/traits/iterator.rs:2054:9
  30:     0x558721c0bd0c - print_task_summary
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/client/output/cli.rs:205:14
  31:     0x558721c08ed0 - print_job_detail
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/client/output/cli.rs:556:13
  32:     0x558721aab985 - {async_fn#0}
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/client/commands/job.rs:176:5
  33:     0x558721aab985 - {async_fn#0}
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/bin/hq.rs:106:63
  34:     0x558721aab985 - {async_block#0}
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/bin/hq.rs:416:52
  35:     0x558721a92dfd - poll<&mut hq::main::{async_block_env#0}>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/future/future.rs:124:9
  36:     0x558721a92dfd - {closure#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:659:57
  37:     0x558721a92dfd - with_budget<core::task::poll::Poll<core::result::Result<(), hyperqueue::common::error::HqError>>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/coop.rs:107:5
  38:     0x558721a92dfd - budget<core::task::poll::Poll<core::result::Result<(), hyperqueue::common::error::HqError>>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/coop.rs:73:5
  39:     0x558721a92dfd - {closure#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:659:25
  40:     0x558721a92dfd - enter<core::task::poll::Poll<core::result::Result<(), hyperqueue::common::error::HqError>>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:404:19
  41:     0x558721a92dfd - {closure#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:658:36
  42:     0x558721a92dfd - {closure#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:737:68
  43:     0x558721a92dfd - set<tokio::runtime::scheduler::Context, tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>)>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context/scoped.rs:40:9
  44:     0x558721a92dfd - {closure#0}<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context.rs:180:26
  45:     0x558721a92dfd - try_with<tokio::runtime::context::Context, tokio::runtime::context::set_scheduler::{closure_env#0}<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>)>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/thread/local.rs:270:16
  46:     0x558721a92dfd - with<tokio::runtime::context::Context, tokio::runtime::context::set_scheduler::{closure_env#0}<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>)>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/thread/local.rs:246:9
  47:     0x558721a92dfd - set_scheduler<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context.rs:180:17
  48:     0x558721a92dfd - enter<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:737:27
  49:     0x558721a92dfd - block_on<core::pin::Pin<&mut hq::main::{async_block_env#0}>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:646:19
  50:     0x558721a92dfd - {closure#0}<hq::main::{async_block_env#0}>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:175:28
  51:     0x558721a92dfd - enter_runtime<tokio::runtime::scheduler::current_thread::{impl#0}::block_on::{closure_env#0}<hq::main::{async_block_env#0}>, core::result::Result<(), hyperqueue::common::error::HqError>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context/runtime.rs:65:16
  52:     0x558721a92dfd - block_on<hq::main::{async_block_env#0}>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:167:9
  53:     0x558721a92dfd - block_on<hq::main::{async_block_env#0}>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/runtime.rs:347:47
  54:     0x558721a92dfd - main
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/bin/hq.rs:456:5
  55:     0x558721a19d93 - call_once<fn() -> core::result::Result<(), hyperqueue::common::error::HqError>, ()>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/ops/function.rs:250:5
  56:     0x558721a19d93 - __rust_begin_short_backtrace<fn() -> core::result::Result<(), hyperqueue::common::error::HqError>, core::result::Result<(), hyperqueue::common::error::HqError>>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:155:18
  57:     0x558721aae0d0 - main
  58:     0x7f17af01024a - __libc_start_call_main
                               at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
  59:     0x7f17af010305 - __libc_start_main_impl
                               at ./csu/../csu/libc-start.c:360:3
  60:     0x5587219c8129 - <unknown>
  61:                0x0 - <unknown>
Oops, HyperQueue has crashed. This is a bug, sorry for that.
If you would be so kind, please report this issue at the HQ issue tracker: https://github.com/It4innovations/hyperqueue/issues/new?title=HQ%20crashes
Please include the above error (starting from "thread ... panicked ...") and the stack backtrace in the issue contents, along with the following information:

HyperQueue version: nightly-2024-07-21-0f100dfb6fca2daf70dd90fd5b9e2cb3306a72a7

You can also re-run HyperQueue server (and its workers) with the `RUST_LOG=hq=debug,tako=debug`
environment variable, and attach the logs to the issue, to provide us more information.
Kobzol commented 1 month ago

Hi, as you can see from your HQ version (2024-07-21), you actually have Monday's nightly, not today's. Unfortunately our nightly release today failed because of some CI network issue. Sorry for that.

I have now restarted the nightly build manually; it should be available at https://github.com/It4innovations/hyperqueue/releases/tag/nightly.

It is interesting that you're encountering this situation so often though. It can be caused by a command failing to start, perhaps because of a wrong binary path. In any case, with the error fixed, it should hopefully be easier to debug.

jose-d commented 1 month ago

> you actually have Monday's nightly

mea culpa. 🤦 I'm deploying Nightly build 2024-07-23 just now.

> It is interesting that you're encountering this situation so often though.

Yes, it is 100% reproducible when running workers on both systems in a single hq session.

Kobzol commented 1 month ago

No need to apologize; it was an error on our side, and it's easy to miss the nightly version.

I noticed that the binary is hostname - that probably doesn't cause problems. My guess is that the working directory might be the culprit. It is set to the submission directory by default, as you can see in the job info table. If the working directory path is not available/cannot be created on the node of the worker, it can cause a task to fail.
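
To check this guess on a given node, a small preflight script can be run there before starting a worker. The following is only a sketch: `check_writable_dir` is a hypothetical helper, not part of HyperQueue, and it assumes that being able to create the directory and write a file in it is a reasonable proxy for what a task needs.

```python
import errno
import os
import tempfile


def check_writable_dir(path):
    """Hypothetical preflight check, not a HyperQueue API.

    Returns (True, None) if `path` exists or can be created and a file
    can be written inside it, otherwise (False, "<ERRNO NAME>: <message>").
    Note that the directory is created as a side effect if it is missing.
    """
    try:
        os.makedirs(path, exist_ok=True)
        # The temporary file is deleted again when the block exits.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as exc:
        name = errno.errorcode.get(exc.errno, str(exc.errno))
        return False, f"{name}: {exc.strerror}"
    return True, None
```

Calling `check_writable_dir("/path/to/submission/dir")` on each node that hosts a worker shows whether the default working directory is usable there; a path that is not shared or not writable on that node comes back as `(False, ...)` together with the errno name.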

jose-d commented 1 month ago

Thanks. I can confirm that now it fails correctly with:

(BOOKWORM)jose@tarkil:~$ ./tools/hq job info 2
+----------------------+--------------------------------------------------------+
| ID                   | 2                                                      |
| Name                 | sleep                                                  |
| State                | [########################################]             |
|                      | FAILED (20)                                            |
|                      | FINISHED (10)                                          |
| Tasks                | 30; Ids: 1-30                                          |
| Workers              | luna106.fzu.cz                                         |
| Resources            | cpus: 1 compact                                        |
| Priority             | 0                                                      |
| Command              | sleep                                                  |
|                      | 1                                                      |
| Stdout               | /auto/vestec1-elixir/home/jose/job-2/%{TASK_ID}.stdout |
| Stderr               | /auto/vestec1-elixir/home/jose/job-2/%{TASK_ID}.stderr |
| Environment          |                                                        |
| Working directory    | /auto/vestec1-elixir/home/jose                         |
| Task time limit      | None                                                   |
| Crash limit          | 5                                                      |
| Submission date      | 2024-07-23 09:39:11 UTC                                |
| Submission directory | /auto/vestec1-elixir/home/jose                         |
| Makespan             | 3s 80ms                                                |
+----------------------+--------------------------------------------------------+
+---------+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Task ID | Worker | Error                                                                                                                                                                                          |
+---------+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 3       |        | Error: Cannot create stdout directory at File { path: "/auto/vestec1-elixir/home/jose/job-2/3.stdout", on_close: None }: Os { code: 13, kind: PermissionDenied, message: "Permission denied" } |
| 4       |        | Error: Cannot create stdout directory at File { path: "/auto/vestec1-elixir/home/jose/job-2/4.stdout", on_close: None }: Os { code: 13, kind: PermissionDenied, message: "Permission denied" } |
| 5       |        | Error: Cannot create stdout directory at File { path: "/auto/vestec1-elixir/home/jose/job-2/5.stdout", on_close: None }: Os { code: 13, kind: PermissionDenied, message: "Permission denied" } |
| 6       |        | Error: Cannot create stdout directory at File { path: "/auto/vestec1-elixir/home/jose/job-2/6.stdout", on_close: None }: Os { code: 13, kind: PermissionDenied, message: "Permission denied" } |
| 9       |        | Error: Cannot create stdout directory at File { path: "/auto/vestec1-elixir/home/jose/job-2/9.stdout", on_close: None }: Os { code: 13, kind: PermissionDenied, message: "Permission denied" } |
+---------+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
20 tasks failed. (5 shown)
(BOOKWORM)jose@tarkil:~$

So the bug causing the crash is fixed, thanks!

I'll open a follow-up thread in Discussions covering this behavior, as I believe the failure above is caused more by the architecture of Metacentrum storage than by a particular bug in HyperQueue.

Thanks again for this fix.

josef

Kobzol commented 1 month ago

Thanks for confirming it. You can try setting the working directory to a relative path; that might help overcome the issues with a non-shared filesystem. We can talk about it more in Discussions.
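
As a rough illustration of why a relative path can help, the sketch below compares how an absolute and a relative directory resolve on two nodes. It assumes that a relative working directory is resolved against the directory the worker process runs in, which this thread does not confirm. `resolve_on_node`, `/scratch/jose`, and `hq-out` are hypothetical names used only for the example.

```python
from pathlib import PurePosixPath


def resolve_on_node(node_cwd, configured_dir):
    """Return the absolute path a node would end up using.

    Hypothetical model: a relative `configured_dir` is joined onto the
    node's own current directory, while an absolute one is used as-is
    (joining with an absolute path discards `node_cwd`).
    """
    return str(PurePosixPath(node_cwd) / configured_dir)


# The absolute path is identical on every node, so it must exist and be
# writable everywhere, which is not the case without a shared filesystem.
print(resolve_on_node("/scratch/jose", "/auto/vestec1-elixir/home/jose"))

# The relative path lands under whatever directory each node provides.
print(resolve_on_node("/scratch/jose", "hq-out"))
print(resolve_on_node("/auto/vestec1-elixir/home/jose", "hq-out"))
```

Under that assumption, each worker writes below a directory that exists locally, at the cost of outputs being spread across the nodes instead of collected in one place.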