It4innovations / hyperqueue

Scheduler for sub-node tasks for HPC systems with batch scheduling
https://it4innovations.github.io/hyperqueue
MIT License

HyperQueue submission gets stuck if workers fail #764

Open svatosFZU opened 2 weeks ago

svatosFZU commented 2 weeks ago

For some time now, I have been observing the following situation. HyperQueue works fine for a while and then something happens: jobs keep arriving in HyperQueue and are buffered there, but no workers are submitted, even though the allocation queue has its backlog set to 10. The situation from this morning:

Kobzol commented 5 days ago

Hi, sorry for the late reply, I'm quite busy at the moment. The automatic allocator has an internal rate limiter that can cause allocations to be stopped if too many failures occur. I'm not sure if that's what is causing this issue though. Are you still using HQ_AUTOALLOC_MAX_ALLOCATION_FAILS?
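To illustrate the kind of behaviour meant here, below is a minimal sketch of a failure-based limiter. It is not HyperQueue's actual code: the class name `FailureLimiter`, its methods, and the rule that a success resets the counter are all assumptions made for illustration. The point is only that once a failure counter reaches its limit, no further allocations are submitted until something resets it.

```python
class FailureLimiter:
    """Hypothetical limiter, NOT HyperQueue's implementation.

    Blocks new submissions once the number of consecutive
    failures reaches max_fails.
    """

    def __init__(self, max_fails: int) -> None:
        self.max_fails = max_fails
        self.consecutive_fails = 0

    def record(self, success: bool) -> None:
        # Assumption: a successful allocation resets the failure counter.
        if success:
            self.consecutive_fails = 0
        else:
            self.consecutive_fails += 1

    def can_submit(self) -> bool:
        # Submissions stay blocked while the counter is at or above the limit.
        return self.consecutive_fails < self.max_fails


limiter = FailureLimiter(max_fails=3)
for _ in range(3):
    limiter.record(success=False)
blocked = not limiter.can_submit()  # True: three failures in a row hit the limit
```

With a very high limit such as 100000, a limiter of this shape would practically never block, which is why it is worth confirming whether the variable is still set.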

svatosFZU commented 5 days ago

Yes, the HQ server is started with:

RUST_LOG=hyperqueue=debug HQ_AUTOALLOC_MAX_ALLOCATION_FAILS=100000 /home/svatosm/hq-v0.19.0-linux-x64/hq server start --journal /home/svatosm/hqJournal 2> /home/svatosm/hq-debug-output.log &