It4innovations / hyperqueue

Scheduler for sub-node tasks for HPC systems with batch scheduling
https://it4innovations.github.io/hyperqueue
MIT License

HyperQueue v0.19 coredumps when loading journal #742

Closed by svatosFZU 1 month ago

svatosFZU commented 2 months ago

Hi, I have started using HQ version 0.19 to try out the journal recovery functionality. I start the HQ server using:

RUST_LOG=hyperqueue=debug HQ_AUTOALLOC_MAX_ALLOCATION_FAILS=100000 /home/svatosm/hq-v0.19.0-linux-x64/hq server start --journal /mnt/nfs19/svatos/hqJournal 2> /home/svatosm/hq-debug-output.log &
[1] 1536880

If the journal is empty, this works and HQ starts. But when I tried to use an already existing journal, the process ended with:

+------------------+--------------------------+
| Server directory | /home/svatosm/.hq-server |
| Server UID       | 3gpMk8                   |
| Client host      | ui3.farm.particle.cz     |
| Client port      | 36257                    |
| Worker host      | ui3.farm.particle.cz     |
| Worker port      | 38491                    |
| Version          | v0.19.0                  |
| Pid              | 1536880                  |
| Start date       | 2024-08-22 10:06:28 UTC  |
+------------------+--------------------------+

[1]+  Aborted                 (core dumped) RUST_LOG=hyperqueue=debug HQ_AUTOALLOC_MAX_ALLOCATION_FAILS=100000 /home/svatosm/hq-v0.19.0-linux-x64/hq server start --journal /mnt/nfs19/svatos/hqJournal 2> /home/svatosm/hq-debug-output.log

More details in the debug log: https://www.fzu.cz/~svatosm/hq-debug-output.log

spirali commented 2 months ago

This error should be fixed by the changes in #740, which should be merged within a few days. The journal format will change with that PR, so your old journal will no longer be loadable. If the journal is valuable to you, we can backport the fix to v0.19.0. Your journal file is correct; only the loader contains a bug.

svatosFZU commented 2 months ago

Thanks for the info. I have no problem with getting rid of the current jobs, so let me know when it is in.

spirali commented 2 months ago

The PR was merged into main. The changes are available in the nightly build.

svatosFZU commented 2 months ago

Thanks. I deleted the old journal file and I put today's nightly in production.
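
For anyone in the same situation who would rather keep the old journal than delete it, here is a minimal sketch of moving it aside before starting the new build. It assumes the server is stopped and that the journal is a single file. `set_aside_journal` is a hypothetical helper written for this example, not part of HyperQueue. The path is the one from the report above.

```shell
#!/bin/sh
# Hypothetical helper (not part of HyperQueue): rename an existing,
# non-empty journal to a timestamped backup so that the next server
# start begins from an empty journal. Prints the backup path on success.
set_aside_journal() {
    journal="$1"
    # Nothing to do when the journal is missing or empty.
    if [ ! -s "$journal" ]; then
        return 0
    fi
    backup="$journal.bak.$(date +%Y%m%d%H%M%S)"
    mv -- "$journal" "$backup" && printf '%s\n' "$backup"
}

# Usage, commented out so that sourcing this file has no side effects:
# set_aside_journal /mnt/nfs19/svatos/hqJournal
# hq server start --journal /mnt/nfs19/svatos/hqJournal
```

Renaming instead of deleting keeps the old file around in case a backported loader fix makes it readable again.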

Kobzol commented 1 month ago

Fixed by https://github.com/It4innovations/hyperqueue/pull/740.