It4innovations / hyperqueue

Scheduler for sub-node tasks for HPC systems with batch scheduling
https://it4innovations.github.io/hyperqueue
MIT License

"Invalid entry type: 32" in hq log when force removing a submission queue #716

Open ZJaume opened 1 week ago

ZJaume commented 1 week ago

Hi,

I had HQ tasks that were failing, but I could not debug their error messages because reading the HQ log file failed. Running

hq log hq-processing.log summary

gave me

Invalid entry type: 32

and the same happened with any of the log subcommands ('summary', 'show' or 'cat').

This is how I'm creating the allocation queue:

hq alloc add slurm --name processing \
    --workers-per-alloc 1 --max-worker-count $WORKERS --backlog $WORKERS \
    --idle-timeout 2m --time-limit 30m \
    -- -p debug -A $SBATCH_ACCOUNT \
    --cpus-per-task 128 --ntasks 1 --mem-per-cpu 1750 \
    -o "$SLURM_LOGS_DIR/hq-worker-%x.out" -e "$SLURM_LOGS_DIR/hq-worker-%x.err"

Then I'm creating a TSV with all the task parameters (an example of the entries file is sketched below) and submitting an array of tasks like this:

hq submit --each-line $entries \
    --cpus 128 \
    --progress --log=$SLURM_LOGS_DIR/hq-processing.log \
    --max-fails=5 --crash-limit=5 \
    bash 10.processing
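
The entries file is just plain text with one task per line; the content below is made up, but it shows the idea (as far as I understand the docs, each line ends up in the HQ_ENTRY environment variable of its task):

# $entries (made-up content): one line per task, fields read by 10.processing
data/part-0001.tsv  en  de
data/part-0002.tsv  en  de
data/part-0003.tsv  en  de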

After the job is finished and the hq submit command stops waiting, I'm removing the queue with hq alloc remove --force $qid. I noticed that without --force the log was readable and contained the task log messages. But to remove the queue automatically (without --force) after hq submit finishes, I would need to submit fewer tasks than --max-fails. So I guess the problem here is force-removing the queue while it still has a running job (even though the blocking hq submit --progress has finished and the remaining tasks won't run because the job has reached the --max-fails limit).
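
For reference, the whole sequence in my script looks roughly like this (how $qid is obtained here is only illustrative; I store the queue id that hq alloc add prints):

# Simplified sketch of the workflow described above
hq alloc add slurm --name processing ...        # queue id stored in $qid
hq submit --each-line $entries --progress ...   # blocks until the array finishes or hits --max-fails
hq alloc remove --force $qid                    # after this step the log becomes unreadable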

Is it possible that this is a bug, and could I get the log messages (or at least the ones that have been captured, as stated in the documentation) even if I run hq alloc remove --force?

HyperQueue version: 0.19.0
Slurm version: 22.05.10

Thanks in advance, Jaume

Kobzol commented 1 week ago

It's possible that this is indeed a bug in HQ; the I/O streaming implementation has some deficiencies and there may be edge cases (possibly an interrupted write is causing this). We will take a look, thanks for reporting the issue.

spirali commented 1 week ago

Hi,

Can you please share with us your hq-processing.log?

ZJaume commented 1 week ago

Sure! But I've made a few changes and now the error message shown by hq log seems to be different every time (sometimes it even manages to read the log completely). Hope these are useful. This one says "Instance already exists": hq-processing.log

and this one says "Invalid entry type 10": hq-processing2.log

spirali commented 4 days ago

Thank you for your logs. They are corrupted in an unusual way. Can you please start the server as "RUST_LOG=debug hq server start", trigger the error, and send us the output?
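
For example, something like this should capture everything into a file (the redirection is only a suggestion):

# Run the server with debug logging and keep a copy of the output
RUST_LOG=debug hq server start 2>&1 | tee hq-server-debug.log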

It is most probably our bug, but just to be sure: is there a chance that some other process of yours also tries to write to hq-processing.log?

Kobzol commented 4 days ago

It is most probably our bug, but just to be sure: is there a chance that some other process of yours also tries to write to hq-processing.log?

In theory, this could happen if multiple allocations write to the same file in parallel. Is SLURM_LOGS_DIR guaranteed to be a unique directory per Slurm allocation?
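
If it is not, a quick way to rule this out would be to make the path unique per run, for example something like:

# Just a sketch: give every run its own log directory
SLURM_LOGS_DIR="$HOME/hq-logs/run-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$SLURM_LOGS_DIR"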

spirali commented 4 days ago

It should not be the case, as the server is the only one that writes into the file.

Kobzol commented 4 days ago

Sorry, I confused this with another issue where a separate server was started in each allocation. If there is only one server, it should indeed not happen, provided the server is able to handle two submits pointing to the same log filename.

spirali commented 4 days ago

Because the file is corrupted at the beginning, my current hypothesis is that when a worker is killed, the event is propagated to the server and the stream file is closed; however, some event (maybe a pending connection from an already dead worker) may reinitialize the file and overwrite it. I am looking at the code now and do not see a code path through which this could happen, but distributed races are hard to find at first sight. We will know more after we see the logs in debug mode.
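
As a trivial illustration of why such a reinitialization would corrupt the beginning of the file (this is not HQ code, just the general pattern):

# Not HQ code: a late writer re-opening the file and writing from offset 0
# without truncating overwrites the original header and misaligns what follows.
printf 'OLD-HEADER entry-1 entry-2\n' > stream.log
printf 'NEW-HDR ' | dd of=stream.log conv=notrunc 2>/dev/null
cat stream.log   # the leftover bytes after the new header no longer parse as valid entries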

ZJaume commented 4 days ago

There you go: https://pastebin.com/x1hjApfH

If the corruption might happen due to multiple processes (hq server) writing to the same log file, that is probably not the case here. I'm not sharing the log with anyone else and I'm only running one server. However, among all the test runs that I've done, there might be some in which an allocation started, a job was submitted, then it failed and the queue was cancelled, all while there were still some Slurm jobs in the CG state from the previous queue. Although the first time I encountered the issue, there was no such overlap.
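
For what it's worth, I can check for those lingering jobs with something like:

# List my Slurm jobs still in the completing (CG) state
squeue -u $USER -t CG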

spirali commented 3 days ago

Thank you for your log. It seems ok.

However, I have found a problem in the streaming code today that may potentially cause this bug. I will try to fix it during the rest of the week; it may also solve this problem.