It4innovations / hyperqueue

Scheduler for sub-node tasks for HPC systems with batch scheduling
https://it4innovations.github.io/hyperqueue
MIT License

allocation files control #636

Closed. svatosFZU closed this issue 8 months ago.

svatosFZU commented 8 months ago

Hi, I am having problems with oversized allocation files which basically stop HyperQueue from functioning. I am using the --worker-start-cmd and --worker-stop-cmd options to run scripts that mount and unmount a software repository. Sometimes something goes wrong on the worker node, which causes stderr to contain O(100M) lines about "Read-only file system". This produces an O(10G) stderr file:

du -h .hq-server/hq-current/autoalloc/1/009/stderr 
22G .hq-server/hq-current/autoalloc/1/009/stderr

This fills up the home directory and stops everything there from running. As I cannot control what happens on the worker node, I need HyperQueue to be able to handle this. So, I would like to have two things:

1.) A way to limit the size of the allocation files. I need to be able to tell HyperQueue to stop filling stderr at some limit (file size or number of lines, whichever is easier). I guess different users would have different requirements, i.e. the limit should be configurable.

2.) When something like this happens, the job is doomed, so I would like to be able to stop/kill it (the batch job, the hq worker, whatever).

Kobzol commented 8 months ago

Hi, do you think that the spammed "Read-only file system" messages are coming from the HQ worker, or from your application/start/stop commands?

If the HQ worker repeatedly does something that fails and does not stop doing it, then that's an HQ bug, and we should look into it. If the output is caused by your code, then HQ could indeed help by sanity-checking that the output is not too large.

It's not that simple, though. HQ could probably watch stdout/stderr somehow, but I'm not sure if we can do it without compromising performance (it could be opt-in though, which should resolve that). However, if the application produces a gazillion bytes by some other means (e.g. by directly opening a file), then HQ will not be able to prevent that (unless we somehow run the task in a sandbox, which will probably not be possible on an HPC cluster).

What would you like to do in case the stdout/stderr fills up? Just stop writing data to it and close the corresponding stream of the task? Or fail the job outright?

Regarding 2), you can stop HQ jobs with hq job cancel <job-id>, and stop HQ workers with hq worker stop <worker-id>.
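For example (the job and worker IDs below are made up, this is just to illustrate the commands):

hq job cancel 12     # cancel the doomed job (ID 12 is hypothetical)
hq worker stop 3     # stop the misbehaving worker (ID 3 is hypothetical)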

svatosFZU commented 8 months ago

The messages are coming from the execution of the start/stop scripts. The thing is that once the script fails, it is too late; to avoid this, something needs to happen while the script is still running. I agree that a general solution for an application producing a lot of data would be too complex, so maybe having something just for those logs would be enough. Regarding performance, doing something like this every 10 seconds (a rough sketch of such a watcher is at the end of this comment):

find .hq-server/hq-current/autoalloc/1/ -type f -size +1M

when there are jobs running would be enough, and it should not be that resource-intensive, as the autoalloc directories tend to contain only tens to hundreds of entries. HQ knows the path, and the size limit could be configurable.

Then there is the question of what to do with the problematic messages. Stopping writing to the log is the clear first step; what to do next is the question. If that failed the job, it would simplify things for me (which I mentioned in point 2), but I don't know whether other users would like that. I guess at that point there could be something like a worker-logoverflow-cmd script which could kill the job, stop the worker, do nothing, etc. Then people could choose.

Regarding 2.), yes, I can use those commands, but I need some trigger or hook that would run them. How would I execute an hq command while worker-start-cmd/worker-stop-cmd is being executed? Although, if the log size were limited, HQ would survive and I could execute it at the end of worker-stop-cmd. I did not think of that before.
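To sketch what I mean by the periodic check (the path matches my setup, while the 1M threshold, the 10-second interval and the reactions are just examples):

# Rough watcher sketch: poll the autoalloc logs and react when one grows too large.
while true; do
    big=$(find .hq-server/hq-current/autoalloc/1/ -type f -size +1M)
    if [ -n "$big" ]; then
        # Example reaction: truncate the offending files so the home filesystem survives.
        for f in $big; do : > "$f"; done
        # One could also cancel the affected job or stop the worker here,
        # e.g. hq job cancel <job-id> / hq worker stop <worker-id>.
    fi
    sleep 10
done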

Kobzol commented 8 months ago

The messages are coming from the execution of the start/stop scripts

Hmm, I'm not sure if we can do a lot about that. The start script is executed before a worker is even started, so if some problems happen during the start script, it's outside the scope of HQ.

The start script can be a more or less arbitrarily complex bash script, though (or an invocation of any program on a shared filesystem), so some logic could be embedded into it to watch the output size.
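For example, a small wrapper script on the shared filesystem could cap how much output the start command is allowed to produce (the file names and the 10 MB cap below are just placeholders of mine):

#!/bin/bash
# Hypothetical wrapper, passed e.g. via --worker-start-cmd 'bash /home/svatos/start_wrapper.sh'.
# It caps the combined stdout/stderr of the mount script at 10 MB, so a runaway
# "Read-only file system" loop cannot fill the home directory.
# Note: the mount script receives SIGPIPE once the cap is reached, which also stops it.
bash /home/svatos/mount_repo.sh 2>&1 | head -c 10M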

svatosFZU commented 8 months ago

Right. I guess this would need to be done at the level of the HQ directory rather than per worker. Could you elaborate on the logic to embed into the start script to watch the size? As far as I understand, it would basically need two parallel processes: one performing the command from the start script and another one checking the log size. Can worker-start-cmd do that? I have one more question related to that: if the script deleted stderr, what would happen next? Will HQ stop writing the log? Or will it recreate it and start filling the space again?

Kobzol commented 8 months ago

Could you elaborate on the logic to embed into the start script to watch the size? As far as I understand, it would basically need two parallel processes: one performing the command from the start script and another one checking the log size. Can worker-start-cmd do that?

Well, it can do anything, in theory. You can use something like --worker-start-cmd 'python3 /home/svatos/complex-python-script.py' and write a script or a program that will do what you want.

Will HQ stop writing the log? Or will it recreate it and start filling the space again?

It's important to note that the start command runs completely outside the context of HyperQueue, before the worker is even started. Any stdout/stderr that you see from it is actually the stdout/stderr of the PBS/Slurm job itself! This output is not handled by HQ in any way.
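That said, since that file belongs to the Slurm job itself, your stop command could look it up and trim it at the end. A rough Slurm-specific sketch (not an HQ feature; the 100 MB threshold is arbitrary and it assumes GNU grep/stat are available):

# Rough sketch for the end of a worker-stop-cmd script: find the batch job's own stderr file,
# which is the autoalloc stderr, and truncate it if it has grown too large.
err=$(scontrol show job "$SLURM_JOB_ID" | grep -oP 'StdErr=\K\S+')
if [ -n "$err" ] && [ "$(stat -c %s "$err")" -gt $((100 * 1024 * 1024)) ]; then
    : > "$err"   # truncate in place so the home quota is not exhausted
fi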

svatosFZU commented 8 months ago

OK, if this is a problem outside of HyperQueue then I guess this issue can be closed. I will see what I can do there.