It4innovations / hyperqueue

Scheduler for sub-node tasks for HPC systems with batch scheduling
https://it4innovations.github.io/hyperqueue
MIT License

Executing scripts in running allocation before and after hq jobs #600

Closed svatosFZU closed 1 year ago

svatosFZU commented 1 year ago

I have a software repository which I need to mount before any job starts and unmount after the last job finishes. I cannot really do it inside the jobs because there can be only one mount process per machine, but I run 4 hq jobs in it. For mounting, in theory the first job could check whether the mount process is running and start it if not. The problem is that I have no way to block the other hq jobs from starting in the allocation until the mounting is done, so they could fail quickly because the software is inaccessible. For unmounting, I cannot do that in a job because I cannot know which hq job running in the allocation will finish last. Without unmounting, stale mounts will litter the compute nodes and the HPC admins will complain. So, could this mounting/unmounting be done in HyperQueue?

Kobzol commented 1 year ago

Hi, if it were a single job with 4 tasks, you could create one task that performs the mount and on which all 4 tasks depend, and then another task that depends on the 4 tasks and performs the unmount. But if it's indeed 4 separate jobs, then this cannot be used, as HQ doesn't support dependencies between jobs.

Could you do something like this?

$ mount
$ hq submit ...
$ hq submit ...
...
$ hq job wait 1-4
$ unmount

You could also use something like ./hq submit --output-mode quiet <command>, store the result (job ID) into a bash array and then wait for all the job IDs, separated by commas.
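For example, a rough sketch of that pattern as a wrapper script (the payload scripts and the mount-repository/unmount-repository commands are placeholders for your own):

#!/bin/bash
# Mount the repository once, then submit the jobs and collect their IDs
mount-repository

job_ids=()
for payload in job1.sh job2.sh job3.sh job4.sh; do
    # --output-mode quiet prints just the job ID, as mentioned above
    job_ids+=("$(hq submit --output-mode quiet "$payload")")
done

# Wait for all submitted jobs at once (IDs separated by commas), then unmount
hq job wait "$(IFS=,; echo "${job_ids[*]}")"
unmount-repository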

svatosFZU commented 1 year ago

Considering that I have 1k+ jobs in HyperQueue at all times, it would probably need some management system. Also, I am using an allocation queue (which decides what starts where). Would it work with that?

Kobzol commented 1 year ago

Probably not very well. You mention that you have a lot of jobs in HQ; does this mean that only some of them require the repository? Otherwise I assume that you would just leave it mounted all the time, since so many jobs execute. What is the "scope" of the mount? Do you need to mount it once per compute node? Or do you only need to mount it while some specific jobs are executing, and unmount it once they are finished? Is mounting/unmounting a performance concern, or does it primarily serve to clean up resources?

So, to sum up, please try to describe the use-case in more detail :)

svatosFZU commented 1 year ago

Sure. I need it for every job, i.e. whatever runs in an allocation needs to have it. So, it needs to be mounted the whole time the allocation is running. I have single-node allocations, so it is needed per compute node. Regarding performance, mounting takes a few minutes, unmounting is almost instant (it does not use much CPU/RAM).

Kobzol commented 1 year ago

I see. If it's needed for each node, then I suppose that something like this could be helpful for you?

$ hq alloc add pbs ... --init-script "mount-repository"

However, I'm not sure if we can unmount it in a robust way, since PBS can kill the allocation at pretty much any time (especially if you use "best-effort" allocation queues).

svatosFZU commented 1 year ago

Great, thanks, that would work for mounting. For unmounting, I would not worry about kill-related problems. If PBS/Slurm decides to kill the allocation, then it is also its responsibility to clean up. But when my jobs finish cleanly without interference from the batch system, then the responsibility to clean up after my jobs is on my side.

Kobzol commented 1 year ago

Then I suppose that you would also want something like --stop-script "unmount", right?

svatosFZU commented 1 year ago

Exactly.

Kobzol commented 1 year ago

Ok. This is a reasonable use-case. I was expecting that it could be useful to someone, but it hasn't been implemented yet because no one needed it. Until now :) I will try to implement it.

svatosFZU commented 1 year ago

I see the issue was closed. So, could I ask when/in which version this becomes available?

Kobzol commented 1 year ago

It should be available in yesterday's nightly release. Could you please check that it works for you? If yes, we will probably release a new version sometime later this week.

svatosFZU commented 1 year ago

Well, I tried it. While it accepts all the commands, it is not submitting any allocations. So, here is what I did:

  1. I downloaded and untarred hq-nightly-2023-06-25-4ec7bbb501b7dc9d7388249d03952694f1ab3993-linux-x64.tar.gz
  2. I started the HQ:
    RUST_LOG=hyperqueue=debug HQ_AUTOALLOC_MAX_ALLOCATION_FAILS=100000 /home/svatosm/nightly6/hq server start 2> hq-debug-output.log &
  3. I started the allocation queue (first I tried without backlog, later I tried with backlog):
    /home/svatosm/nightly6/hq alloc add slurm --backlog 2 --time-limit 12h --worker-start-cmd "/home/svatosm/ACmount.sh" --worker-stop-cmd "/home/svatosm/ACumount.sh" -- -ADD-23-14 -pp02-intel
  4. I submitted a job:
    /home/svatosm/nightly6/hq submit --time-request=12h --crash-limit=1 --cpus=32 /home/svatosm/21031/hq.sh

    It has been over half an hour; the hq job is still in the waiting state and no allocations have been created. It is also not throwing any errors - the log is at https://www.fzu.cz/~svatosm/hq-debug-output.log

Kobzol commented 1 year ago

Could you please send the contents of /home/svatosm/.hq-server/002/autoalloc/*?

svatosFZU commented 1 year ago

Well, I had stopped the HQ. Now, I started it again and tried another job. In the current state, there is no autoalloc directory in 002 or 003 (hq-current). There is only access.json.

Kobzol commented 1 year ago

That's weird, indeed no allocations are being submitted. Does this also happen with a previous (stable) version of HQ? Could you try lowering the --time-request value to e.g. 11h (while keeping --time-limit at 12h)?

Kobzol commented 1 year ago

We can also resolve this interactively on https://hyperqueue.zulipchat.com/, if you want.

svatosFZU commented 1 year ago

OK, lowering the time-request seems to help. The job is now running. Does it mean that now these two values must be different? As for the Zulip chat, I have never heard of it, so I don't even have an account there.

svatosFZU commented 1 year ago

Although, there is some odd behavior. I tried to start two more jobs (to test whether the environment would be retained for all of them). As I run a 32-core hq job in a 32-core allocation, I assumed they would be waiting. But instead, they started running. So, now I have three 32-core hq jobs running in one 32-core batch job.

svatosFZU commented 1 year ago

Regarding the last comment, here is the (hopefully) relevant part of .hq-server/hq-current/autoalloc/1/001/stderr

[2023-06-26T12:50:48.590Z DEBUG hyperqueue::worker::start] Starting program launcher task_id=1 res=ResourceRequest { n_nodes: 0, resources: [ResourceRequestEntry { resource_id: ResourceId(0), request: Compact(32) }], min_time: 39600s } alloc=Allocation { nodes: [], resources: [ResourceAllocation { resource: ResourceId(0), value: Indices([ResourceIndex(32), ResourceIndex(33), ResourceIndex(34), ResourceIndex(35), ResourceIndex(36), ResourceIndex(37), ResourceIndex(38), ResourceIndex(39), ResourceIndex(40), ResourceIndex(41), ResourceIndex(42), ResourceIndex(43), ResourceIndex(44), ResourceIndex(45), ResourceIndex(46), ResourceIndex(47), ResourceIndex(48), ResourceIndex(49), ResourceIndex(50), ResourceIndex(51), ResourceIndex(52), ResourceIndex(53), ResourceIndex(54), ResourceIndex(55), ResourceIndex(56), ResourceIndex(57), ResourceIndex(58), ResourceIndex(59), ResourceIndex(60), ResourceIndex(61), ResourceIndex(62), ResourceIndex(63)]) }], counts: ResourceCountVec { counts: IndexVec([[32]], PhantomData<tako::internal::common::resources::ResourceId>) } } body_len=301
[2023-06-26T12:54:52.228Z DEBUG hyperqueue::worker::start] Starting program launcher task_id=2 res=ResourceRequest { n_nodes: 0, resources: [ResourceRequestEntry { resource_id: ResourceId(0), request: Compact(32) }], min_time: 39600s } alloc=Allocation { nodes: [], resources: [ResourceAllocation { resource: ResourceId(0), value: Indices([ResourceIndex(0), ResourceIndex(1), ResourceIndex(2), ResourceIndex(3), ResourceIndex(4), ResourceIndex(5), ResourceIndex(6), ResourceIndex(7), ResourceIndex(8), ResourceIndex(9), ResourceIndex(10), ResourceIndex(11), ResourceIndex(12), ResourceIndex(13), ResourceIndex(14), ResourceIndex(15), ResourceIndex(16), ResourceIndex(17), ResourceIndex(18), ResourceIndex(19), ResourceIndex(20), ResourceIndex(21), ResourceIndex(22), ResourceIndex(23), ResourceIndex(24), ResourceIndex(25), ResourceIndex(26), ResourceIndex(27), ResourceIndex(28), ResourceIndex(29), ResourceIndex(30), ResourceIndex(31)]) }], counts: ResourceCountVec { counts: IndexVec([[32]], PhantomData<tako::internal::common::resources::ResourceId>) } } body_len=301
[2023-06-26T12:54:55.061Z DEBUG hyperqueue::worker::start] Starting program launcher task_id=3 res=ResourceRequest { n_nodes: 0, resources: [ResourceRequestEntry { resource_id: ResourceId(0), request: Compact(32) }], min_time: 39600s } alloc=Allocation { nodes: [], resources: [ResourceAllocation { resource: ResourceId(0), value: Indices([ResourceIndex(96), ResourceIndex(97), ResourceIndex(98), ResourceIndex(99), ResourceIndex(100), ResourceIndex(101), ResourceIndex(102), ResourceIndex(103), ResourceIndex(104), ResourceIndex(105), ResourceIndex(106), ResourceIndex(107), ResourceIndex(108), ResourceIndex(109), ResourceIndex(110), ResourceIndex(111), ResourceIndex(112), ResourceIndex(113), ResourceIndex(114), ResourceIndex(115), ResourceIndex(116), ResourceIndex(117), ResourceIndex(118), ResourceIndex(119), ResourceIndex(120), ResourceIndex(121), ResourceIndex(122), ResourceIndex(123), ResourceIndex(124), ResourceIndex(125), ResourceIndex(126), ResourceIndex(127)]) }], counts: ResourceCountVec { counts: IndexVec([[0, 32]], PhantomData<tako::internal::common::resources::ResourceId>) } } body_len=301

svatosFZU commented 1 year ago

OK, I did a little more digging and made some progress. First, I had not realized the worker nodes have 64 cores. The documentation says a 32-core CPU, so I went with that, not realizing there are two CPUs per node. So, running two 32-core jobs would be fine. But even after changing it to 64 cores, it does not work very well. The issue seems to be in how Slurm takes the info from hq-submit.sh:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=hq-alloc-1
#SBATCH --output=/home/svatosm/.hq-server/001/autoalloc/1/004/stdout
#SBATCH --error=/home/svatosm/.hq-server/001/autoalloc/1/004/stderr
#SBATCH --time=12:00:00
#SBATCH -ADD-23-14 -pp02-intel

/home/svatosm/ACmount.sh && /home/svatosm/nightly6/hq worker start --idle-timeout "5m" --time-limit "11h 59m 50s" --manager "slurm" --server-dir "/home/svatosm/.hq-server/001" --on-server-lost "finish-running"; /home/svatosm/ACumount.sh

As there is only one available worker node in the queue, it is easy to demonstrate. When I do an equivalent submission with the sbatch command (first using the number of nodes, then the number of cores), two jobs actually start running on the same worker node when the number of nodes is used:

[svatosm@login.cs ~]$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[svatosm@login.cs ~]$ sbatch -A DD-23-14 -p p02-intel -t 12:00:00 -N 1 --gres=fpga /home/svatosm/21031/slurm.sh
Submitted batch job 2997
[svatosm@login.cs ~]$ sbatch -A DD-23-14 -p p02-intel -t 12:00:00 -N 1 --gres=fpga /home/svatosm/21031/slurm.sh
Submitted batch job 2998
[svatosm@login.cs ~]$ sbatch -A DD-23-14 -p p02-intel -t 12:00:00 -N 1 --gres=fpga /home/svatosm/21031/slurm.sh
Submitted batch job 2999
[svatosm@login.cs ~]$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              2999 p02-intel slurm.sh  svatosm PD       0:00      1 (None)
              2997 p02-intel slurm.sh  svatosm  R       0:06      1 p02-intel01
              2998 p02-intel slurm.sh  svatosm  R       0:03      1 p02-intel01

If I limit it by the number of cores instead of nodes, then only one job runs:

[svatosm@login.cs ~]$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[svatosm@login.cs ~]$ sbatch -A DD-23-14 -p p02-intel -t 12:00:00 -c 64 --gres=fpga /home/svatosm/21031/slurm.sh
Submitted batch job 3001
[svatosm@login.cs ~]$ sbatch -A DD-23-14 -p p02-intel -t 12:00:00 -c 64 --gres=fpga /home/svatosm/21031/slurm.sh
Submitted batch job 3002
[svatosm@login.cs ~]$ sbatch -A DD-23-14 -p p02-intel -t 12:00:00 -c 64 --gres=fpga /home/svatosm/21031/slurm.sh
Submitted batch job 3003
[svatosm@login.cs ~]$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              3002 p02-intel slurm.sh  svatosm PD       0:00      1 (Resources)
              3003 p02-intel slurm.sh  svatosm PD       0:00      1 (Priority)
              3001 p02-intel slurm.sh  svatosm  R       0:04      1 p02-intel01

Also, would '--time-limit "11h 59m 50s"' explain why the 12h does not proceed?

svatosFZU commented 1 year ago

Just to emphasize the significance of this: it makes HyperQueue basically unusable on Slurm. Without the software mounting, there would be two processes running per core, which means the CPU efficiency would be well below 50%. With the software repo mounting, it is even worse: as both jobs try to do the repo mounting, they interfere with each other and fail within a few seconds. To make the test above work, I actually had to replace the mounting with sleep. Although, it seems to be easy to fix by using "--exclusive" along with "--nodes=1":

[svatosm@login.cs ~]$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[svatosm@login.cs ~]$ sbatch -A DD-23-14 -p p02-intel -t 12:00:00 --nodes=1 --exclusive --gres=fpga /home/svatosm/21031/slurm.sh
Submitted batch job 3013
[svatosm@login.cs ~]$ sbatch -A DD-23-14 -p p02-intel -t 12:00:00 --nodes=1 --exclusive --gres=fpga /home/svatosm/21031/slurm.sh
Submitted batch job 3014
[svatosm@login.cs ~]$ sbatch -A DD-23-14 -p p02-intel -t 12:00:00 --nodes=1 --exclusive --gres=fpga /home/svatosm/21031/slurm.sh
Submitted batch job 3015
[svatosm@login.cs ~]$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              3015 p02-intel slurm.sh  svatosm PD       0:00      1 (Resources)
              3013 p02-intel slurm.sh  svatosm  R       0:03      1 p02-intel01
              3014 p02-intel slurm.sh  svatosm  R       0:00      1 p02-intel02

Kobzol commented 1 year ago

Thanks for all the reports, I will take a look tomorrow. And we will of course not release the new version until this is resolved.

Kobzol commented 1 year ago

Ok, so there are a few things to unpack here. I will start with the time limit.

OK, lowering the time-request seems to help. The job is now running. Does it mean that now these two values must be different?

This is pre-existing behavior that wasn't changed recently. If you set the time request equal to the time limit, it basically means that if the task used exactly the amount of time specified by the time request, the queue would end at exactly the same time as the task, so the task could probably not even finish successfully. However, --time-request was intended as an upper bound, so setting it to the time limit of the queue makes sense! HQ uses a strict < check when comparing these two values, so if they are the same, the job will not be started. This is basically a bug, and it will be fixed by https://github.com/It4innovations/hyperqueue/pull/603.

Also, would '--time-limit "11h 59m 50s"' explain why the 12h does not proceed?

No, this is a new thing where we make the worker shut down a bit (10s) sooner, so that the worker stop command can execute and the worker has a bit of headroom.

I will take a look at the Slurm issue now.

Kobzol commented 1 year ago

Thank you for the detailed investigation regarding Slurm. The behavior is indeed concerning. HQ workers basically expect that they occupy the whole node. There can be multiple HQ workers per node, but they would need to divide their resources exclusively, so that there is absolutely no overlap between them. For that, we would need to somehow read information from Slurm to find out exactly which resources (e.g. CPUs) are available to a given allocation/worker. I will try to investigate whether we can do that.

If not, or if it's too complicated, I will add the --exclusive flag to Slurm submissions today and send you a new HQ build with all the mentioned fixes applied.

Kobzol commented 1 year ago

Okay, so there are basically two issues:

1. HQ doesn't realize that it might not have access to all CPU cores (when a CPU affinity mask is set). Slurm uses these affinity masks to filter the available cores for a Slurm task/job. https://github.com/It4innovations/hyperqueue/pull/604 fixes this.
2. Depending on how it is configured, Slurm can create allocations that don't take up the whole node, but e.g. have only a single core. I don't want to detect this behaviour or work around it in HQ. To keep HQ general, it should be the responsibility of the user to configure the hq alloc command so that it assigns the intended number of cores to the Slurm job.

In your situation, Slurm probably gave the allocation just a single core, and two HQ workers were thus created on the same node. Because HQ didn't realize this, it led to overlaps in worker resources, which resulted in the problems that you have mentioned. After 1) is fixed, HQ should realize that it only has a subset of resources available, and therefore there shouldn't be any overlap with other HQ workers on the same node. If this happens, you will see that e.g. jobs with 64 cores will not be executed and that the workers only have a single core available. In that case you should reconfigure the hq alloc command to force Slurm to allocate the correct number of cores (e.g. with the -c 64 flag).
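To double-check which cores a Slurm allocation actually exposes, and to force a full node, something like this should work (the non-hq commands are standard Linux/Slurm tools; the project/partition names are placeholders):

# Inside the allocation: show the CPU cores allowed by the affinity mask
$ grep Cpus_allowed_list /proc/self/status
$ taskset -cp $$

# Ask Slurm for all 64 cores of a node for each HQ worker...
$ hq alloc add slurm --time-limit 12h -- -A <project> -p <partition> -c 64
# ...or request the node exclusively
$ hq alloc add slurm --time-limit 12h -- -A <project> -p <partition> --nodes=1 --exclusive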

svatosFZU commented 1 year ago

Perfect, thanks. So, what would be the way to tell the allocation queue to use 64 cores? Something like --cpus 64 ? Let me know when there is a new nightly to test.

Kobzol commented 1 year ago

Perfect, thanks. So, what would be the way to tell the allocation queue to use 64 cores? Something like --cpus 64 ?

To be clear, this should be handled by parameters passed to Slurm, not to HQ. You could pass --cpus 64 to the autoalloc queue, and then HQ would have more information about the workers in that queue, but it does not use that information in any way at the moment. This parameter thus does not affect what HQ passes to Slurm, because there are just too many possible options, and choosing any specific ones would break the behaviour of some users. So you should pass these flags yourself, e.g. with hq alloc add slurm --time-limit 12h -- -ADD.... -c 64. I will update the documentation to cover this use-case.

Let me know when there is a new nightly to test.

Will do that :)

Kobzol commented 1 year ago

@svatosFZU Please try https://github.com/It4innovations/hyperqueue/releases/tag/nightly and let us know if it works for you now.

svatosFZU commented 1 year ago

Sure. But first I have one question about your last comment about passing Slurm flags through the allocation queue definition. I do not see it in the docs (https://it4innovations.github.io/hyperqueue/stable/deployment/allocation/), but if I understand that comment correctly, the allocation queue command should look like this:

hq alloc add slurm --time-limit 12h --worker-start-cmd "/home/svatosm/ACmount.sh" --worker-stop-cmd "/home/svatosm/ACumount.sh" -- -ADD-23-14 -pp02-intel -c 64

right? So, does this only work for CPUs, or is there a general way of passing batch system directives through HyperQueue?

Kobzol commented 1 year ago

It is written in the Allocation queue section:

To create a new allocation queue, use the following command and pass any required credentials (queue/partition name, account ID, etc.) after --. These trailing arguments will then be passed directly to qsub/sbatch:

You have already been passing arguments directly to the batch system - -ADD-23-14 and -pp02-intel are arguments passed directly to Slurm :) Everything after -- will be passed to it.

svatosFZU commented 1 year ago

OK, just to be clear on syntax: as there is no space between -A and the project, or -p and the queue, I assume the correct syntax would be -c64, right?

Kobzol commented 1 year ago

There are no specific syntax rules for the part after --; anything after -- is basically written directly into a bash script which is then submitted using sbatch. So as long as Slurm supports -c 64 with a space, it should be fine with a space here as well. If it doesn't support the space, then write it without one.
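For illustration, based on the generated hq-submit.sh shown earlier, the trailing arguments end up on an #SBATCH line roughly like this (the exact layout may differ):

#SBATCH -ADD-23-14 -pp02-intel -c 64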

svatosFZU commented 1 year ago

OK, I have tested it. The good news is that submission to Slurm now works the way it should. The bad news is that 12h jobs still cannot start and just wait (while 11h jobs start running immediately).

Kobzol commented 1 year ago

Can you verify that you can submit 12h jobs using the same configuration manually with sbatch and that they start running (immediately)?

svatosFZU commented 1 year ago

Yes, to simplify the testing, I made a sleep.sh script which just does "sleep 60". It starts immediately:

[svatosm@login.cs ~]$ sbatch -A DD-23-14 -p p02-intel -c 64 -t 12:00:00 /home/svatosm/21031/sleep.sh
Submitted batch job 3060
[svatosm@login.cs ~]$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              3060 p02-intel sleep.sh  svatosm  R       0:11      1 p02-intel01
[svatosm@login.cs ~]$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              3060 p02-intel sleep.sh  svatosm CD       1:00      1 p02-intel01

svatosFZU commented 1 year ago

Just to clarify, the Slurm job created by the allocation queue also starts running quickly. But the hq jobs are not starting in it.

Kobzol commented 1 year ago

Just to clarify, the Slurm job created by the allocation queue also starts running quickly. But the hq jobs are not starting in it.

Aha, I see! Yeah, I think I know where the problem is. The autoallocator can now submit the allocations, but the HQ scheduler is not scheduling tasks onto the worker because of its (newly reduced) time limit, which will be something like 11h 59m 50s, as you correctly remarked before.

I wonder what's the best way to resolve this. We could just go back to the previous behaviour and keep the queue limit, worker time limit and task time request all at 12h; then it should work. But it would also mean that there would be no "breathing room" for executing the worker stop command, and PBS/Slurm could kill the worker too soon.

If you used e.g. hq alloc add slurm --time-limit 12h, would you mind if HQ actually submitted a job with a max. time limit of e.g. 12h and 2 minutes? Tasks and workers would just consider 12h, but we would leave some time (e.g. 2 minutes) for clean worker shutdown, worker stop command execution, etc.

svatosFZU commented 1 year ago

Well, I understand that there is a need for some "breathing room". The thing is that it is hidden from the user. Maybe there could be a breakdown in the "hq job info" output - a running time limit and a cleaning time limit? With the job and the allocation queue set to the same time limit (usually the length of the batch queue), I guess it would be better to go below rather than above (as going above could result in the batch system killing the job). So, can the batch job be set to run for 12h and the hq job only for 11:5*?

Kobzol commented 1 year ago

Well, yes, but in that case it is the responsibility of the user (i.e. you) to ask for tasks with the time request set to a lower value. We shouldn't automagically reduce the duration of a task's time request, because we don't know anything about the task - it might not even be executed inside a batch allocation!

Good point with the queue limits; that could indeed pose a problem. In that case I think we will just have to get rid of the "breathing room". It's a best-effort thing anyway; there's no guarantee that the worker stop command will actually be executed.

Btw, this use-case (a task with a time request equal to the time limit of the allocation queue) is a bit unusual for HQ, as we usually expect that you will execute many tasks within a single allocation. That can happen even in this case, if you partition the allocation into e.g. two tasks by the number of used cores, but there probably won't be a lot of such tasks. It's a perfectly valid use-case, and it's still probably slightly easier to do with HQ than with Slurm directly, but it just surprised me :) Which is also why you have hit so many edge cases that we had to fix.

svatosFZU commented 1 year ago

I think the whole problem for users is that this is hidden from them. That is why it is a problem to do anything automatically: HyperQueue cannot do it because it does not know the tasks, and users cannot do it because the details are not known to them. Sure, I can set the job time limit to less than 12h (as long as I get the syntax right). But even with that, I do not know how long HQ would need to clean up. So, since the allocation queue now has the --worker-start-cmd and --worker-stop-cmd options, could it also have something like --allowTimeForCleanup to go with those commands? That way it is the user's decision/responsibility to set it, but HyperQueue would do the calculation of the necessary time and set the job lengths. And if the users do not use it, it is their risk.

Kobzol commented 1 year ago

We could expose some configuration parameter like that to the users, yes. Basically, it should be enough to expose the --time-limit of the worker being spawned in the allocation, under the name e.g. --worker-time-limit.

An example (assumes that users submit tasks with --time-request 12h):
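A sketch of what that could look like (the durations and the stop command here are illustrative):

# The allocation may run for up to 12h 5m, while the worker inside it only
# advertises a 12h time limit, leaving ~5 minutes for the worker stop command.
$ hq alloc add slurm --time-limit "12h 5m" --worker-time-limit 12h --worker-stop-cmd "unmount-repository" -- ...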

FWIW, even though we do not guarantee it, the cleanup should probably be executed most of the time, even without any breathing room or cleanup parameter. Unless the task really takes something like 11h 59m (which is borderline "dangerous" anyway, as it might not finish in time), the task will end sooner, and a few minutes after that the worker will end because it won't have anything to do; then the cleanup code will run and the allocation will end.

svatosFZU commented 1 year ago

That sounds reasonable. By the way, what would happen if the user defined only --worker-time-limit?

Kobzol commented 1 year ago

--time-limit is currently a required argument when defining the allocation queue, so the command would fail. We need to know the duration of the allocation when submitting it; the worker time limit is then derived from it (unless it is overridden with the --worker-time-limit argument).

svatosFZU commented 1 year ago

That is good to know. So, yes, I agree with the worker-time-limit proposal.

Kobzol commented 1 year ago

I have implemented --worker-time-limit, but after some further thinking I realized that this option will not solve your issue with matching tasks with time requests to the allocation queue (it is still useful for giving headroom to worker stop commands, though).

If you submit a task with --time-request 12h and create a queue with --time-limit 12h, HQ will not schedule this task onto workers from this queue. Actually, even if you use --time-limit "12h 1m", the task might not be scheduled there, because there will only be a one-minute window during which the scheduler knows that the worker will live for at least 12h. After that, the scheduler sees that the worker will live for less than 12h, and therefore it won't schedule tasks with a time request of 12h onto it.

Therefore I would suggest always using a time request smaller than the time limit of the allocation. I will add this information to the documentation to make it clearer.

Actually, I wonder why you use --time-request at all. What is your use-case for it?

svatosFZU commented 1 year ago

Well, the intention is to have a guaranteed amount of time for the job that runs inside HQ. I have similar time limits inside the payload and I want to be sure that the job has enough time to finish. For example, I run four 32-core jobs inside a 128-core allocation. Let's assume that the four jobs start quite close together but one of them finishes after, let's say, six hours (for some reason). I want to ensure that a new job would not start then, because the payload would expect to have more time than is left (as this number is the same for all jobs) and would inevitably fail. If there is a different way to ensure this, I could drop the time request.

Kobzol commented 1 year ago

Ensuring that there is enough time to execute a task is the intended usage of --time-request, so that is fine. Submitting a task with --time-request 1h says: if you do not have at least one hour of time, do not even bother starting this task.

It's important to think about what happens when we combine this with the worker's time limit. Starting a worker with --time-limit 1h says: you have one hour available for computing tasks. But this time changes dynamically, of course. So after one minute, that same worker only has 59 minutes left for computing tasks, and from that point on, no task with --time-request 1h will be scheduled onto it!

So using the same duration for both the --time-request of task T and the --time-limit of worker W will result in T not being scheduled onto W. It takes some time (even if just a few milliseconds) before the scheduler registers the worker and starts scheduling tasks onto it, and at that point the worker has less time left than its original time limit, and therefore T won't be scheduled onto it.

If you use e.g. --time-request 12h and --time-limit 12h 1m, then the task should be scheduled onto the worker, assuming that the task is ready for computation immediately when the worker starts, and assuming that no other task will be scheduled onto the worker instead. But that's a lot of assumptions :)

So, to sum up, I think that you're using --time-request correctly, but you should probably lower the value that you put there (or, conversely, use a higher --time-limit).
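For example, a combination along these lines (the payload script and the Slurm account/partition are placeholders):

# Queue (allocation) time limit is 12h; the task asks for strictly less time
$ hq alloc add slurm --time-limit 12h -- -A <project> -p <partition> -c 64
$ hq submit --time-request=11h30m --cpus=64 ./payload.sh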

svatosFZU commented 1 year ago

OK, I think I understand the situation. So, I tried to test the latest nightly and submitted one job:

[svatosm@login.cs ~]$ /home/svatosm/nightly8/hq alloc add slurm --backlog 1 --worker-time-limit 11h --time-limit 12h --worker-start-cmd "/home/svatosm/ACmount.sh" --worker-stop-cmd "/home/svatosm/ACumount.sh" -- -ADD-23-14 -pp02-intel -c64
2023-07-10T11:36:20Z INFO A trial allocation was submitted successfully. It was immediately canceled to avoid wasting resources.
2023-07-10T11:36:20Z INFO Allocation queue 1 successfully created
[svatosm@login.cs ~]$ /home/svatosm/nightly8/hq submit --time-request=11h --crash-limit=1 --cpus=64 /home/svatosm/21031/hq.sh
Job submitted successfully, job ID: 1
[svatosm@login.cs ~]$ /home/svatosm/nightly8/hq job list
+----+------+---------+-------+
| ID | Name | State   | Tasks |
+----+------+---------+-------+
|  1 | bash | WAITING | 1     |
+----+------+---------+-------+
[svatosm@login.cs ~]$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              3064 p02-intel hq-alloc  svatosm CA       0:00      1 
              3067 p02-intel hq-alloc  svatosm PD       0:00      1 (Resources)
              3065 p02-intel hq-alloc  svatosm  R       0:33      1 p02-intel01
              3066 p02-intel hq-alloc  svatosm  R       0:18      1 p02-intel02

HyperQueue is spawning allocations (presumably because I used backlog), but the job is still waiting even though the time limit is 12h and the time request is 11h.

Kobzol commented 1 year ago

Sorry, I think that my last explanation wasn't clear enough regarding time limits (the naming is confusing, as always... :) ). The important thing for scheduling is the time limit of the worker. For workers allocated by an allocation queue, it is by default set to the time limit of the queue itself. In this case, the time limit of the queue is 12h, but you have overridden the worker limit to be 11h. So you are back in a situation where time request == worker time limit and the task won't be scheduled. (Also, this confirms that --worker-time-limit works as it should :) ).

We have mixed several use-cases in this issue, that's probably what also caused the confusion.

To reiterate, here are the two use-cases that we have been talking about:

1. To make sure that tasks will be scheduled, use a time request lower than the time limit. You don't need to set --worker-time-limit for this, but if you do, you need to make sure that time request < --worker-time-limit.
2. If you want to give the worker stop command more time to execute, you can artificially lower the time limit of the spawned workers using --worker-time-limit. This is pretty much only useful for the worker stop command! Otherwise you don't need --worker-time-limit.

So if you remove --worker-time-limit (or set it to e.g. 11h 55m), then the task should be scheduled.

svatosFZU commented 1 year ago

Right, and as my use-case combines these two, I tried setting each limit to a different value, and it works:

[svatosm@login.cs ~]$ /home/svatosm/nightly8/hq alloc add slurm --backlog 1 --worker-time-limit 11h59m --time-limit 12h --worker-start-cmd "/home/svatosm/ACmount.sh" --worker-stop-cmd "/home/svatosm/ACumount.sh" -- -ADD-23-14 -pp02-intel -c64
2023-07-10T11:57:57Z INFO A trial allocation was submitted successfully. It was immediately canceled to avoid wasting resources.
2023-07-10T11:57:57Z INFO Allocation queue 2 successfully created
[svatosm@login.cs ~]$ /home/svatosm/nightly8/hq submit --time-request=11h30m --crash-limit=1 --cpus=64 /home/svatosm/21031/hq.sh
Job submitted successfully, job ID: 2
[svatosm@login.cs ~]$ /home/svatosm/nightly8/hq job list
+----+------+---------+-------+
| ID | Name | State   | Tasks |
+----+------+---------+-------+
|  2 | bash | WAITING | 1     |
+----+------+---------+-------+
There are 2 jobs in total. Use `--all` to display all jobs.
[svatosm@login.cs ~]$ /home/svatosm/nightly8/hq job list
+----+------+---------+-------+
| ID | Name | State   | Tasks |
+----+------+---------+-------+
|  2 | bash | RUNNING | 1     |
+----+------+---------+-------+
There are 2 jobs in total. Use `--all` to display all jobs.
[svatosm@login.cs ~]$ /home/svatosm/nightly8/hq job list --all
+----+------+----------+-------+
| ID | Name | State    | Tasks |
+----+------+----------+-------+
|  1 | bash | CANCELED | 1     |
|  2 | bash | FINISHED | 1     |
+----+------+----------+-------+

So, the last remaining question is: when can this be released?