PySlurm / pyslurm

Python Interface to Slurm
https://pyslurm.github.io
GNU General Public License v2.0

Job submission time limit #245

Closed davidoskky closed 1 year ago

davidoskky commented 1 year ago


Issue

I'm unable to run any job through pyslurm on the cluster where I work, and I believe this is connected to setting the job's time limit. On my cluster, jobs without a time limit are rejected automatically with this message: sbatch: error: Batch job submission failed: Time limit specification required, but not provided.

Working sbatch command

sbatch --time=01:00:00 --mem=2 --wrap="sleep 100"

Failing pyslurm script

import pyslurm
test_job = {
    "time": "01:00:00",
    "mem": 2,
    "wrap": "sleep 100",
}
test_job_id = pyslurm.job().submit_batch_job(test_job)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyslurm/pyslurm.pyx", line 2888, in pyslurm.pyslurm.job.submit_batch_job
ValueError: ('Batch job submission failed: %s', None)

Is this actually related to setting the time limit, or could it be some other problem? I tried replacing time with other keys I found while browsing the repository, such as time_limit, time_limit_str, and a few others, but none worked.

tazend commented 1 year ago

Hi @davidoskky

Well, I just had a look at the current code. It turns out it lacks some error handling... Options like "time_limit" (and many others) currently accept only raw integer values. So to set time_limit to 1 hour, you need to specify the time in minutes, which in this case is 60.

If you specify anything other than an integer, a TypeError is raised (detected by Cython internally), because the code looks like this:

if job_opts.get("time_limit"):
    desc.time_limit = job_opts.get("time_limit")

Passing something like "01:00:00" would try to assign a string to a raw uint32_t value (which is what time_limit actually is internally in Slurm), and that assignment fails.
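
Until strings are accepted, the conversion to minutes can be done on the caller's side. The following is a minimal sketch, not part of pyslurm: slurm_time_to_minutes is a hypothetical helper name, the accepted formats are assumed to be the ones sbatch --time documents, and rounding leftover seconds up to a whole minute is an assumption.

```python
def slurm_time_to_minutes(value):
    """Convert a Slurm-style time limit to whole minutes.

    Hypothetical helper, not part of pyslurm. Assumed formats:
    "minutes", "minutes:seconds", "hours:minutes:seconds",
    "days-hours", "days-hours:minutes", "days-hours:minutes:seconds".
    """
    # Plain integers are taken to be minutes already.
    if isinstance(value, int) and not isinstance(value, bool):
        if value < 0:
            raise ValueError(f"time limit must not be negative: {value!r}")
        return value
    if not isinstance(value, str):
        raise TypeError(f"expected int or str, got {type(value).__name__}")

    text = value.strip()
    days = 0
    if "-" in text:
        # With a day prefix, the remaining fields start at hours.
        day_part, text = text.split("-", 1)
        days = int(day_part)
        parts = [int(p) for p in text.split(":")]
        if len(parts) > 3:
            raise ValueError(f"unrecognized time format: {value!r}")
        parts += [0] * (3 - len(parts))
        hours, minutes, seconds = parts
    else:
        parts = [int(p) for p in text.split(":")]
        if len(parts) == 1:
            hours, minutes, seconds = 0, parts[0], 0
        elif len(parts) == 2:
            hours, (minutes, seconds) = 0, parts
        elif len(parts) == 3:
            hours, minutes, seconds = parts
        else:
            raise ValueError(f"unrecognized time format: {value!r}")

    total_seconds = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    # Assumption: leftover seconds round up to the next whole minute.
    return (total_seconds + 59) // 60


print(slurm_time_to_minutes("01:00:00"))  # 60
```

The result can then be passed as the "time_limit" value in the dictionary given to submit_batch_job, in place of the "time" key from the failing script.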

However, the TypeError is only detected and then ignored, and the program keeps running (even though it shouldn't), because the code is inside a cdef function that returns a C type. By default, functions returning a C type ignore exceptions raised inside them. So execution continues and passes a pointer that is not completely set up to slurmctld, which rejects it, and you are left with a not very detailed error message.
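
Until the error handling is fixed, one way to get a clearer message is to check the options in Python before submitting. This is a minimal sketch under stated assumptions: check_integer_options and INTEGER_ONLY_OPTIONS are hypothetical names, not part of pyslurm, and "time_limit" is the only option this thread confirms as integer-only. The range check follows from the uint32_t type mentioned above.

```python
import numbers

# Assumption: only "time_limit" is confirmed in this thread as integer-only;
# extend the tuple for other options after checking the pyslurm source.
INTEGER_ONLY_OPTIONS = ("time_limit",)

UINT32_MAX = 0xFFFFFFFF


def check_integer_options(job_opts, names=INTEGER_ONLY_OPTIONS):
    """Raise a descriptive error if an integer-only option has a bad value.

    Hypothetical pre-submission check, not part of pyslurm.
    Returns job_opts unchanged when every checked option is valid.
    """
    for name in names:
        if name not in job_opts:
            continue
        value = job_opts[name]
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, numbers.Integral):
            raise TypeError(
                f"{name!r} must be an integer, "
                f"got {type(value).__name__}: {value!r}"
            )
        if not 0 <= value <= UINT32_MAX:
            raise ValueError(f"{name!r} is out of range for uint32: {value!r}")
    return job_opts


# A valid dictionary passes through unchanged.
job = check_integer_options({"time_limit": 60, "wrap": "sleep 100"})

# A string value fails here, with a message that names the option.
try:
    check_integer_options({"time_limit": "01:00:00", "wrap": "sleep 100"})
except TypeError as exc:
    print(exc)
```

Running the check right before submit_batch_job means a wrong type raises in Python with the option's name, instead of surfacing later as a generic submission failure.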

So yeah, error handling needs to be improved, but that doesn't change the fact that the types of values that can be passed are pretty restricted and unexpected right now (setting time_limit should be possible via both a string and a raw integer value, for example).

I have lately been reworking and improving the job submission API (https://github.com/PySlurm/pyslurm/issues/224; work in progress, currently just for Slurm 22.05), and I hope to get it ready soon.

davidoskky commented 1 year ago

Thank you very much, I was able to make it work using the number of minutes. Thank you for the great work you're doing on this wonderful project.