PySlurm / pyslurm

Python Interface to Slurm
https://pyslurm.github.io
GNU General Public License v2.0

Submitting batch job fails randomly with broken paths #260

Open Sideboard opened 1 year ago

Sideboard commented 1 year ago


Issue

Submitting jobs (via both script and wrap) fails randomly. An immediate indicator is that work_dir (and other paths like std_out and std_err) are broken strings in those cases:

>>> psj = pyslurm.job() ; jid = psj.submit_batch_job({'wrap': 'sleep 5'}) ; job = psj.find_id(jid)[0] ; print(jid, job['job_state'], job['work_dir'])
7727734 PENDING /correct/work/dir
>>> psj = pyslurm.job() ; jid = psj.submit_batch_job({'wrap': 'sleep 5'}) ; job = psj.find_id(jid)[0] ; print(jid, job['job_state'], job['work_dir'])
7727735 PENDING ��͜�
>>> psj = pyslurm.job() ; jid = psj.submit_batch_job({'wrap': 'sleep 5'}) ; job = psj.find_id(jid)[0] ; print(jid, job['job_state'], job['work_dir'])
7727736 PENDING ��͜�
>>> psj = pyslurm.job() ; jid = psj.submit_batch_job({'wrap': 'sleep 5'}) ; job = psj.find_id(jid)[0] ; print(jid, job['job_state'], job['work_dir'])
7727737 PENDING ��͜�
>>> psj = pyslurm.job() ; jid = psj.submit_batch_job({'wrap': 'sleep 5'}) ; job = psj.find_id(jid)[0] ; print(jid, job['job_state'], job['work_dir'])
7727738 PENDING /correct/work/dir

For a failing job:

>>> bytes(job['work_dir'], 'utf-8')
b'\xef\xbf\xbd\xef\xbf\xbd/\xef\xbf\xbd\xef\xbf\xbd\x7f'
>>> bytes(job['std_out'], 'utf-8')
b'\xef\xbf\xbd\xef\xbf\xbd/\xef\xbf\xbd\xef\xbf\xbd\x7f/slurm-7736231.out'
>>> bytes(job['std_err'], 'utf-8')
b'\xef\xbf\xbd\xef\xbf\xbd/\xef\xbf\xbd\xef\xbf\xbd\x7f/slurm-7736231.out'

Any idea what is going wrong?

tazend commented 1 year ago

Hi,

could you try out the slurm-20.11.8 branch? (https://github.com/PySlurm/pyslurm/tree/slurm-20.11.8) It is some commits ahead of the tag, perhaps this can do the trick.

So you submitted the jobs from the directory /correct/work/dir, right? Does scontrol show job <id> show the correct paths, or is it also broken there? I'll also try to reproduce it on my side.

Sideboard commented 1 year ago

The strings are also broken in sacct and scontrol.

$ scontrol show job 7750363
⋮
   WorkDir=���*�
   StdErr=���*�/slurm-7750363.out
   StdIn=/dev/null
   StdOut=���*�/slurm-7750363.out
⋮
Sideboard commented 1 year ago

I switched to branch slurm-20.11.8 but that did not help. Could it be a mismatch with C string lengths? How can I debug this?

submit_batch_job() also ignores the option 'time': '00:02:00'. It uses the default time instead. Could there be a connection?

pyslurm.job().submit_batch_job({
    'wrap': 'sleep 60',
    'time': '00:02:00',
})
mcsloy commented 1 year ago

Looking at the byte data you can see the recurring byte pattern EF BF BD. This is the UTF-8 encoding of the Unicode REPLACEMENT CHARACTER (U+FFFD), which is used during encoding and decoding to replace erroneous data. For example, FF FF FF is not valid UTF-8 and will thus be replaced by EF BF BD during an encode/decode attempt that uses the "replace" error handler. No error is raised during execution because PySlurm uses replacement-based error handling, i.e. .encode("UTF-8", "replace"). While these characters emerge during the encoding/decoding process, the non-deterministic nature of the error suggests that the encoding issue is only a symptom. The most likely culprit is either an overflow error or the memory location being freed up before the code is actually done with the data stored there.
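A minimal illustration of that behaviour (plain Python, unrelated to PySlurm internals) - note how the round-tripped bytes match the ones observed above:

# Invalid UTF-8 bytes decoded with the "replace" error handler silently
# turn into U+FFFD instead of raising UnicodeDecodeError.
garbage = b"\xff\xff/\xff\xff\x7f"
decoded = garbage.decode("utf-8", "replace")
print(decoded)                    # '\ufffd\ufffd/\ufffd\ufffd\x7f'
print(decoded.encode("utf-8"))    # b'\xef\xbf\xbd\xef\xbf\xbd/\xef\xbf\xbd\xef\xbf\xbd\x7f'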

I suspect this issue is localised to the work_dir variable only, with the std_out and std_err errors likely arising because those paths are created by appending other strings to the erroneous work_dir.

tazend commented 1 year ago

Hi,

yeah, the culprit is definitely work_dir, with std_err and std_out only being side effects, since slurmctld puts the logs into the work dir by default.

The encoding step itself should be fine; however, it likely has to do with the lifetime of the char* pointer for work_dir, since this is done in the code (if no work_dir has been specified, we just use the current working directory as the default, as sbatch does):

cwd = os.getcwd().encode("UTF-8", "replace")
desc.work_dir = cwd

This itself is fine, however this code is in a different function than the one actually submitting the job. By the time the function containing this code (fill_job_desc_from_opts) is done, desc.work_dir is basically undefined: the lifetime of desc.work_dir is tied to cwd (a Python object) which has no more references, so Python's garbage collector may free the memory at any time - hence the random failures; sometimes it survives long enough, sometimes it doesn't.
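Spelled out as a simplified sketch (Python stand-ins for the Cython code; fill_job_desc_from_opts and desc.work_dir are the real names, the rest is illustrative):

import os

class JobDesc:
    # Stand-in for the C job_desc_msg_t struct; in the real Cython code
    # work_dir is a char* that merely points into a Python bytes buffer.
    work_dir = None

def fill_job_desc_from_opts(desc, job_opts):
    cwd = os.getcwd().encode("UTF-8", "replace")  # temporary bytes object
    desc.work_dir = cwd  # Cython stores only the char* into cwd's buffer
    # when this function returns, cwd loses its last reference, the buffer
    # may be freed, and desc.work_dir is left dangling until submit time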

You won't see this behaviour when you explicitly specify work_dir, though - the Python object will live long enough since it is in the dict you pass when calling submit_batch_job.
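For reference, a sketch of what explicitly specifying work_dir looks like with this API (os.getcwd() used here only as an example path):

import os
import pyslurm

# The options dict keeps the path string referenced for the whole call.
opts = {'wrap': 'sleep 5', 'work_dir': os.getcwd()}
jid = pyslurm.job().submit_batch_job(opts)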

Anyway, in this case a quick fix in the code would be to modify the incoming dict of user options and manually insert work_dir if nothing is specified, setting it to the output of getcwd and using that for encoding, as it will live long enough. Not the nicest approach, since we manipulate the incoming input, but it will suffice in this case. I can make a fix for it.
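A rough sketch of that quick fix (hypothetical helper name and shape; the actual patch may differ):

import os

def _ensure_work_dir(job_opts):
    # Hypothetical helper: insert a default work_dir into the user's options
    # dict so the resulting Python string stays referenced (and thus alive)
    # until the job has actually been submitted.
    if 'work_dir' not in job_opts:
        job_opts['work_dir'] = os.getcwd()
    return job_opts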

The long-term fix would be to restructure the job API in a way that things like this can't happen anymore (working on it).

The time option from sbatch is actually called time_limit in pyslurm.
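So the earlier call would become something like this (the exact value format for time_limit - integer minutes vs. a time string - is an assumption here; check the pyslurm docs):

import pyslurm

pyslurm.job().submit_batch_job({
    'wrap': 'sleep 60',
    'time_limit': 2,   # assumed to be minutes, matching Slurm's time_limit field
})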

Sideboard commented 1 year ago

The problem still persists if work_dir is included in the job options:

Submitted job 7762983 with {'wrap': 'sleep 5', 'work_dir': '/my/work/dir', 'get_user_env_time': -1}
7762983 | PENDING | 2880 | ��+�

So even the job_opts dict is garbage collected? Or at least in the context of a Flask application. :hourglass_flowing_sand: No, it makes sense, since in both cases a new object is created through encode() and mapped to desc.work_dir within fill_job_desc_from_opts().

Sideboard commented 1 year ago

The time option from sbatch is actually called time_limit in pyslurm.

Oh, time_limit, thanks. I thought it was time, as with sbatch --time, since the docstring for submit_batch_job says:

Submit batch job.
* make sure options match sbatch command line opts and not struct member names.
tazend commented 1 year ago

Mh weird,

I can replicate the erroneous symbols if I don't supply a work_dir, however explicitly setting it works for me:

import pyslurm; psj = pyslurm.job() ; jid = psj.submit_batch_job({'wrap': 'sleep 5', 'work_dir': '/my/work/dir', 'get_user_env_time': -1}) ; job = psj.find_id(jid)[0] ; print(jid, job['job_state'], job['work_dir'])

mh - wondering why it's not working for you with that. (I'm on 22.05, though it is still the same code in pyslurm.)