facebookincubator / submitit

Python 3.8+ toolbox for submitting jobs to Slurm
MIT License

Submitit with sbatch #1726

Open pfrwilson opened 1 year ago

pfrwilson commented 1 year ago

Hello, thanks for the great project!

I am working with a slurm cluster which requires compute jobs to be submitted with sbatch rather than srun. It appears that submitit uses srun to submit jobs. Therefore it crashes with a gres-not-available error when I use submitit. Is sbatch supported with submitit?

Thanks in advance!

Paul

hkayabilisim commented 1 year ago

The same thing happens to me too!

I think the library submits via sbatch, but inside the batch script the python command is sent with srun, and that srun call does not recognize the gres specification. The exact error message is:

srun: error: Unable to create step for job 25393: Invalid generic resource (gres) specification

jrapin commented 1 year ago

Hello, I've never actually used sbatch without srun, nor seen any slurm cluster which did not support srun. Are you sure srun is unavailable, and not some other resource (partition, gpus, etc.)? Which version of slurm are you using? If this is the case, this is probably not something we will support unless someone wants to provide a fix (I don't have the bandwidth for this, sorry).

pfrwilson commented 1 year ago

Hi,

Yes, my slurm cluster fails to allocate GPUs if a job is submitted through srun directly rather than as a batch script through sbatch. The srun command can be used inside the slurm batch submission, but if it is run by itself (e.g. srun main.py), prompting the system for synchronous output, the system fails to allocate GRES. I'll have to ask the administrator why the choice was made not to support srun. Typical wait times on the system are long anyway.

I was wondering if submitit could submit the job as you would an sbatch script (e.g. sbatch main.py), but it sounds like that is not supported... no worries!

gwenzek commented 1 year ago

Submitit is calling sbatch submission.sh. You can find the submission.sh file in the log folder. The generated sbatch does contain a call to srun, but that's standard behavior AFAIK.
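To see which resource options and which srun call ended up in a generated script, the file can be scanned for `#SBATCH` directives and `srun` lines. The following is a minimal sketch using only the standard library. The helper names (`summarize_submission_script`, `find_submission_scripts`), the `*submission.sh` filename pattern, and the example script content are assumptions made for illustration, not taken from submitit itself or from a real run.

```python
from pathlib import Path


def summarize_submission_script(text: str) -> dict:
    """Split a batch script into its #SBATCH directives and its srun lines."""
    directives = []
    srun_lines = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#SBATCH"):
            # Keep only the option part, e.g. "--gres=gpu:1".
            directives.append(line[len("#SBATCH"):].strip())
        elif line == "srun" or line.startswith("srun "):
            srun_lines.append(line)
    return {"directives": directives, "srun": srun_lines}


def find_submission_scripts(log_folder: str) -> list:
    """List candidate scripts in a log folder.

    Assumption: generated scripts have names ending in "submission.sh".
    """
    return sorted(Path(log_folder).rglob("*submission.sh"))


# Illustrative script content (hypothetical, not copied from a real run).
example = """#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --gres=gpu:1
#SBATCH --time=10
srun --unbuffered python -u main.py
"""

summary = summarize_submission_script(example)
print(summary["directives"])
print(summary["srun"])
```

Comparing the listed directives (in particular any gres-related one) against what the cluster documentation accepts may help narrow down whether the failure comes from the resource specification rather than from srun itself.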

hkayabilisim commented 1 year ago

I've realized that in the SLURM installation used in our supercomputing center, the resource definition (gres) used in submitit causes an error. This is something related to the configuration of SLURM in our center.

pfrwilson commented 1 year ago

If submitit is calling sbatch internally, that shouldn’t be the issue then I guess.

Maybe the issue I am having is more similar to what Huseyin reports. I will check out the submission.sh file and see whether running it directly throws errors.

Paul Wilson
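One way to check a generated script without actually queuing a job is sbatch's `--test-only` option, which asks Slurm to validate the script and report when it could be scheduled. The sketch below wraps that call in Python. The function names are hypothetical, the script path is a placeholder, and whether the validation surfaces a gres problem depends on the cluster configuration, so treat it as a starting point rather than a definitive diagnostic.

```python
import shutil
import subprocess


def build_check_command(script_path: str) -> list:
    """Build the sbatch command that validates a script without submitting it."""
    return ["sbatch", "--test-only", script_path]


def check_submission_script(script_path: str):
    """Run the validation, or return None when sbatch is not on PATH."""
    if shutil.which("sbatch") is None:
        return None
    return subprocess.run(
        build_check_command(script_path),
        capture_output=True,
        text=True,
        check=False,  # inspect returncode and stderr instead of raising
    )


# Placeholder path: point this at the submission.sh in your own log folder.
command = build_check_command("submission.sh")
print(" ".join(command))
```

On a login node, `check_submission_script(path)` returns a `CompletedProcess` whose `returncode` and `stderr` show whether Slurm accepted the script.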


From: Huseyin Kaya
Sent: Thursday, March 2, 2023 10:40:38 AM
To: facebookincubator/submitit
Cc: Paul Wilson; Author
Subject: Re: [facebookincubator/submitit] Submitit with sbatch (Issue #1726)

I've realized that in the SLURM installation used in our supercomputing center, the resource definition (gres) used in submitit causes an error. This is something related to the configuration of SLURM in our center.
