Open pfrwilson opened 1 year ago
The same thing happens to me too!
I think the library submits via sbatch, but inside the batch script the python command is sent with srun, and it does not recognize gres. The exact error message is:
srun: error: Unable to create step for job 25393: Invalid generic resource (gres) specification
Hello, I've never actually used sbatch without srun, nor seen any slurm cluster that did not support srun. Are you sure srun is unavailable, and not some other resource (partition, GPUs, etc.)? Which version of slurm are you using? If this is the case, this is probably not something we will support unless someone wants to provide a fix (I don't have the bandwidth for this, sorry).
Hi,
Yes, my slurm cluster fails to allocate GPUs if a job is submitted through srun directly rather than as a batch script through sbatch. The srun command can be used inside the slurm batch submission, but if it is run by itself (e.g. srun main.py), prompting the system for synchronous output, the system fails to allocate GRES. I'll have to ask the administrator why the choice was made not to support srun. Typical wait times on the system are long anyway.
I was wondering if submitit could submit the job as you would an sbatch script (e.g. sbatch main.py), but it sounds like that is not supported... no worries!
Submitit is calling sbatch submission.sh. You can find the submission.sh file in the log folder.
The generated sbatch does contain a call to srun, but that's standard behavior AFAIK.
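To see exactly which resources the generated script requests and where srun is invoked, something like the rough sketch below could help. It only reads files: it collects the `#SBATCH` directives and any srun lines from scripts in the log folder. The helper names `resource_lines` and `inspect_log_folder` are made up for illustration, and the `*submission.sh` glob pattern is an assumption based on the file name mentioned above.

```python
from pathlib import Path


def resource_lines(script_text: str) -> list[str]:
    """Return the #SBATCH directives and srun calls found in a batch script."""
    keep = []
    for line in script_text.splitlines():
        stripped = line.strip()
        # Keep scheduler directives and any line that invokes srun as a command word.
        if stripped.startswith("#SBATCH") or "srun" in stripped.split():
            keep.append(stripped)
    return keep


def inspect_log_folder(folder: str) -> dict[str, list[str]]:
    """Map each submission script in `folder` to its resource-related lines."""
    # Assumption: the scripts of interest have names ending in "submission.sh".
    return {
        path.name: resource_lines(path.read_text())
        for path in sorted(Path(folder).glob("*submission.sh"))
    }
```

Calling `inspect_log_folder("path/to/log_folder")` (placeholder path) and looking for a `--gres` entry in the result should show whether the failing request comes from the `#SBATCH` header or from the srun line.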
I've realized that in the SLURM installation used in our supercomputing center, the resource definition (gres) used in submitit causes an error. This is something related to the configuration of SLURM in our center.
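One way to check for this kind of mismatch is to compare the gres name being requested with what the cluster advertises. The sketch below parses the output of `sinfo -h -o "%G"` (the `%G` field lists the generic resources of the nodes) and reports whether a requested name such as `gpu` appears. The function names are hypothetical and the parsing is deliberately simple, so treat it as a starting point rather than a complete GRES parser.

```python
import subprocess


def parse_gres_names(sinfo_output: str) -> set[str]:
    """Collect GRES names (e.g. "gpu") from `sinfo -h -o "%G"` output."""
    names = set()
    for line in sinfo_output.splitlines():
        for entry in line.strip().split(","):
            name = entry.split(":")[0].strip()
            # Skip "(null)" and fragments that do not start with a letter.
            if name and name[0].isalpha():
                names.add(name)
    return names


def gres_is_advertised(requested: str, sinfo_output: str) -> bool:
    """True if the name part of a request like "gpu:1" is advertised."""
    return requested.split(":")[0] in parse_gres_names(sinfo_output)


def query_sinfo() -> str:
    """Run sinfo on a machine with Slurm installed (not exercised here)."""
    return subprocess.run(
        ["sinfo", "-h", "-o", "%G"], capture_output=True, text=True, check=True
    ).stdout
```

On a login node, `gres_is_advertised("gpu:1", query_sinfo())` returning False would suggest the cluster exposes its GPUs under a different name or does not use gres for them at all.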
If submitit is calling sbatch internally, that shouldn’t be the issue then I guess.
Maybe the issue I am having is more similar to what Huseyin reports. I will check out the submission.sh file and see whether running it directly throws errors.
Paul Wilson
From: Huseyin Kaya. Sent: Thursday, March 2, 2023 10:40:38 AM. To: facebookincubator/submitit. Cc: Paul Wilson; Author. Subject: Re: [facebookincubator/submitit] Submitit with sbatch (Issue #1726)
I've realized that in the SLURM installation used in our supercomputing center, the resource definition (gres) used in submitit causes an error. This is something related to the configuration of SLURM in our center.
Hello, thanks for the great project!
I am working with a slurm cluster which requires compute jobs to be submitted with sbatch rather than srun. It appears that submitit uses srun to submit jobs. Therefore it crashes with a gres-not-available error when I use submitit. Is sbatch supported with submitit?
Thanks in advance!
Paul