PySlurm / pyslurm

Python Interface to Slurm
https://pyslurm.github.io
GNU General Public License v2.0

[guidance request] 'gres' : 'gpu' does not have the same behavior as when using bash #337

Closed lscarton closed 5 months ago

lscarton commented 5 months ago


Issue

Let me first thank you for this amazing library.

When submitting a job that uses a GPU, I am required to add --gres=gpu.

Unfortunately, when I use PySlurm the job does not get a GPU, whereas with a bash script it does. I include both the bash and Python scripts and their respective outputs at the bottom. I have tried a few variants such as 'gres': 'gpu', 'gres': 'gpu:1', 'gres_per_node': 'gpu', ... I have also checked the GRES naming (Gres=gpu:1(S:0)) using scontrol show node nodename.

Could you please guide me? I am probably missing something trivial.

Thank you so much for your support and guidance.

Bash script:

#!/bin/bash

#SBATCH --output=slurm/cuda_test-%j.out    # STDOUT
#SBATCH --error=slurm/cuda_test-%j.err     # STDERR
#SBATCH --ntasks=32             # number of tasks
#SBATCH --mem=32G               # memory per node (units given by suffix K|M|G|T)
#SBATCH --time=0-23:00:00       # total runtime of the job allocation (format D-HH:MM:SS; leading parts optional)
#SBATCH --partition=gpu
#SBATCH --gres=gpu
#SBATCH --mail-type=ALL        
#SBATCH --mail-user= ###

source /home/$USER/.bashrc
source activate pytorchenv3.9
nvidia-smi

Output:

Fri Feb  9 17:20:48 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-PCIE-16GB           On  | 00000000:3B:00.0 Off |                  Off |
| N/A   33C    P0              25W / 250W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Python script:

import pyslurm

def main():
    job_setup = {
        "ntasks_per_node": 32,
        "job_name": "pyslurm_gpu_test",
        "partition": "gpu",
        "error": "slurm/pyslurm-test-%j.err",
        "gres": "gpu", #  which I also tried as "gres": "gpu:1"  or "gres_per_node": "gpu"
        "output": "slurm/pyslurm-test-%j.out",
        "realmem": 180000,
        "time_limit": 360,
        "wrap": f"""
            source /home/$USER/.bashrc
            source activate pytorchenv3.9
            nvidia-smi
            """,
    }

    job_id = pyslurm.job().submit_batch_job(job_setup)

if __name__ == "__main__":
    main()

Output:

No devices were found

tazend commented 5 months ago

Hi @lscarton

I see you are using the pyslurm.job class. Note that this class is deprecated and isn't really maintained anymore, so it might still contain bugs like the one you are hitting.

My suggestion would be to use pyslurm.JobSubmitDescription, a newer class created specifically for job submission (even though the docs say they are for 23.2, not much changed with 23.11, so they are still pretty accurate).

You could do something like this then:

import pyslurm

def main():
    script =  """source /home/$USER/.bashrc
source activate pytorchenv3.9
nvidia-smi
    """

    job_desc = pyslurm.JobSubmitDescription(
      gpus = 1,
      ntasks_per_node = 32,
      name = "pyslurm_gpu_test",
      standard_error = "slurm/pyslurm-test-%j.err",
      partition = "gpu",
      standard_output = "slurm/pyslurm-test-%j.out",
      memory_per_node = "180G",
      time_limit = "00:05:00",
      script = script
    )

    job_id = job_desc.submit()
    job = pyslurm.Job.load(job_id)
    print(job.gres_per_node)

if __name__ == "__main__":
    main()

You can also verify that your job actually got a GPU allocated by checking with scontrol show job.
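
For example, a quick sketch of that check (the job id is a placeholder; depending on the Slurm version the allocated GPU shows up as TresPerNode=gres/gpu:1 or in a Gres= field):

scontrol show job <jobid> | grep -iE 'tres|gres'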

lscarton commented 5 months ago

Hi @tazend, I am very grateful for your guidance! It works amazingly!

I only had to add #!/bin/bash at the beginning of the script and pay attention to its indentation; otherwise it sometimes reported sbatch: error: This does not look like a batch script. The first line must start with #! followed by the path to an interpreter. For instance: #!/bin/sh.

It was passing _validate_batch_script once the script started with #!/bin/bash, but I believe the pretty indentation was still causing problems.
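
For reference, a minimal sketch of one way to build that script string so #!/bin/bash stays on the very first line and no indentation from the Python source leaks into the batch script (using textwrap.dedent from the standard library; the paths and environment name are just the ones from the example above):

import textwrap

# Keep the shebang as the very first characters of the script and strip
# the common leading whitespace that a nicely indented triple-quoted
# string would otherwise carry into the batch script.
script = textwrap.dedent("""\
    #!/bin/bash
    source /home/$USER/.bashrc
    source activate pytorchenv3.9
    nvidia-smi
""")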

Thanks again to @tazend for the fantastic support and thanks to all the contributors to this amazing library.