PlasmaControl / DESC

Stellarator Equilibrium and Optimization Suite
MIT License

NERSC Slurm script #1325

mohawk811 opened this issue 2 days ago

mohawk811 commented 2 days ago
2024-10-24 17:38:10.460311: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
CUDA backend failed to initialize: FAILED_PRECONDITION: No visible GPU devices. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
Traceback (most recent call last):
  File "/global/u1/m/mh3705/scratch/quasi_pol/desc_10_21_24/geo_aspect_scan_gpu/quasi_pol.py", line 50, in <module>
    set_device('gpu')
  File "/global/u1/m/mh3705/scratch/DESC/desc/__init__.py", line 94, in set_device
    devices = nvgpu.gpu_info()
  File "/global/homes/m/mh3705/.conda/envs/desc_gpu-env/lib/python3.9/site-packages/nvgpu/__init__.py", line 13, in gpu_info
    lines = _run_cmd(['nvidia-smi'])
  File "/global/homes/m/mh3705/.conda/envs/desc_gpu-env/lib/python3.9/site-packages/nvgpu/__init__.py", line 32, in _run_cmd
    output = subprocess.check_output(cmd)
  File "/global/homes/m/mh3705/.conda/envs/desc_gpu-env/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/global/homes/m/mh3705/.conda/envs/desc_gpu-env/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['nvidia-smi']' returned non-zero exit status 6.
srun: error: nid003677: task 0: Exited with exit code 1
srun: Terminating StepId=32150174.0

Would it be possible for someone to share their slurm script for running DESC on a GPU at NERSC? I have tried a number of things, but this is the error I get. I am confident that my installation is fine, since the tests listed in the documentation pass.


#SBATCH --nodes=1
#SBATCH --image=ghcr.io/nvidia/jax:jax
#SBATCH --time=1-00:00:00
#SBATCH --constraint=gpu
#SBATCH --qos regular
#SBATCH --account=m4680
#SBATCH --output=log.out
#SBATCH --error=log.err

module load conda
conda activate desc_gpu-env
module load cudatoolkit/12.2
module load cudnn/8.9.3_cuda12
srun --module=gpu --image=ghcr.io/nvidia/jax:jax /global/homes/m/mh3705/.conda/envs/desc_gpu-env/bin/python quasi_pol.py

This is my current slurm script.
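For context, the traceback above shows that `set_device('gpu')` shells out to `nvidia-smi` through the `nvgpu` package, and exit status 6 means the command ran on a node with no visible GPU. A rough sketch of that failure mode with a CPU fallback (a hypothetical helper, not DESC's actual API):

```python
import shutil
import subprocess


def pick_device(preferred: str = "gpu") -> str:
    """Hypothetical fallback logic: use the GPU only if nvidia-smi
    exists on PATH and exits cleanly, otherwise run on CPU."""
    if preferred == "gpu" and shutil.which("nvidia-smi") is not None:
        try:
            subprocess.check_output(["nvidia-smi"])
            return "gpu"
        except subprocess.CalledProcessError:
            # e.g. exit status 6 on a node with no GPU allocated
            pass
    return "cpu"


print(pick_device("gpu"))
```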
rahulgaur104 commented 1 day ago

Hi @mohawk811, have you tried the instructions given here for installing DESC on Perlmutter? If not, please try those first instead of using a JAX image.

mohawk811 commented 1 day ago

Yes, I followed those instructions, and when I test the installation everything loads fine.

rahulgaur104 commented 1 day ago

OK, then use the following slurm script and you should be able to run a job.

#!/bin/bash

#SBATCH --qos=regular
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --constraint=gpu
#SBATCH --gpus 1
#SBATCH --mem=32G
#SBATCH --account=m4680
#SBATCH --time=03:25:00

export XLA_PYTHON_CLIENT_MEM_FRACTION=.93

module load cudatoolkit/12.2
module load cudnn/8.9.3_cuda12
module load python

conda activate desc-env

python3 driver_ball_omni.py
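Once the job starts, it can also help to confirm that JAX actually sees the GPU before the long run begins. A minimal sanity check (a sketch that degrades gracefully when JAX is missing or misconfigured):

```python
def report_backend() -> str:
    """Return the platform of JAX's default device, or a marker
    string if JAX cannot be imported or initialized."""
    try:
        import jax
        # "gpu" when the node and environment are set up correctly
        return jax.devices()[0].platform
    except Exception:
        return "jax-unavailable"


print(report_backend())
```

Printing this at the top of the driver script makes a silent CPU fallback easy to spot in the job log.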