set_gpu_default_device() returns "No GPU is available", despite torch having been imported
@eredding02 For reference, a supposedly working example:
Made some progress on this issue. First, we load the system-level CUDA and cuDNN libraries:
module load python3.10-anaconda/2023.03
module load cuda/11.8.0
module load cudnn/11.8-v8.7.0
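(Optional sanity check: the modules should export the CUDA paths into the environment. The exact variable names are site-specific, so CUDA_HOME and LD_LIBRARY_PATH below are just the ones I'd expect to see, not guaranteed.)
import os

# Variable names depend on how the Great Lakes modules are defined; these are assumptions
print(os.environ.get("CUDA_HOME"))        # path to the CUDA 11.8 install, if the module exports it
print(os.environ.get("LD_LIBRARY_PATH"))  # should include the CUDA/cuDNN library directories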
Next, we activate a virtual environment in which to test out the GPU interface:
source ~/deepsensor_env_gpu/bin/activate # assumes virtual environment already exists, with DeepSensor installed
Next, we need to install a version of pytorch that is compatible with CUDA 11.8. Using instructions from the pytorch website, I selected a stable Linux build using pip-installed Python for CUDA 11.8. The website returned this command for me to use:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
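To double-check that pip pulled the CUDA 11.8 build rather than a CPU-only wheel, a quick check like this works (the +cu118 suffix is what I'd expect from that index URL, but treat it as a rough guide):
import torch

print(torch.__version__)          # wheels from the cu118 index typically report a +cu118 suffix
print(torch.version.cuda)         # should be 11.8 for this build
print(torch.cuda.is_available())  # True only if a GPU is actually visible to the process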
Installing this way creates an environment in which the following code runs with no errors, indicating that DeepSensor is able to select the GPU defaults on U-M HPC:
import logging
logging.captureWarnings(True)
import deepsensor.torch
from deepsensor.data import DataProcessor, TaskLoader, construct_circ_time_ds
from deepsensor.model import ConvNP
from deepsensor.train import set_gpu_default_device
# Run on GPU if available by setting GPU as default device
set_gpu_default_device()
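For completeness, if the same script sometimes runs on CPU-only nodes, one option is to guard the call so it only asks for the GPU when torch can actually see one. This is just a sketch of one way to do it, not something DeepSensor requires:
import deepsensor.torch  # select the torch backend, as in the example above
import torch

from deepsensor.train import set_gpu_default_device

if torch.cuda.is_available():
    # A GPU was allocated (e.g. via the Slurm --gpus directive), so make it the default device
    set_gpu_default_device()
else:
    # Fall back to the CPU; everything still runs, just more slowly
    print("No GPU visible to torch; running on CPU")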
The original example works fine on the command line, i.e. when logging into U-M HPC via the terminal and submitting a job to the GPU queue. Here is an example Slurm script:
#!/bin/bash
#SBATCH --job-name=deepsensor_gpu_test # Descriptive job name
#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --account=account_name # Account/budget name
#SBATCH --partition=gpu # Request GPU resource
#SBATCH --gpus=1 # Total number of GPUs for the job
#SBATCH --mem=4G # Memory required for the job (per node)
#SBATCH --cpus-per-task=4 # Number of CPU cores per task
#SBATCH --time=00:30:00 # Time limit hrs:min:sec
#SBATCH --output=deepsensor_gpu_test_%j.log # Standard output and error log
pwd; hostname; date
echo "Loading modules..."
module load python3.10-anaconda/2023.03
module load cuda/11.8.0
module load cudnn/11.8-v8.7.0
echo "Activating virtual environment..."
source ~/deepsensor_env_gpu/bin/activate
echo "Running Python script..."
python ~/DeepSensor/GPU-test/deepsensor_gpu_test.py
date
Where the file deepsensor_gpu_test.py contains the Python above that executed without error. This is good news; however, this procedure does not yet work when starting up a Jupyter notebook. We might need some help here.
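The job itself is submitted with sbatch in the usual way. For debugging batch jobs, it can also help to print what Slurm actually handed the process; CUDA_VISIBLE_DEVICES is the variable Slurm normally sets when --gpus is requested, although this can vary by site. A couple of lines near the top of the test script (hypothetical additions, not part of the example above) would look like:
import os
import torch

print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))  # set by Slurm when a GPU is granted
print("GPUs visible to torch:", torch.cuda.device_count())              # should be 1 for --gpus=1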
Additional context: CUDA and cuDNN don't need to be installed via pip as they're system-level tools, and the Python ecosystem interfaces with them through environment variables and shared libraries.
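Following on from that, torch can report whether it has found a usable cuDNN, wherever it loaded it from; a quick check (assuming the CUDA-enabled wheel from earlier is installed):
import torch

print(torch.backends.cudnn.is_available())  # True if torch found a usable cuDNN
print(torch.backends.cudnn.version())       # e.g. 8700 for cuDNN 8.7.x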
NOTE: The above is just for pytorch. Using tensorflow might require a different approach.
I've submitted this as a ticket to U-M IT (6577955)
Great news! If you initialize the Jupyter Notebook session with a script that activates the pre-built virtual environment (the GPU-enabled one), all you have to do is select the appropriate kernel when creating a new notebook, as shown in the image.
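Once the notebook is up, a short cell like this confirms it really is using the GPU-enabled environment (the exact path reported by sys.executable depends on where the venv lives):
import sys
import torch

print(sys.executable)             # should point into ~/deepsensor_env_gpu
print(torch.cuda.is_available())  # should be True if the session was started on a GPU node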
Note: U-M IT directed me to this reference on HPC computing, which I found helpful:
@eredding02 I believe that I've resolved this issue. I've added lots more information on environment setup (both without and with GPUs) here:
https://github.com/CIGLR-ai-lab/GreatLakes-TempSensors/blob/main/ENVIRONMENT_SETUP.md
Let me know if you have any questions! Once you are able to run the set_gpu_default_device() function without error, let me know and we can close this issue.
@eredding02 Closing because I think we've solved this.
Feel free to re-open if there are still aspects that you'd like to explore.
Description: The current DeepSensor documentation provides a set_gpu_default_device() function but lacks comprehensive details for users to fully exploit this feature, particularly in an HPC environment where GPU resources are allocated through Slurm directives. We may need to do some experimentation and coordinate with our support communities to figure out how to get this to work.
Goals: Work out how to use set_gpu_default_device() on Great Lakes HPC, including how it behaves when no GPU has been allocated (e.g. a RuntimeError when no GPU is available).
Steps: Learn how to use set_gpu_default_device() effectively, and identify the Slurm GPU directives (--gpus, --gpus-per-node, etc.) specific to job submissions on Great Lakes HPC.
Expected Outcome: After completing this issue, we should be able to execute computational tasks (e.g. training) with GPU acceleration on Great Lakes HPC, hopefully leading to improved performance.
Additional Context: The ability to utilize GPUs can significantly speed up the training and processing times for neural networks, including those used in the DeepSensor framework. With proper documentation and support (e.g. Great Lakes HPC help desk, the U-M Coderspace slack, the DeepSensor slack), we can optimize use of the Great Lakes HPC resources for environmental machine learning tasks.