CIGLR-ai-lab / GreatLakes-TempSensors

Collaborative repository for optimizing the placement of temperature sensors in the Great Lakes using the DeepSensor machine learning framework. The aim is to enhance the quantitative understanding of surface temperature variability for better environmental monitoring and decision-making.
MIT License

Experiment with GPUs for DeepSensor training on Great Lakes HPC #22

Closed: DaniJonesOcean closed this issue 2 months ago

DaniJonesOcean commented 2 months ago

Description: The current DeepSensor documentation provides a set_gpu_default_device() function but lacks comprehensive details for users to fully exploit this feature, particularly in an HPC environment where GPU resources are allocated through Slurm directives. We may need to do some experimentation and coordinate with our support communities to figure out how to get this to work.

Goals:

Steps:

  1. Investigate the current support for GPU utilization in DeepSensor and identify gaps in the documentation.
  2. If needed, collaborate with the maintainers of DeepSensor and the Great Lakes HPC support staff (and coderspace) to understand best practices for GPU usage in the environment.
  3. Draft clear instructions for users, including how to:
    • Use set_gpu_default_device() effectively.
    • Translate GPU requests into Slurm directives (--gpus, --gpus-per-node, etc.) specific to job submissions on Great Lakes HPC.
    • Handle errors and troubleshoot common GPU resource allocation issues.

Expected Outcome: After completing this issue, we should be able to execute computational tasks (e.g. training) with GPU acceleration on Great Lakes HPC, hopefully leading to improved performance.

Additional Context: The ability to utilize GPUs can significantly speed up the training and processing times for neural networks, including those used in the DeepSensor framework. With proper documentation and support (e.g. Great Lakes HPC help desk, the U-M Coderspace slack, the DeepSensor slack), we can optimize use of the Great Lakes HPC resources for environmental machine learning tasks.

DaniJonesOcean commented 2 months ago

set_gpu_default_device() returns "No GPU is available", despite torch having been imported
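
A quick way to narrow this down (a minimal diagnostic sketch, independent of DeepSensor) is to check whether PyTorch itself can see a GPU from the current session:

import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version the wheel was built against (None for CPU-only builds)
print(torch.cuda.is_available())  # False means PyTorch cannot see a GPU from this session
print(torch.cuda.device_count())  # number of visible GPUs

If is_available() returns False on a GPU node, the installed wheel is likely a CPU-only build, or the CUDA modules are not loaded in that session.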

DaniJonesOcean commented 2 months ago

@eredding02 For reference, a supposedly working example:

https://alan-turing-institute.github.io/deepsensor/user-guide/acquisition_functions.html?highlight=set_gpu_default_device#set-up

DaniJonesOcean commented 2 months ago

Made some progress on this issue. First, we load the system-level CUDA and cuDNN libraries:

module load python3.10-anaconda/2023.03
module load cuda/11.8.0
module load cudnn/11.8-v8.7.0
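
As an optional sanity check (this assumes the cuda module puts the CUDA toolkit on the PATH), you can confirm the toolkit is visible:

which nvcc
nvcc --version   # should report release 11.8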

Next, we activate a virtual environment in which to test out the GPU interface:

source ~/deepsensor_env_gpu/bin/activate   # assumes virtual environment already exists, with DeepSensor installed
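
If the virtual environment does not exist yet, it can be created first. A sketch, assuming the same path used throughout this thread and that DeepSensor installs from PyPI as the deepsensor package:

python -m venv ~/deepsensor_env_gpu
source ~/deepsensor_env_gpu/bin/activate
pip install deepsensor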

Next, we need to install a version of PyTorch that is compatible with CUDA 11.8. Following the instructions on the PyTorch website, I selected a stable Linux build installed via pip for CUDA 11.8. The website returned this command:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
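
To confirm that the CUDA-enabled build was actually installed, a quick one-line check (note that torch.cuda.is_available() will only return True on a node that has a GPU):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"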

This creates an environment in which the following code runs with no errors, indicating that DeepSensor is able to select the GPU defaults on U-M HPC:

import logging
logging.captureWarnings(True)

import deepsensor.torch
from deepsensor.data import DataProcessor, TaskLoader, construct_circ_time_ds
from deepsensor.model import ConvNP
from deepsensor.train import set_gpu_default_device

# Run on GPU if available by setting GPU as default device
set_gpu_default_device()
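
To verify that the GPU really did become the default device, one quick follow-up check (a sketch, assuming set_gpu_default_device() sets the PyTorch default device, as the DeepSensor docs describe) is to create a tensor and inspect where it lands:

import torch

x = torch.zeros(1)
print(x.device)   # expect "cuda:0" (or similar) once the GPU is the default device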

This works fine on the command line, i.e. when logging into U-M HPC via the terminal and submitting a job to the GPU queue. Here is an example Slurm script:

#!/bin/bash
#SBATCH --job-name=deepsensor_gpu_test  # Descriptive job name
#SBATCH --mail-type=END,FAIL            # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --account=account_name               # Account/budget name
#SBATCH --partition=gpu                 # Request GPU resource
#SBATCH --gpus=1                        # Total number of GPUs for the job
#SBATCH --mem=4G                        # Memory for the job (per node)
#SBATCH --cpus-per-task=4               # Number of CPU cores per task
#SBATCH --time=00:30:00                 # Time limit hrs:min:sec
#SBATCH --output=deepsensor_gpu_test_%j.log # Standard output and error log

pwd; hostname; date

echo "Loading modules..."
module load python3.10-anaconda/2023.03
module load cuda/11.8.0
module load cudnn/11.8-v8.7.0

echo "Activating virtual environment..."
source ~/deepsensor_env_gpu/bin/activate

echo "Running Python script..."
python ~/DeepSensor/GPU-test/deepsensor_gpu_test.py

date

Here deepsensor_gpu_test.py contains the Python snippet above, which executed without error. This is good news; however, this procedure does not yet work when starting up a Jupyter notebook. We might need some help here.
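
For reference, submitting and monitoring a job like this from a login node looks roughly like the following (a sketch; the Slurm script filename is just an example):

sbatch deepsensor_gpu_test.sh        # submit the job (script name here is just an example)
squeue -u $USER                      # check job status
cat deepsensor_gpu_test_<jobid>.log  # inspect the log written by --output once the job finishes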

Additional context: CUDA and cuDNN don't need to be installed via pip as they're system-level tools, and the Python ecosystem interfaces with them through environment variables and shared libraries.

DaniJonesOcean commented 2 months ago

NOTE: The above is just for PyTorch. Using TensorFlow might require a different approach.

DaniJonesOcean commented 2 months ago

I've submitted this as a ticket to U-M IT (6577955)

https://teamdynamix.umich.edu/TDClient/30/Portal/Requests/TicketRequests/TicketDet?TicketID=hUP8tF0CvR0_

DaniJonesOcean commented 2 months ago

[Screenshot (2024-06-27): Jupyter kernel selection menu showing the GPU-enabled virtual environment as an available kernel]

Great news! If you initialize the Jupyter Notebook session with a script that activates the pre-built virtual environment (the GPU-enabled one), all you have to do is select the appropriate kernel when creating a new notebook, as shown in the screenshot above.
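
For reference, a minimal sketch of such a startup script (it just repeats the module loads and environment activation used above; the filename and exact contents are an example, not a prescribed setup):

#!/bin/bash
# Example Jupyter session startup script: load the same modules and activate
# the same virtual environment used for the command-line workflow above.
module load python3.10-anaconda/2023.03
module load cuda/11.8.0
module load cudnn/11.8-v8.7.0
source ~/deepsensor_env_gpu/bin/activate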

DaniJonesOcean commented 2 months ago

Note: U-M IT directed me to this reference on HPC computing, which I found helpful:

https://coerc.engin.umich.edu/intro-to-hpc/

DaniJonesOcean commented 2 months ago

@eredding02 I believe that I've resolved this issue. I've added lots more information on environment setup (both without and with GPUs) here:

https://github.com/CIGLR-ai-lab/GreatLakes-TempSensors/blob/main/ENVIRONMENT_SETUP.md

Let me know if you have any questions! Once you are able to run the set_gpu_default_device() function without error, let me know and we can close this issue.

DaniJonesOcean commented 2 months ago

@eredding02 Closing because I think we've solved this.

Feel free to re-open if there are still aspects that you'd like to explore.