GoogleCloudPlatform / cluster-toolkit

Cluster Toolkit is open-source software offered by Google Cloud that makes it easy for customers to deploy AI/ML and HPC environments on Google Cloud.

No CUDA devices visible with A2 instances #2634

Closed · msis closed this issue 3 months ago

msis commented 3 months ago

Describe the bug

Nodes launched with a modified version of ./examples/ml_slurm.yaml do not appear to see the GPU via CUDA.

Steps to reproduce

  1. Create and deploy a cluster with the blueprint ml_slurm_a100.yaml below.
  2. SSH to the login node.
  3. Start an interactive shell on an A2 node: srun --partition a10040g1gpu --pty bash -i
  4. conda activate pytorch
  5. Run nvidia-smi, or in a Python console run import torch; torch.cuda.is_available() (see the consolidated session sketch below).
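
Put together, the reproduction from the login node looks roughly like this (a sketch; the partition name comes from the blueprint below):

$ srun --partition a10040g1gpu --pty bash -i   # interactive shell on an a2-highgpu-1g node
$ conda activate pytorch
$ nvidia-smi
$ python -c "import torch; print(torch.cuda.is_available())"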

Expected behavior

nvidia-smi lists the attached NVIDIA A100 GPU, and torch.cuda.is_available() returns True.

Actual behavior

$ nvidia-smi
No devices were found
$ python
Python 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False

Version (ghpc --version)

$ ghpc --version
ghpc version v1.34.0
Built from 'main' branch.
Commit info: v1.34.0-0-g5b360ae6

Blueprint

If applicable, attach or paste the blueprint YAML used to produce the bug.

# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

---
blueprint_name: ml-slurm-v6

vars:
  project_id: ## Set project id here
  deployment_name: ml-training-v6
  region: us-central1
  zone: us-central1-a
  new_image:
    family: ml-training
    project: $(vars.project_id)
  disk_size_gb: 32
  enable_cleanup_compute: true

# Recommended to use GCS backend for Terraform state
# See https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/examples#optional-setting-up-a-remote-terraform-state
#
# terraform_backend_defaults:
#  type: gcs
#  configuration:
#    bucket: <<BUCKET_NAME>>

deployment_groups:
  - group: primary
    modules:
      # Source is an embedded module, denoted by "modules/*" without ./, ../, /
      # as a prefix. To refer to a local module, prefix with ./, ../ or /
      # Example - ./modules/network/vpc
      - id: network
        source: modules/network/vpc

      - id: homefs
        source: modules/file-system/filestore
        use:
          - network
        settings:
          local_mount: /home
          # size_gb: 2560
          # filestore_tier: BASIC_SSD

      - id: script
        source: modules/scripts/startup-script
        settings:
          runners:
            - type: shell
              destination: install-ml-libraries.sh
              content: |
                #!/bin/bash
                # this script is designed to execute on Slurm images published by SchedMD that:
                # - are based on Debian 11 distribution of Linux
                # - have NVIDIA Drivers v530 pre-installed
                # - have CUDA Toolkit 12.1 pre-installed.

                set -e -o pipefail

                echo "deb https://packages.cloud.google.com/apt google-fast-socket main" > /etc/apt/sources.list.d/google-fast-socket.list
                apt-get update --allow-releaseinfo-change
                apt-get install --assume-yes google-fast-socket

                CONDA_BASE=/opt/conda

                if [ -d $CONDA_BASE ]; then
                        exit 0
                fi

                DL_DIR=\$(mktemp -d)
                cd $DL_DIR
                curl -O https://repo.anaconda.com/miniconda/Miniconda3-py310_23.3.1-0-Linux-x86_64.sh
                HOME=$DL_DIR bash Miniconda3-py310_23.3.1-0-Linux-x86_64.sh -b -p $CONDA_BASE
                cd -
                rm -rf $DL_DIR
                unset DL_DIR

                source $CONDA_BASE/bin/activate base
                conda init --system
                conda config --system --set auto_activate_base False
                # following channel ordering is important! use strict_priority!
                conda config --system --set channel_priority strict
                conda config --system --remove channels defaults
                conda config --system --add channels conda-forge
                conda config --system --add channels nvidia

                conda update -n base conda --yes

                ### create a virtual environment for pytorch
                conda create -n pytorch python=3.10 --yes
                conda activate pytorch
                conda config --env --add channels pytorch
                conda install -n pytorch pytorch torchvision torchaudio pytorch-cuda=12.1 --yes
                pip install -q Cython

  - group: packer
    modules:
      - id: custom-image
        source: modules/packer/custom-image
        kind: packer
        use:
          - network
          - script
        settings:
          # give VM a public IP to ensure startup script can reach public internet
          # w/o new VPC
          omit_external_ip: false
          source_image_project_id: [schedmd-slurm-public]
          # see latest in https://github.com/GoogleCloudPlatform/slurm-gcp/blob/master/docs/images.md#published-image-family
          source_image_family: slurm-gcp-6-5-debian-11
          # You can find size of source image by using following command
          # gcloud compute images describe-from-family <source_image_family> --project schedmd-slurm-public
          disk_size: $(vars.disk_size_gb)
          image_family: $(vars.new_image.family)
          # building this image does not require a GPU-enabled VM
          machine_type: n2-standard-4
          state_timeout: 15m

  - group: cluster
    modules:
      - id: examples
        source: modules/scripts/startup-script
        settings:
          runners:
            - type: data
              destination: /var/tmp/torch_test.sh
              content: |
                #!/bin/bash
                source /etc/profile.d/conda.sh
                conda activate pytorch
                python3 torch_test.py
            - type: data
              destination: /var/tmp/torch_test.py
              content: |
                import torch
                import torch.utils.benchmark as benchmark

                def batched_dot_mul_sum(a, b):
                    '''Computes batched dot by multiplying and summing'''
                    return a.mul(b).sum(-1)

                def batched_dot_bmm(a, b):
                    '''Computes batched dot by reducing to bmm'''
                    a = a.reshape(-1, 1, a.shape[-1])
                    b = b.reshape(-1, b.shape[-1], 1)
                    return torch.bmm(a, b).flatten(-3)

                # use GPU if available, else CPU
                device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
                print('Using device:', device)
                if device.type == 'cuda':
                    print(torch.cuda.get_device_name(0))

                # benchmarking
                x = torch.randn(10000, 64)
                t0 = benchmark.Timer(
                    stmt='batched_dot_mul_sum(x, x)',
                    setup='from __main__ import batched_dot_mul_sum',
                    globals={'x': x})
                t1 = benchmark.Timer(
                    stmt='batched_dot_bmm(x, x)',
                    setup='from __main__ import batched_dot_bmm',
                    globals={'x': x})
                print(t0.timeit(100))
                print(t1.timeit(100))

      - id: a100_40g_1_nodeset
        source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
        use: [network]
        settings:
          node_count_dynamic_max: 4
          bandwidth_tier: gvnic_enabled
          machine_type: a2-highgpu-1g
          instance_image: $(vars.new_image)
          instance_image_custom: true
          preemptible: true

      - id: a100_40g_1_partition
        source: community/modules/compute/schedmd-slurm-gcp-v6-partition
        use: [a100_40g_1_nodeset]
        settings:
          partition_name: a10040g1gpu
          is_default: true

      - id: a100_40g_4_nodeset
        source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
        use: [network]
        settings:
          node_count_dynamic_max: 4
          bandwidth_tier: gvnic_enabled
          machine_type: a2-highgpu-4g
          instance_image: $(vars.new_image)
          instance_image_custom: true
          preemptible: true

      - id: a100_40g_4_partition
        source: community/modules/compute/schedmd-slurm-gcp-v6-partition
        use: [a100_40g_4_nodeset]
        settings:
          partition_name: a10040g4gpu
          is_default: true

      - id: a100_40g_8_nodeset
        source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
        use: [network]
        settings:
          node_count_dynamic_max: 4
          bandwidth_tier: gvnic_enabled
          machine_type: a2-highgpu-8g
          instance_image: $(vars.new_image)
          instance_image_custom: true
          preemptible: true

      - id: a100_40g_8_partition
        source: community/modules/compute/schedmd-slurm-gcp-v6-partition
        use: [a100_40g_8_nodeset]
        settings:
          partition_name: a10040g8gpu
          is_default: true

      - id: a100_40g_16_nodeset
        source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
        use: [network]
        settings:
          node_count_dynamic_max: 4
          bandwidth_tier: gvnic_enabled
          machine_type: a2-megagpu-16g
          instance_image: $(vars.new_image)
          instance_image_custom: true
          preemptible: true

      - id: a100_40g_16_partition
        source: community/modules/compute/schedmd-slurm-gcp-v6-partition
        use: [a100_40g_16_nodeset]
        settings:
          partition_name: a10040g16gpu
          is_default: true

      - id: g2_nodeset
        source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
        use: [network]
        settings:
          node_count_dynamic_max: 20
          enable_placement: false
          bandwidth_tier: gvnic_enabled
          machine_type: g2-standard-4
          instance_image: $(vars.new_image)
          instance_image_custom: true

      - id: g2_partition
        source: community/modules/compute/schedmd-slurm-gcp-v6-partition
        use: [g2_nodeset]
        settings:
          partition_name: g2
          exclusive: false

      - id: slurm_login
        source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
        use: [network]
        settings:
          machine_type: n2-standard-4
          name_prefix: "login"
          enable_login_public_ips: true
          instance_image: $(vars.new_image)
          instance_image_custom: true

      - id: slurm_controller
        source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
        use:
          - network
          - a100_40g_1_partition
          - a100_40g_4_partition
          - a100_40g_8_partition
          - a100_40g_16_partition
          - g2_partition
          - homefs
          - slurm_login
        settings:
          machine_type: n2-standard-4
          enable_controller_public_ips: true
          instance_image: $(vars.new_image)
          instance_image_custom: true
          login_startup_script: $(examples.startup_script)


harshthakkar01 commented 3 months ago

Hi,

Can you try specifying --gpus=X or --gpus-per-node=Y on the srun command when you start the A2 instance? You can find the reference here: https://slurm.schedmd.com/srun.html#OPT_gpus
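
For example, requesting the GPU explicitly for the interactive job might look like this (a sketch using the partition name from the blueprint above):

$ srun --partition a10040g1gpu --gpus=1 --pty bash -i
$ nvidia-smi   # the allocated A100 should now be listed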

msis commented 3 months ago

That solves it. I thought that because of the instance type there was no need to request the GPU.

I can confirm that setting --gpus (or --gres) does the job and the GPUs are visible.
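
For reference, the equivalent --gres form of the request would look something like this (a sketch; gpu:1 matches the single GPU on a2-highgpu-1g):

$ srun --partition a10040g1gpu --gres=gpu:1 --pty bash -i
$ python -c "import torch; print(torch.cuda.is_available())"   # should print True inside the allocation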