GoogleCloudPlatform / container-engine-accelerators

Collection of tools and examples for managing Accelerated workloads in Kubernetes Engine

Is there a solution to make all GPU devices visible to a pod that does not request `nvidia.com/gpu`? #239

Open tingweiwu opened 2 years ago

tingweiwu commented 2 years ago

When I use NVIDIA/k8s-device-plugin in my k8s cluster, I set NVIDIA_VISIBLE_DEVICES=all in the pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - args:
    - -c
    - top -b
    command:
    - /bin/sh
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    image: cuda:10.2-cudnn7-devel-ubuntu18.04
    name: test
    resources:
      limits:
        cpu: 150m
        memory: 200Mi
      requests:
        cpu: 100m
        memory: 200Mi

The devices.list file under /sys/fs/cgroup/devices/kubepods/burstable/podxxxxxx/xxxxxx/devices.list lists every GPU device on this node.

I noticed that this GCE container-engine-accelerators project doesn't require nvidia-docker, so NVIDIA_VISIBLE_DEVICES may not work. Is there a solution to make all GPU devices visible to a pod that does not request nvidia.com/gpu?

DavraYoung commented 1 year ago

Check how GKE time-slicing works; I was able to share a single GPU across multiple workloads.

Here is my Terraform:

```
resource "google_container_node_pool" "gpu" {
  name     = "gpu"
  location = var.zone
  cluster  = var.cluster_name
  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }
  initial_node_count = 1

  management {
    auto_repair  = "true"
    auto_upgrade = "true"
  }

  node_config {
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
      "https://www.googleapis.com/auth/devstorage.read_only",
      "https://www.googleapis.com/auth/trace.append",
      "https://www.googleapis.com/auth/service.management.readonly",
      "https://www.googleapis.com/auth/servicecontrol",
    ]
    guest_accelerator {
      type  = var.gpu_type
      count = 1
      gpu_sharing_config {
        gpu_sharing_strategy = "TIME_SHARING"
        max_shared_clients_per_gpu = 2
      }

    }
    image_type = "UBUNTU_CONTAINERD"

    labels = {
      env        = var.project
      node-group = "gpu"
      "cloud.google.com/gke-max-shared-clients-per-node" = "2"
    }

    preemptible  = true
    machine_type = "n1-standard-4"
    tags         = ["gke-node", "${var.cluster_name}-gke"]
    metadata     = {
      disable-legacy-endpoints = "true"
    }
  }
}
```

Notice the `cloud.google.com/gke-max-shared-clients-per-node` label.
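
A minimal pod sketch for consuming one of the shared slices on this node pool; the nodeSelector label here is an assumption based on GKE's documented time-sharing node labels, so adjust it to your cluster:

```yaml
# Sketch only: a pod that requests one slice of the time-shared GPU.
# The nodeSelector label is assumed from GKE's time-sharing documentation
# and may differ in your cluster.
apiVersion: v1
kind: Pod
metadata:
  name: timeshared-gpu-test
spec:
  nodeSelector:
    cloud.google.com/gke-gpu-sharing-strategy: time-sharing
  containers:
  - name: cuda
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["/bin/sh", "-c", "nvidia-smi && sleep 3600"]
    resources:
      limits:
        nvidia.com/gpu: 1
```
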
VelorumS commented 11 months ago

@DavraYoung but how to do it without time-sharing or multi-instance GPUs?

We were able to have all GPUs visible to all Docker containers running on the instance.

And it seems that in k8s setting `nvidia.com/gpu: 0` works: http://www.bytefold.com/sharing-gpu-in-kubernetes/

You can set the `nvidia.com/gpu` value to 0 and the workload will still be able to see all the GPUs available on the instance. It also does not reserve the GPU in Kubernetes, so more workloads can be scheduled on that node.

resources:
  limits:
    nvidia.com/gpu: 0 # This will work fine and will not block your GPU for other workloads.
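
In context, a full manifest using this trick might look like the sketch below (untested here; whether the GPUs are actually visible still depends on the device plugin and container runtime configuration on the node):

```yaml
# Sketch: a pod that sets nvidia.com/gpu to 0 so it schedules without
# reserving a GPU, while still (depending on the runtime/device-plugin
# configuration on the node) seeing all GPUs.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-visible-no-request
spec:
  containers:
  - name: test
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["/bin/sh", "-c", "nvidia-smi && sleep 3600"]
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    resources:
      limits:
        nvidia.com/gpu: 0
```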