4paradigm / k8s-vgpu-scheduler

The OpenAIOS vGPU device plugin for Kubernetes originated from the OpenAIOS project to virtualize GPU device memory, allowing applications to access a larger memory space than the device physically provides. It is designed to make extended device memory easy to use for AI workloads.
Apache License 2.0

Device memory isolation #12

Closed. alexk1028 closed this issue 2 years ago.

alexk1028 commented 2 years ago

We are currently experimenting with this project on an internal test cluster. The cluster version information is:

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:38:50Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

The Docker version is:

Client: Docker Engine - Community
 Version:           20.10.10
 API version:       1.41
 Go version:        go1.16.9
 Git commit:        b485636
 Built:             Mon Oct 25 07:42:59 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.9
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.8
  Git commit:       79ea9d3
  Built:            Mon Oct 4 16:06:37 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.11
  GitCommit:        5b46e404f6b9f661a205e28d59c982d3634148f8
 nvidia:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

The YAML used to deploy the GPU plugin is:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      nodeSelector:
        nvidia-device-enable: enable
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: 4pdosc/k8s-device-plugin:latest
        # - image: m7-ieg-pico-test01:5000/k8s-device-plugin-test:v0.9.0-ubuntu20.04
        imagePullPolicy: Always
        name: nvidia-device-plugin-ctr
        args: ["--fail-on-init-error=true", "--device-split-count=3", "--device-memory-scaling=1", "--device-cores-scaling=1"]
        env:
        - name: PCIBUSFILE
          value: "/usr/local/vgpu/pciinfo.vgpu"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
          - name: vgpu-dir
            mountPath: /usr/local/vgpu
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: vgpu-dir
          hostPath:
            path: /usr/local/vgpu
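
For reference, here is a minimal sketch of a pod that consumes one of the split devices, assuming the plugin advertises the slices under the usual nvidia.com/gpu resource name; the pod name and CUDA base image are illustrative, not taken from this thread:

apiVersion: v1
kind: Pod
metadata:
  name: vgpu-smi-check            # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: smi
    image: nvidia/cuda:10.2-base  # any CUDA 10.2 image that ships nvidia-smi
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1         # one vGPU slice out of the 3 per physical card

With --device-split-count=3 and --device-memory-scaling=1, kubectl logs for this pod would be expected to report roughly a third of the card's 4041 MiB as the memory ceiling; the exact split is an assumption on our side rather than something stated in this thread.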

The GPU driver information is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 970     Off  | 00000000:01:00.0 Off |                  N/A |
| 36%   33C    P8    27W / 200W |      0MiB /  4041MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

We found that after the GPU is successfully split in the cluster, starting different pods that use the vGPUs does not seem to give us device memory isolation, and the pods interfere with each other when they train at the same time. Is this caused by my CUDA version, or do we in fact not have memory isolation?

Thanks a lot.

archlitchi commented 2 years ago

Under normal circumstances device memory isolation is in place. Different pods affecting each other during training is probably because they are competing for compute. Have you tried running nvidia-smi inside the container during training to see whether it goes over its memory limit?
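
One way to probe the ceiling directly, sketched below under the assumption that each slice exposes roughly device-memory-scaling * 4041 MiB / device-split-count (about 1.3 GiB with the args above), is to run a pod that deliberately tries to allocate more than its share. The pod name, image tag, and the 2 GiB figure are illustrative, not from this thread.

apiVersion: v1
kind: Pod
metadata:
  name: vgpu-oom-test             # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: alloc
    image: pytorch/pytorch:1.9.0-cuda10.2-cudnn7-runtime
    command:
      - python
      - -c
      - |
        # Try to grab ~2 GiB on a slice that should only expose ~1.3 GiB.
        # With device memory isolation enforced this should fail with a CUDA
        # out-of-memory error; without it, the allocation would land on the
        # full 4 GiB card and succeed.
        import torch
        x = torch.empty(2 * 1024**3, dtype=torch.uint8, device="cuda")
        print("allocated", x.numel(), "bytes")
    resources:
      limits:
        nvidia.com/gpu: 1

The outcome can then be cross-checked against nvidia-smi inside the training container, as suggested above.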

alexk1028 commented 2 years ago

> Under normal circumstances device memory isolation is in place. Different pods affecting each other during training is probably because they are competing for compute. Have you tried running nvidia-smi inside the container during training to see whether it goes over its memory limit?

Inside the container I used watch nvidia-smi to check, and it did not exceed the memory limit, thanks. One more question: why does the output of watch nvidia-smi on the host differ from the output of watch nvidia-smi inside the container?

archlitchi commented 2 years ago

Because the device memory shown inside the container is accounted by the plugin, it will differ from the host by a few hundred MB. The gap is mainly device memory used to manage contexts, for which NVIDIA provides no query interface, so the plugin cannot count it.

alexk1028 commented 2 years ago

> Because the device memory shown inside the container is accounted by the plugin, it will differ from the host by a few hundred MB. The gap is mainly device memory used to manage contexts, for which NVIDIA provides no query interface, so the plugin cannot count it.

Thanks, that completely clears up my confusion.