4paradigm / k8s-vgpu-scheduler

The OpenAIOS vGPU device plugin for Kubernetes originated from the OpenAIOS project to virtualize GPU device memory, allowing applications to access a larger memory space than the device physically provides. It is designed to make extended device memory easy to use for AI workloads.
Apache License 2.0

Device memory isolation #12

Closed. alexk1028 closed this issue 2 years ago.

alexk1028 commented 2 years ago

We are currently experimenting with this project on an internal test cluster. The cluster version information is:

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:38:50Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

The Docker version is:

Client: Docker Engine - Community
 Version:           20.10.10
 API version:       1.41
 Go version:        go1.16.9
 Git commit:        b485636
 Built:             Mon Oct 25 07:42:59 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.9
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.8
  Git commit:       79ea9d3
  Built:            Mon Oct 4 16:06:37 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.11
  GitCommit:        5b46e404f6b9f661a205e28d59c982d3634148f8
 nvidia:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

The YAML used to deploy the GPU plugin is:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      nodeSelector:
        nvidia-device-enable: enable
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: 4pdosc/k8s-device-plugin:latest
        # - image: m7-ieg-pico-test01:5000/k8s-device-plugin-test:v0.9.0-ubuntu20.04
        imagePullPolicy: Always
        name: nvidia-device-plugin-ctr
        args: ["--fail-on-init-error=true", "--device-split-count=3", "--device-memory-scaling=1", "--device-cores-scaling=1"]
        env:
        - name: PCIBUSFILE
          value: "/usr/local/vgpu/pciinfo.vgpu"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
          - name: vgpu-dir
            mountPath: /usr/local/vgpu
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: vgpu-dir
          hostPath:
            path: /usr/local/vgpu
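
For reference, here is a minimal sketch of a pod that consumes one of the split devices, assuming the plugin advertises the slices under the usual nvidia.com/gpu resource name; the pod name and CUDA base image are illustrative, not taken from this thread:

apiVersion: v1
kind: Pod
metadata:
  name: vgpu-smi-check            # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: smi
    image: nvidia/cuda:10.2-base  # any CUDA 10.2 image that ships nvidia-smi
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1         # one vGPU slice out of the 3 per physical card

With --device-split-count=3 and --device-memory-scaling=1, kubectl logs for this pod would be expected to report roughly a third of the card's 4041 MiB as the memory ceiling; the exact split is an assumption on our side rather than something stated in this thread.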

The GPU driver information is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 970     Off  | 00000000:01:00.0 Off |                  N/A |
| 36%   33C    P8    27W / 200W |      0MiB /  4041MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

We found that after the GPU is successfully split in the cluster, starting different pods that use the vGPUs does not seem to give us device memory isolation, and the pods interfere with each other when they train at the same time. Is this caused by my CUDA version, or do we in fact not have memory isolation?

Thanks a lot.

archlitchi commented 2 years ago

Under normal circumstances device memory isolation is in place. Different pods affecting each other during training is probably because they are competing for compute. Have you tried running nvidia-smi inside the container during training to see whether it goes over its memory limit?
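
One way to probe the ceiling directly, sketched below under the assumption that each slice exposes roughly device-memory-scaling * 4041 MiB / device-split-count (about 1.3 GiB with the args above), is to run a pod that deliberately tries to allocate more than its share. The pod name, image tag, and the 2 GiB figure are illustrative, not from this thread.

apiVersion: v1
kind: Pod
metadata:
  name: vgpu-oom-test             # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: alloc
    image: pytorch/pytorch:1.9.0-cuda10.2-cudnn7-runtime
    command:
      - python
      - -c
      - |
        # Try to grab ~2 GiB on a slice that should only expose ~1.3 GiB.
        # With device memory isolation enforced this should fail with a CUDA
        # out-of-memory error; without it, the allocation would land on the
        # full 4 GiB card and succeed.
        import torch
        x = torch.empty(2 * 1024**3, dtype=torch.uint8, device="cuda")
        print("allocated", x.numel(), "bytes")
    resources:
      limits:
        nvidia.com/gpu: 1

The outcome can then be cross-checked against nvidia-smi inside the training container, as suggested above.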

alexk1028 commented 2 years ago

> Under normal circumstances device memory isolation is in place. Different pods affecting each other during training is probably because they are competing for compute. Have you tried running nvidia-smi inside the container during training to see whether it goes over its memory limit?

Inside the container I used watch nvidia-smi to check, and it did not exceed the memory limit, thanks. One more question: why does the output of watch nvidia-smi on the host differ from the output of watch nvidia-smi inside the container?

archlitchi commented 2 years ago

Because the device memory shown inside the container is accounted by the plugin, it will differ from the host by a few hundred MB. The gap is mainly device memory used to manage contexts, for which NVIDIA provides no query interface, so the plugin cannot count it.

alexk1028 commented 2 years ago

> Because the device memory shown inside the container is accounted by the plugin, it will differ from the host by a few hundred MB. The gap is mainly device memory used to manage contexts, for which NVIDIA provides no query interface, so the plugin cannot count it.

Thanks, that completely clears up my confusion.