AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0

Allocated 8G, but actual usage is not limited to 8G #158

Open · momomobinx opened this issue 3 years ago

momomobinx commented 3 years ago

I deployed a single-node Kubernetes cluster on my own server, installed and configured the Aliyun GPU sharing plugin following the tutorial, and then deployed Kubeflow. I created a notebook in it configured with 8G of GPU memory, but in actual use it still occupies almost all of the memory of one GPU card. Or is it expected that, when no other tasks are running, it will simply fill up and use the whole card?

Output of nvidia-smi:

root@chnlj-Super-Server:~# nvidia-smi
Wed Aug 25 14:20:43 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.84       Driver Version: 460.84       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:02:00.0 Off |                  N/A |
| 13%   44C    P2    60W / 257W |  10571MiB / 11019MiB |     10%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:03:00.0 Off |                  N/A |
| 13%   32C    P8     2W / 257W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:82:00.0 Off |                  N/A |
| 13%   31C    P8    21W / 257W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      9750      C   /opt/conda/bin/python3            471MiB |
|    0   N/A  N/A     27519      C   /opt/conda/bin/python3          10097MiB |
+-----------------------------------------------------------------------------+

Output of kubectl-inspect-gpushare:


root@chnlj-Super-Server:/kubeflowData/pv/pv5# kubectl-inspect-gpushare 
NAME                IPADDRESS     GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU2(Allocated/Total)  GPU Memory(GiB)
chnlj-super-server  172.16.15.34  8/10                   0/10                   0/10                   8/30
------------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
8/30 (26%) 

kubectl describe pod luk-test-gpu-notebook-0 -nkubeflow-user-example-com

root@chnlj-Super-Server:~# kubectl describe pod luk-test-gpu-notebook-0 -nkubeflow-user-example-com
Name:         luk-test-gpu-notebook-0
Namespace:    kubeflow-user-example-com
Priority:     0
Node:         chnlj-super-server/172.16.15.34
Start Time:   Wed, 25 Aug 2021 14:02:27 +0800
Labels:       app=luk-test-gpu-notebook
              controller-revision-hash=luk-test-gpu-notebook-68745bcf4c
              istio.io/rev=default
              notebook-name=luk-test-gpu-notebook
              security.istio.io/tlsMode=istio
              service.istio.io/canonical-name=luk-test-gpu-notebook
              service.istio.io/canonical-revision=latest
              statefulset=luk-test-gpu-notebook
              statefulset.kubernetes.io/pod-name=luk-test-gpu-notebook-0
Annotations:  ALIYUN_COM_GPU_MEM_ASSIGNED: true
              ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1629871347742396836
              ALIYUN_COM_GPU_MEM_DEV: 10
              ALIYUN_COM_GPU_MEM_IDX: 0
              ALIYUN_COM_GPU_MEM_POD: 8
              kubectl.kubernetes.io/default-logs-container: luk-test-gpu-notebook
              prometheus.io/path: /stats/prometheus
              prometheus.io/port: 15020
              prometheus.io/scrape: true
              sidecar.istio.io/status:
                {"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istiod-ca-cert"],"ima...
Status:       Running
IP:           10.244.0.168
IPs:
  IP:           10.244.0.168
Controlled By:  StatefulSet/luk-test-gpu-notebook
Init Containers:
  istio-init:
    Container ID:  docker://fb5d0fdf3b74d116decffc03b0949a3df426dadfca4b7deefd76eaaffd7916a2
    Image:         docker.io/istio/proxyv2:1.9.0
    Image ID:      docker-pullable://istio/proxyv2@sha256:286b821197d7a9233d1d889119f090cd9a9394468d3a312f66ea24f6e16b2294
    Port:          <none>
    Host Port:     <none>
    Args:
      istio-iptables
      -p
      15001
      -z
      15006
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x

      -b
      *
      -d
      15090,15021,15020
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 25 Aug 2021 14:02:32 +0800
      Finished:     Wed, 25 Aug 2021 14:02:33 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:        10m
      memory:     40Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-editor-token-7h5c7 (ro)
Containers:
  luk-test-gpu-notebook:
    Container ID:   docker://459637e1f4fdd52a06840dcd09aa19c4ed662c9a4bb005673a044b0d0b7cc948
    Image:          public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/jupyter-tensorflow-cuda-full:v1.3.0-rc.0
    Image ID:       docker-pullable://public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/jupyter-tensorflow-cuda-full@sha256:4b3f2dbf8fca0de3451a98d628700e4249e2a21ccb52db1853d4a2904e31e9a2
    Port:           8888/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Wed, 25 Aug 2021 14:02:34 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      aliyun.com/gpu-mem:  8
    Requests:
      aliyun.com/gpu-mem:  8
      cpu:                 1
      memory:              8Gi
    Environment:
      NB_PREFIX:  /notebook/kubeflow-user-example-com/luk-test-gpu-notebook
    Mounts:
      /dev/shm from dshm (rw)
      /home/jovyan from luk-test-gpu-notebook-ws (rw)
      /home/jovyan/kubeflow-data from kubeflow-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-editor-token-7h5c7 (ro)
  istio-proxy:
    Container ID:  docker://9e5e4a6495b3ab40a9b23e9de0b7a375f3e8c4539f6ea03501c96a5204961018
    Image:         docker.io/istio/proxyv2:1.9.0
    Image ID:      docker-pullable://istio/proxyv2@sha256:286b821197d7a9233d1d889119f090cd9a9394468d3a312f66ea24f6e16b2294
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --serviceCluster
      luk-test-gpu-notebook.$(POD_NAMESPACE)
      --proxyLogLevel=warning
      --proxyComponentLogLevel=misc:error
      --log_output_level=default:info
      --concurrency
      2
    State:          Running
      Started:      Wed, 25 Aug 2021 14:02:38 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:      10m
      memory:   40Mi
    Readiness:  http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
    Environment:
      JWT_POLICY:                    first-party-jwt
      PILOT_CERT_PROVIDER:           istiod
      CA_ADDR:                       istiod.istio-system.svc:15012
      POD_NAME:                      luk-test-gpu-notebook-0 (v1:metadata.name)
      POD_NAMESPACE:                 kubeflow-user-example-com (v1:metadata.namespace)
      INSTANCE_IP:                    (v1:status.podIP)
      SERVICE_ACCOUNT:                (v1:spec.serviceAccountName)
      HOST_IP:                        (v1:status.hostIP)
      CANONICAL_SERVICE:              (v1:metadata.labels['service.istio.io/canonical-name'])
      CANONICAL_REVISION:             (v1:metadata.labels['service.istio.io/canonical-revision'])
      PROXY_CONFIG:                  {}

      ISTIO_META_POD_PORTS:          [
                                         {"name":"notebook-port","containerPort":8888,"protocol":"TCP"}
                                     ]
      ISTIO_META_APP_CONTAINERS:     luk-test-gpu-notebook
      ISTIO_META_CLUSTER_ID:         Kubernetes
      ISTIO_META_INTERCEPTION_MODE:  REDIRECT
      ISTIO_META_WORKLOAD_NAME:      luk-test-gpu-notebook
      ISTIO_META_OWNER:              kubernetes://apis/apps/v1/namespaces/kubeflow-user-example-com/statefulsets/luk-test-gpu-notebook
      ISTIO_META_MESH_ID:            cluster.local
      TRUST_DOMAIN:                  cluster.local
    Mounts:
      /etc/istio/pod from istio-podinfo (rw)
      /etc/istio/proxy from istio-envoy (rw)
      /var/lib/istio/data from istio-data (rw)
      /var/run/secrets/istio from istiod-ca-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-editor-token-7h5c7 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  istio-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  istio-podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
      limits.cpu -> cpu-limit
      requests.cpu -> cpu-request
  istiod-ca-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio-ca-root-cert
    Optional:  false
  luk-test-gpu-notebook-ws:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  luk-test-gpu-notebook-ws
    ReadOnly:   false
  kubeflow-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  kubeflow-data
    ReadOnly:   false
  dshm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  default-editor-token-7h5c7:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-editor-token-7h5c7
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  23m   default-scheduler  0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
  Warning  FailedScheduling  23m   default-scheduler  0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
  Normal   Scheduled         23m   default-scheduler  Successfully assigned kubeflow-user-example-com/luk-test-gpu-notebook-0 to chnlj-super-server
  Normal   Pulling           23m   kubelet            Pulling image "docker.io/istio/proxyv2:1.9.0"
  Normal   Pulled            22m   kubelet            Successfully pulled image "docker.io/istio/proxyv2:1.9.0" in 3.956645911s
  Normal   Created           22m   kubelet            Created container istio-init
  Normal   Started           22m   kubelet            Started container istio-init
  Normal   Pulled            22m   kubelet            Container image "public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/jupyter-tensorflow-cuda-full:v1.3.0-rc.0" already present on machine
  Normal   Created           22m   kubelet            Created container luk-test-gpu-notebook
  Normal   Started           22m   kubelet            Started container luk-test-gpu-notebook
  Normal   Pulling           22m   kubelet            Pulling image "docker.io/istio/proxyv2:1.9.0"
  Normal   Pulled            22m   kubelet            Successfully pulled image "docker.io/istio/proxyv2:1.9.0" in 4.057881696s
  Normal   Created           22m   kubelet            Created container istio-proxy
  Normal   Started           22m   kubelet            Started container istio-proxy
nicozhang commented 3 years ago

The memory shown by nvidia-smi doesn't seem to be what is actually used. I ran into this problem before too. By default, TF 2.0 reserves all remaining GPU memory, but it doesn't necessarily use it. I later switched to dynamic allocation, so it only allocates as much as it actually uses. In that case, if the memory actually used exceeds the amount allocated by gpushare, the cgroup kills the process.
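
For reference, a minimal sketch of that dynamic-allocation setting in TF 2.x (standard tf.config API; the 8192 MiB cap is an illustrative value matching the 8 GiB gpushare allocation in this issue):

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')

# Option A: grow-as-needed allocation instead of TF 2.x's default of
# mapping (nearly) all free memory on every visible GPU up front.
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# Option B (use instead of Option A; the two are mutually exclusive):
# hard-cap TF at 8 GiB so it cannot exceed the gpushare allocation.
# if gpus:
#     tf.config.set_logical_device_configuration(
#         gpus[0],
#         [tf.config.LogicalDeviceConfiguration(memory_limit=8192)])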

momomobinx commented 3 years ago

> The memory shown by nvidia-smi doesn't seem to be what is actually used. I ran into this problem before too. By default, TF 2.0 reserves all remaining GPU memory, but it doesn't necessarily use it. I later switched to dynamic allocation, so it only allocates as much as it actually uses. In that case, if the memory actually used exceeds the amount allocated by gpushare, the cgroup kills the process.

According to the official example, when gpushare allocates 3G, nvidia-smi inside the TF 2.0 container shows 3G and usage does not exceed that 3G; checking on the host (assuming the host card has 10G) should likewise show only 3G of the 10G total in use.

momomobinx commented 3 years ago

> The memory shown by nvidia-smi doesn't seem to be what is actually used. I ran into this problem before too. By default, TF 2.0 reserves all remaining GPU memory, but it doesn't necessarily use it. I later switched to dynamic allocation, so it only allocates as much as it actually uses. In that case, if the memory actually used exceeds the amount allocated by gpushare, the cgroup kills the process.

See this example: Run a GPU sharing instance
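
The pattern in that example is for the application to cap itself at the allocated amount. A minimal sketch of the idea, assuming the gpushare device plugin injects the per-container GPU memory (in GiB) as the ALIYUN_COM_GPU_MEM_CONTAINER environment variable (name inferred from the pod annotations above; verify with env inside your own container):

import os
import tensorflow as tf

# GPU memory granted to this container, in GiB. The env var name is an
# assumption; check the actual variables exposed in your pod.
allocated_gib = int(os.environ.get('ALIYUN_COM_GPU_MEM_CONTAINER', '8'))

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Cap TensorFlow at the granted share so it does not map the whole card.
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=allocated_gib * 1024)])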

wsxiaozhang commented 2 years ago

The gpushare scheduler is responsible for scheduling jobs across the cluster with GPU memory as the resource unit, i.e. finding which GPU card on which node can still provide the amount of GPU memory a job requires. Once the job's pod has been scheduled onto a node, a suitable GPU card is bound into the container, and at that point scheduling is complete. If you also need to limit how much GPU memory a process actually uses inside the container, that requires GPU isolation, which is beyond the scheduler's scope. For isolating GPU memory on a single card within a node, see Alibaba Cloud's cGPU, NVIDIA's MPS, or MIG on the NVIDIA A100, etc.

AntyRia commented 11 months ago

> Alibaba Cloud's cGPU

Hello, then what role does the configuration in the YAML file actually play?

hiahia121 commented 9 months ago

gpu_mem_no_limit

In actual use, it is not being limited.

ferris-cx commented 2 months ago

> The gpushare scheduler is responsible for scheduling jobs across the cluster with GPU memory as the resource unit, i.e. finding which GPU card on which node can still provide the amount of GPU memory a job requires. Once the job's pod has been scheduled onto a node, a suitable GPU card is bound into the container, and at that point scheduling is complete. If you also need to limit how much GPU memory a process actually uses inside the container, that requires GPU isolation, which is beyond the scheduler's scope. For isolating GPU memory on a single card within a node, see Alibaba Cloud's cGPU, NVIDIA's MPS, or MIG on the NVIDIA A100, etc.

Is Alibaba's cGPU solution open source?

ferris-cx commented 2 months ago

This project most likely just uses the Kubernetes device plugin mechanism to report GPU resources (number of cards and GPU memory), and then uses the Kubernetes scheduler extension mechanism to add a custom scheduler that places a pod onto a specific card of a specific node. As for enforcement, that would be something like a cgroups-style mechanism, ideally implemented elegantly at the kernel + GPU driver level, or alternatively via user-space CUDA call interception. Personally I lean toward the former. NVIDIA's MIG is another option, but MIG only supports a limited range of GPUs.
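
To illustrate the scheduler-extension part of that description: kube-scheduler can be configured to call an HTTP extender whose filter endpoint removes nodes that cannot provide the requested GPU memory. A deliberately simplified Python sketch (not the project's actual Go implementation; the free-memory table is a stand-in, and the JSON field names follow the kube-scheduler extender v1 API as I recall it, so verify them against your scheduler version):

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy view of remaining GPU memory (GiB) per node. The real extender derives
# this from the device plugin's reported capacity minus existing allocations.
FREE_GPU_MEM = {"chnlj-super-server": 22}

class FilterHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        args = json.loads(body)  # ExtenderArgs sent by kube-scheduler

        # Sum the aliyun.com/gpu-mem limits over the pod's containers.
        requested = sum(
            int(c.get("resources", {}).get("limits", {}).get("aliyun.com/gpu-mem", 0))
            for c in args["pod"]["spec"]["containers"])

        node_names = args.get("nodenames") or [
            n["metadata"]["name"] for n in (args.get("nodes") or {}).get("items", [])]

        # Keep only nodes whose GPUs can still provide the requested memory.
        fits = [n for n in node_names if FREE_GPU_MEM.get(n, 0) >= requested]
        failed = {n: "insufficient GPU memory" for n in node_names if n not in fits}

        payload = json.dumps(
            {"nodenames": fits, "failedNodes": failed, "error": ""}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    # Port is arbitrary here; it must match the extender URL configured
    # in the kube-scheduler policy/configuration.
    HTTPServer(("", 39999), FilterHandler).serve_forever()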