NTHU-LSALAB / KubeShare

Share GPU between Pods in Kubernetes

some problems while testing resource isolation #19

Open y-ykcir opened 1 year ago

y-ykcir commented 1 year ago

Hi, I ran into some problems while testing resource isolation. KubeShare itself seems to be running normally, but the isolation specified by the annotations does not have the expected effect.

My Environment

Resource isolation test

SharePod spec file:

apiVersion: kubeshare.nthu/v1
kind: SharePod
metadata:
  name: sharepod1
  annotations:
    "kubeshare/gpu_request": "0.5"
    "kubeshare/gpu_limit": "0.6"
    "kubeshare/gpu_mem": "10485760000"
spec:
  terminationGracePeriodSeconds: 0
  containers:
  - name: tensorflow-benchmark
    image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.4.0
    command:
    - bash
    - run.sh
    - --num_batches=50000
    - --batch_size=8
    workingDir: /root
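
For reference, my understanding of the annotations (based on how I read the KubeShare README, so please correct me if this is wrong) is that gpu_request/gpu_limit are fractions of the GPU's compute time and gpu_mem is a byte count, so I expect the container to be capped at roughly 60% utilization and about 9.77 GiB of device memory:

# Plain shell arithmetic, nothing KubeShare-specific: convert the gpu_mem
# annotation to GiB to sanity-check the expected memory cap.
awk 'BEGIN { printf "%.2f GiB\n", 10485760000 / 2^30 }'   # -> 9.77 GiB

# Apply the spec (the file name is just whatever I saved the SharePod as).
kubectl apply -f sharepod1.yaml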

kubectl get pod -A

NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE
default       sharepod1                                  1/1     Running   0          3m39s
kube-system   calico-kube-controllers-7854b85cf7-sd5fw   1/1     Running   0          2d10h
kube-system   calico-node-ccdcp                          1/1     Running   0          2d10h
kube-system   coredns-54d67798b7-f5fv8                   1/1     Running   0          2d10h
kube-system   coredns-54d67798b7-rlvhg                   1/1     Running   0          2d10h
kube-system   etcd-k8s-master                            1/1     Running   0          2d10h
kube-system   kube-apiserver-k8s-master                  1/1     Running   0          2d10h
kube-system   kube-controller-manager-k8s-master         1/1     Running   0          2d10h
kube-system   kube-proxy-lz6jn                           1/1     Running   0          2d10h
kube-system   kube-scheduler-k8s-master                  1/1     Running   0          2d10h
kube-system   kubeshare-device-manager                   1/1     Running   0          2d10h
kube-system   kubeshare-node-daemon-f58tc                2/2     Running   0          2d10h
kube-system   kubeshare-scheduler                        1/1     Running   0          2d10h
kube-system   kubeshare-vgpu-k8s-master-gzwvx            1/1     Running   0          3m40s
kube-system   nvidia-device-plugin-daemonset-twghw       1/1     Running   0          2d10h
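
The SharePod was scheduled and bound without errors. In case it helps, this is what I would inspect to see which physical GPU the scheduler bound it to (I am not sure of the exact annotation keys KubeShare adds, so this only dumps everything):

# Dump the annotations on the SharePod and on the generated pod to see
# what the kubeshare-scheduler recorded (e.g. which GPU it was bound to).
kubectl get sharepod sharepod1 -o jsonpath='{.metadata.annotations}{"\n"}'
kubectl get pod sharepod1 -o jsonpath='{.metadata.annotations}{"\n"}'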

The output of kubectl logs sharepod1 shows the benchmark is running:

INFO:tensorflow:Running local_init_op.
I1010 01:02:41.913408 140019771205440 session_manager.py:505] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I1010 01:02:41.943301 140019771205440 session_manager.py:508] Done running local_init_op.
2022-10-10 01:02:42.579839: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2022-10-10 01:04:04.418663: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2022-10-10 01:17:41.610889: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: ptxas exited with non-zero error code 65280, output: ptxas fatal   : Value 'sm_86' is not defined for option 'gpu-name'

Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.
TensorFlow:  2.2
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  8 global
             8 per device
Num batches: 50000
Num epochs:  0.31
Devices:     ['/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Time    Step    Img/sec total_loss
2022-10-10 01:18        1       images/sec: 354.7 +/- 0.0 (jitter = 0.0)        nan
2022-10-10 01:18        10      images/sec: 355.0 +/- 0.4 (jitter = 0.8)        nan
2022-10-10 01:18        20      images/sec: 354.8 +/- 0.3 (jitter = 1.3)        nan
2022-10-10 01:18        30      images/sec: 354.7 +/- 0.2 (jitter = 1.2)        nan
2022-10-10 01:18        40      images/sec: 354.7 +/- 0.2 (jitter = 1.2)        nan
2022-10-10 01:18        50      images/sec: 66.9 +/- 7.0 (jitter = 1.4) nan
2022-10-10 01:18        60      images/sec: 77.3 +/- 5.8 (jitter = 1.4) nan
2022-10-10 01:18        70      images/sec: 87.0 +/- 5.0 (jitter = 1.4) nan
2022-10-10 01:18        80      images/sec: 96.0 +/- 4.4 (jitter = 1.3) nan
2022-10-10 01:18        90      images/sec: 104.5 +/- 3.9 (jitter = 1.4)        nan
2022-10-10 01:18        100     images/sec: 112.4 +/- 3.5 (jitter = 1.4)        nan
2022-10-10 01:18        110     images/sec: 119.8 +/- 3.2 (jitter = 1.5)        nan
2022-10-10 01:18        120     images/sec: 126.8 +/- 2.9 (jitter = 1.3)        nan
2022-10-10 01:18        130     images/sec: 133.4 +/- 2.7 (jitter = 1.4)        nan
2022-10-10 01:19        140     images/sec: 139.6 +/- 2.5 (jitter = 1.4)        nan
2022-10-10 01:19        150     images/sec: 145.5 +/- 2.4 (jitter = 1.5)        nan
2022-10-10 01:19        160     images/sec: 151.0 +/- 2.2 (jitter = 1.5)        nan
2022-10-10 01:19        170     images/sec: 156.3 +/- 2.1 (jitter = 1.5)        nan
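
While the benchmark was running I also wanted to check whether the gpu_limit of 0.6 caps utilization at around 60% on the host. A simple way to sample this (standard nvidia-smi query options, nothing KubeShare-specific):

# Sample GPU utilization and memory use once per second on the host
# while the benchmark inside sharepod1 is running.
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 1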

However, the resource isolation annotations do not seem to take effect.

nvidia-smi on the host

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  Off |
| 52%   81C    P2   328W / 450W |   8409MiB / 24564MiB |     96%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     83363      C   python                           8407MiB |
+-----------------------------------------------------------------------------+
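
To make sure PID 83363 really is the benchmark process from sharepod1 (and not something else on the host), one way to confirm would be to check its cgroup on the host:

# PID 83363 is taken from the nvidia-smi output above; the container ID in
# the cgroup path should match the sharepod1 container.
cat /proc/83363/cgroup
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv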

nvidia-smi inside sharepod1

root@sharepod1:~# nvidia-smi
Mon Oct 10 01:43:31 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  Off |
| 88%   83C    P2   332W / 450W |   8409MiB / 24564MiB |     95%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
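
Since nvidia-smi inside the pod still reports the full 24 GiB and ~95% utilization, I started to wonder whether the interception library is actually loaded in my container at all. This is what I would check; the environment-variable names and the /kubeshare path are only my guesses, since I do not know exactly what KubeShare injects:

# Check whether a CUDA-interception library is preloaded into the container.
kubectl exec sharepod1 -- sh -c 'echo "LD_PRELOAD=$LD_PRELOAD"'
kubectl exec sharepod1 -- sh -c 'env | grep -iE "gemini|kubeshare" || echo "no matching env vars"'
kubectl exec sharepod1 -- sh -c 'ls -l /kubeshare 2>/dev/null || echo "/kubeshare not mounted"'   # path is a guess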

I would like to ask what the problem might be. Any help would be appreciated.