NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

nvidia-smi killed after a while #271

Open sandrich opened 2 years ago

sandrich commented 2 years ago

I run a rapidsai container with a Jupyter notebook. When I freshly start the container, everything is fine and I can run GPU workloads inside the notebook:

Thu Oct 14 09:58:37 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      On   | 00000000:13:00.0 Off |                   On |
| N/A   37C    P0    65W / 250W |                  N/A |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    7   0   0  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Then, at some random point, the notebook kernel gets killed. When I then check nvidia-smi, it crashes:

nvidia-smi
Thu Oct 14 09:59:49 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
Killed

I am not sure how to debug this issue further or where it comes from.

Environment: OpenShift 4.7; GPU: NVIDIA A100 in MIG mode using the MIG manager; Operator: 1.7.1

ClusterPolicy

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  migManager:
    nodeSelector:
      nvidia.com/gpu.deploy.mig-manager: 'true'
    enabled: true
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia/cloud-native
    env:
      - name: WITH_REBOOT
        value: 'true'
    securityContext: {}
    version: 'sha256:495ed3b42e0541590c537ab1b33bda772aad530d3ef6a4f9384d3741a59e2bf8'
    image: k8s-mig-manager
    tolerations: []
    priorityClassName: system-node-critical
  operator:
    defaultRuntime: crio
    initContainer:
      image: cuda
      imagePullSecrets: []
      repository: nexus.bisinfo.org:8088/nvidia
      version: 'sha256:ba39801ba34370d6444689a860790787ca89e38794a11952d89a379d2e9c87b5'
    deployGFD: true
  gfd:
    nodeSelector:
      nvidia.com/gpu.deploy.gpu-feature-discovery: 'true'
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia
    env:
      - name: GFD_SLEEP_INTERVAL
        value: 60s
      - name: FAIL_ON_INIT_ERROR
        value: 'true'
    securityContext: {}
    version: 'sha256:bfc39d23568458dfd50c0c5323b6d42bdcd038c420fb2a2becd513a3ed3be27f'
    image: gpu-feature-discovery
    tolerations: []
    priorityClassName: system-node-critical
  dcgmExporter:
    nodeSelector:
      nvidia.com/gpu.deploy.dcgm-exporter: 'true'
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia/k8s
    securityContext: {}
    version: 'sha256:8af02463a8b60b21202d0bf69bc1ee0bb12f684fa367f903d138df6cacc2d0ac'
    image: dcgm-exporter
    tolerations: []
    priorityClassName: system-node-critical
  driver:
    licensingConfig:
      configMapName: 'licensing-config'
    nodeSelector:
      nvidia.com/gpu.deploy.driver: 'true'
    enabled: true
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia
    securityContext: {}
    repoConfig:
      configMapName: repo-config
      destinationDir: "/etc/yum.repos.d"
    version: 'sha256:09ba3eca64a80fab010a9fcd647a2675260272a8c3eb515dfed6dc38a2d31ead'
    image: driver
    tolerations: []
    priorityClassName: system-node-critical
  devicePlugin:
    nodeSelector:
      nvidia.com/gpu.deploy.device-plugin: 'true'
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia
    env:
      - name: PASS_DEVICE_SPECS
        value: 'true'
      - name: FAIL_ON_INIT_ERROR
        value: 'true'
      - name: DEVICE_LIST_STRATEGY
        value: envvar
      - name: DEVICE_ID_STRATEGY
        value: uuid
      - name: NVIDIA_VISIBLE_DEVICES
        value: all
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: all
    securityContext: {}
    version: 'sha256:85def0197f388e5e336b1ab0dbec350816c40108a58af946baa1315f4c96ee05'
    image: k8s-device-plugin
    tolerations: []
    args: []
    priorityClassName: system-node-critical
  mig:
    strategy: single
  validator:
    nodeSelector:
      nvidia.com/gpu.deploy.operator-validator: 'true'
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia/cloud-native
    env:
      - name: WITH_WORKLOAD
        value: 'true'
    securityContext: {}
    version: 'sha256:2bb62b9ca89bf9ae26399eeeeaf920d7752e617fa070c1120bf800253f624a10'
    image: gpu-operator-validator
    tolerations: []
    priorityClassName: system-node-critical
  toolkit:
    nodeSelector:
      nvidia.com/gpu.deploy.container-toolkit: 'true'
    enabled: true
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia/k8s
    securityContext: {}
    version: 1.5.0-ubi8
    image: container-toolkit
    tolerations: []
    priorityClassName: system-node-critical

Any idea how to debug where this issue comes from? Also, we need CUDA 11.2 support, so I suppose we cannot go with a newer toolkit image?

elezar commented 2 years ago

Hi @sandrich, thanks for reporting this. Regarding the toolkit version: it is independent of the CUDA version, which is determined by the driver installed on the system (in the case of the GPU Operator, most likely by the driver container).
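
As a quick sanity check, one way to compare the driver-reported version against the toolkit inside the container is something like the following (a sketch; nvcc is only present if the CUDA toolkit is installed in the image):

nvidia-smi --query-gpu=driver_version --format=csv,noheader   # driver version provided on the host (driver container)
nvidia-smi | head -n 3                                        # header also shows the highest CUDA version this driver supports
nvcc --version                                                # CUDA toolkit baked into the container image

The two CUDA versions do not need to match; the driver only needs to be new enough for the CUDA version the application was built against.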

@klueska I recall that, due to the runc bug referenced below (opencontainers/runc#2366), we saw long-running containers lose access to devices. Do you recall what our workaround was?

Update: the runc bug was triggered by the CPUManager issuing an update command for the container's CPU set every 10s, irrespective of whether any changes were required. Our workaround was to patch the CPUManager to only issue an update if something had changed. The changes have been merged into upstream Kubernetes 1.22, but I am uncertain of the backport status.
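
To see whether a given cluster already carries that change, a quick first step is to check the kubelet version reported for each node (a sketch):

oc get nodes -o wide   # the VERSION column shows the kubelet release running on each node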

klueska commented 2 years ago

The heavy-duty workaround is to update to a version of Kubernetes that contains this patch: https://github.com/kubernetes/kubernetes/pull/101771

The lighter-weight workaround would be to make sure that your pod requests a set of exclusive CPUs as described here (even just one exclusive CPU would be sufficient): https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/

sandrich commented 2 years ago

@klueska so that means adding a resources section requesting at least 1 full core, like so?

resources:
  requests:
    cpu: 1

The following resources were set in the test deployment

resources:
  limits:
    cpu: "1"
    memory: 1000Mi
    nvidia.com/gpu: "1"
  requests:
    cpu: "1"
    memory: 1000Mi
    nvidia.com/gpu: "1"

klueska commented 2 years ago

Yes, that is what I was suggesting. So you are seeing this error even with the setting above for CPU/memory? Is this the only container in the pod (no init containers or anything)?

sandrich commented 2 years ago

> Yes, that is what I was suggesting. So you are seeing this error even with the setting above for CPU/memory? Is this the only container in the pod (no init containers or anything)?

Exactly. The node has cpuManagerPolicy set to static

cat /etc/kubernetes/kubelet.conf | grep cpu
  "cpuManagerPolicy": "static",
  "cpuManagerReconcilePeriod": "5s",

And here are the pod details:

oc describe pod rapidsai-998589866-dkltb
Name:         rapidsai-998589866-dkltb
Namespace:    med-gpu-python-dev
Priority:     0
Node:         adchio1011.ocp-dev.opz.bisinfo.org/10.20.12.21
Start Time:   Fri, 15 Oct 2021 14:48:40 +0200
Labels:       app=rapidsai
              deployment=rapidsai
              pod-template-hash=998589866
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "100.70.4.26"
                    ],
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "100.70.4.26"
                    ],
                    "default": true,
                    "dns": {}
                }]
              openshift.io/scc: restricted
Status:       Running
IP:           100.70.4.26
IPs:
  IP:           100.70.4.26
Controlled By:  ReplicaSet/rapidsai-998589866
Containers:
  rapidsai:
    Container ID:  cri-o://bbf668d97da94e3a8de9b8df79a6c65ce7fa0c61026e060ce56afbcfc08b862d
    Image:         quay.bisinfo.org/by003457/r2106_cuda112_base_cent8-py37:latest
    Image ID:      quay.bisinfo.org/by003457/r2106_cuda112_base_cent8-py37@sha256:10cc2b92ae96a6f402c0b9ad6901c00cd9b3d37b5040fd2ba8e6fc8b279bb06c
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/conda/envs/rapids/bin/jupyter-lab
      --allow-root
      --notebook-dir=/var/jupyter/notebook
      --ip=0.0.0.0
      --no-browser
      --NotebookApp.token=''
      --NotebookApp.allow_origin="*"
    State:          Running
      Started:      Fri, 15 Oct 2021 14:48:44 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:             1
      memory:          1000Mi
      nvidia.com/gpu:  1
    Requests:
      cpu:             1
      memory:          1000Mi
      nvidia.com/gpu:  1
    Environment:
      HOME:  /tmp
    Mounts:
      /var/jupyter/notebook from jupyter-notebook (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6g9vj (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  jupyter-notebook:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  notebook
    ReadOnly:   false
  default-token-6g9vj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-6g9vj
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
klueska commented 2 years ago

OK. Yeah, everything looks good from the perspective of the pod specs, etc.

I’m guessing you must be running into the runc bug then: https://github.com/opencontainers/runc/issues/2366#issue-609480075

And the only way to avoid that is to update to a version of runc that has a fix for this or update to a kubelet with this patch: https://github.com/kubernetes/kubernetes/pull/101771

I was thinking before that ensuring you were a guaranteed pod was enough to bypass this bug, but looking into it more, it’s not.
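
If you want to confirm which runc build a node is actually running (to compare against releases that contain the fix), something like the following should work, though the exact invocation may differ on your cluster:

oc debug node/<node-name> -- chroot /host runc --version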

sandrich commented 2 years ago

> OK. Yeah, everything looks good from the perspective of the pod specs, etc.
>
> I’m guessing you must be running into the runc bug then: opencontainers/runc#2366 (comment)
>
> And the only way to avoid that is to update to a version of runc that has a fix for this or update to a kubelet with this patch: kubernetes/kubernetes#101771
>
> I was thinking before that ensuring you were a guaranteed pod was enough to bypass this bug, but looking into it more, it’s not.

Hi, doesn't OpenShift use CRI-O rather than runc?

sandrich commented 2 years ago

We also see the following in the node's logs:

[14136.622417] cuda-EvtHandlr invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), order=0, oom_score_adj=-997
[14136.622588] CPU: 1 PID: 711806 Comm: cuda-EvtHandlr Tainted: P           OE    --------- -  - 4.18.0-305.19.1.el8_4.x86_64 #1
[14136.622781] Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.17369862.B64.2012240522 12/24/2020
[14136.622987] Call Trace:
[14136.623038]  dump_stack+0x5c/0x80
[14136.623103]  dump_header+0x4a/0x1db
[14136.623168]  oom_kill_process.cold.32+0xb/0x10
[14136.623252]  out_of_memory+0x1ab/0x4a0
[14136.623322]  mem_cgroup_out_of_memory+0xe8/0x100
[14136.623406]  try_charge+0x65a/0x690
[14136.623470]  mem_cgroup_charge+0xca/0x220
[14136.623543]  __add_to_page_cache_locked+0x368/0x3d0
[14136.623632]  ? scan_shadow_nodes+0x30/0x30
[14136.623706]  add_to_page_cache_lru+0x4a/0xc0
[14136.623784]  iomap_readpages_actor+0x103/0x230
[14136.623865]  iomap_apply+0xfb/0x330
[14136.623930]  ? iomap_ioend_try_merge+0xe0/0xe0
[14136.624010]  ? __blk_mq_run_hw_queue+0x51/0xd0
[14136.624092]  ? iomap_ioend_try_merge+0xe0/0xe0
[14136.624172]  iomap_readpages+0xa8/0x1e0
[14136.624242]  ? iomap_ioend_try_merge+0xe0/0xe0
[14136.624322]  read_pages+0x6b/0x190
[14136.624385]  __do_page_cache_readahead+0x1c1/0x1e0
[14136.624470]  filemap_fault+0x783/0xa20
[14136.624538]  ? __mod_memcg_lruvec_state+0x21/0x100
[14136.624625]  ? page_add_file_rmap+0xef/0x130
[14136.624702]  ? alloc_set_pte+0x21c/0x440
[14136.624779]  ? _cond_resched+0x15/0x30
[14136.624885]  __xfs_filemap_fault+0x6d/0x200 [xfs]
[14136.624971]  __do_fault+0x36/0xd0
[14136.625033]  __handle_mm_fault+0xa7a/0xca0
[14136.625108]  handle_mm_fault+0xc2/0x1d0
[14136.625178]  __do_page_fault+0x1ed/0x4c0
[14136.625249]  do_page_fault+0x37/0x130
[14136.625316]  ? page_fault+0x8/0x30
[14136.625379]  page_fault+0x1e/0x30
[14136.625440] RIP: 0033:0x7fbd5b2b00e0
[14136.625508] Code: Unable to access opcode bytes at RIP 0x7fbd5b2b00b6.

I wonder if 16 GB of memory is not enough for the node serving the A100 card. It is a VM on VMware with direct passthrough; we are not using vGPU.

shivamerla commented 2 years ago

@sandrich did you try it out with increased memory allocated to the VM?

sandrich commented 2 years ago

@shivamerla I did, but it did not change anything. What did help was adding more memory to the container.
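
This is consistent with the kernel trace above: mem_cgroup_out_of_memory means the container's memory cgroup limit was being hit, rather than the node itself running out of memory. The change amounts to raising the container's memory request and limit, roughly like this (the 8Gi value is purely illustrative):

resources:
  requests:
    cpu: "1"
    memory: 8Gi          # illustrative; was 1000Mi
    nvidia.com/gpu: "1"
  limits:
    cpu: "1"
    memory: 8Gi
    nvidia.com/gpu: "1"

Keeping requests equal to limits preserves the Guaranteed QoS class discussed earlier in the thread.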

shivamerla commented 2 years ago

@sandrich can you check if the settings below are enabled on your VM:

pciPassthru.use64bitMMIO="TRUE"
pciPassthru.64bitMMIOSizeGB=128
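
If they are not, they can typically be added with the VM powered off, either as advanced configuration parameters in vSphere or directly in the VM's .vmx file, e.g.:

pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "128"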
sandrich commented 2 years ago

Yes, these are set:

(screenshot attached)