sandrich opened this issue 2 years ago
Hi @sandrich. Thanks for reporting this. Regarding the toolkit version: it is independent of the CUDA version, which is determined by the driver installed on the system (in the case of the GPU Operator, most likely by the driver container).
@klueska I recall that due to the following runc bug we saw that long-running containers would lose access to devices. Do you recall what our workaround was?
Update: The runc bug was triggered because CPUManager issued an update command for the container's CPU set every 10s, irrespective of whether changes were required. Our workaround was to patch CPUManager to only issue an update if something had changed. The changes have been merged into upstream Kubernetes 1.22, but I am uncertain of the backport status.
The heavy-duty workaround is to update to a version of Kubernetes that contains this patch: https://github.com/kubernetes/kubernetes/pull/101771
The lighter-weight workaround would be to make sure that your pod requests a set of exclusive CPUs as described here (even just one exclusive CPU would be sufficient): https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/
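For reference, a minimal sketch of such a pod spec (the names here are illustrative, not from the thread). For exclusive CPU allocation under the static CPU Manager policy the pod must be in the Guaranteed QoS class (requests equal to limits) and the CPU quantity must be an integer:

```yaml
# Hedged sketch: a pod that would receive exclusive CPUs under the static
# CPU Manager policy. Pod/container/image names are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: exclusive-cpu-example   # illustrative name
spec:
  containers:
  - name: app                   # illustrative name
    image: example.com/app:latest
    resources:
      requests:
        cpu: "1"                # integer CPU value -> eligible for exclusive cores
        memory: 1000Mi
      limits:
        cpu: "1"                # must equal the request for Guaranteed QoS
        memory: 1000Mi
```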
@klueska so that means adding a requests section with at least 1 full core, like so?
resources:
  requests:
    cpu: 1
The following resources were set in the test deployment:
resources:
  limits:
    cpu: "1"
    memory: 1000Mi
    nvidia.com/gpu: "1"
  requests:
    cpu: "1"
    memory: 1000Mi
    nvidia.com/gpu: "1"
Yes, that is what I was suggesting. So you are seeing this error even with the setting above for CPU/memory? Is this the only container in the pod (no init containers or anything)?
Exactly. The node has cpuManagerPolicy set to static
cat /etc/kubernetes/kubelet.conf | grep cpu
"cpuManagerPolicy": "static",
"cpuManagerReconcilePeriod": "5s",
And here are the pod details:
oc describe pod rapidsai-998589866-dkltb
Name:         rapidsai-998589866-dkltb
Namespace:    med-gpu-python-dev
Priority:     0
Node:         adchio1011.ocp-dev.opz.bisinfo.org/10.20.12.21
Start Time:   Fri, 15 Oct 2021 14:48:40 +0200
Labels:       app=rapidsai
              deployment=rapidsai
              pod-template-hash=998589866
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "100.70.4.26"
                    ],
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "100.70.4.26"
                    ],
                    "default": true,
                    "dns": {}
                }]
              openshift.io/scc: restricted
Status:       Running
IP:           100.70.4.26
IPs:
  IP:  100.70.4.26
Controlled By:  ReplicaSet/rapidsai-998589866
Containers:
  rapidsai:
    Container ID:  cri-o://bbf668d97da94e3a8de9b8df79a6c65ce7fa0c61026e060ce56afbcfc08b862d
    Image:         quay.bisinfo.org/by003457/r2106_cuda112_base_cent8-py37:latest
    Image ID:      quay.bisinfo.org/by003457/r2106_cuda112_base_cent8-py37@sha256:10cc2b92ae96a6f402c0b9ad6901c00cd9b3d37b5040fd2ba8e6fc8b279bb06c
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/conda/envs/rapids/bin/jupyter-lab
      --allow-root
      --notebook-dir=/var/jupyter/notebook
      --ip=0.0.0.0
      --no-browser
      --NotebookApp.token=''
      --NotebookApp.allow_origin="*"
    State:          Running
      Started:      Fri, 15 Oct 2021 14:48:44 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:             1
      memory:          1000Mi
      nvidia.com/gpu:  1
    Requests:
      cpu:             1
      memory:          1000Mi
      nvidia.com/gpu:  1
    Environment:
      HOME:  /tmp
    Mounts:
      /var/jupyter/notebook from jupyter-notebook (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6g9vj (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  jupyter-notebook:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  notebook
    ReadOnly:   false
  default-token-6g9vj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-6g9vj
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
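As a sanity check on the "QoS Class: Guaranteed" line above, the Kubernetes QoS classification can be sketched as follows (simplified: the real kubelet logic also handles request defaulting and extended resources):

```python
# Hedged sketch of the Kubernetes QoS classification rules, showing why the
# pod above lands in the Guaranteed class: every container sets cpu and
# memory, and requests equal limits.
def qos_class(containers):
    requests = [c.get("requests", {}) for c in containers]
    limits = [c.get("limits", {}) for c in containers]
    # No requests or limits anywhere -> BestEffort.
    if all(not r and not l for r, l in zip(requests, limits)):
        return "BestEffort"
    # Every container has cpu+memory limits matched by its requests -> Guaranteed.
    guaranteed = all(
        l.get("cpu") and l.get("memory")
        and r.get("cpu") == l.get("cpu")
        and r.get("memory") == l.get("memory")
        for r, l in zip(requests, limits)
    )
    return "Guaranteed" if guaranteed else "Burstable"

pod = [{"requests": {"cpu": "1", "memory": "1000Mi"},
        "limits":   {"cpu": "1", "memory": "1000Mi"}}]
print(qos_class(pod))  # Guaranteed
```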
OK. Yeah, everything looks good from the perspective of the pod specs, etc.
I’m guessing you must be running into the runc bug then: https://github.com/opencontainers/runc/issues/2366#issue-609480075
And the only way to avoid that is to update to a version of runc that has a fix for this or update to a kubelet with this patch: https://github.com/kubernetes/kubernetes/pull/101771
I was thinking before that ensuring you were a guaranteed pod was enough to bypass this bug, but looking into it more, it’s not.
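As a quick first check, one can compare the node's kubelet version against 1.22, where the CPUManager fix landed upstream. This is only a heuristic, since distributions may backport the patch; a hedged sketch:

```python
import re

# Hedged sketch: the CPUManager fix (kubernetes/kubernetes#101771) landed in
# upstream Kubernetes 1.22, but distros may backport it, so treat this as a
# first-pass check only. Feed it the output of `kubelet --version`,
# e.g. "Kubernetes v1.21.4".
def likely_has_cpumanager_fix(version_output):
    """Return True if the kubelet version is at least v1.22."""
    m = re.search(r"v(\d+)\.(\d+)", version_output)
    if not m:
        raise ValueError(f"unrecognized version string: {version_output!r}")
    major, minor = int(m.group(1)), int(m.group(2))
    return (major, minor) >= (1, 22)

print(likely_has_cpumanager_fix("Kubernetes v1.21.4"))  # False
print(likely_has_cpumanager_fix("Kubernetes v1.22.1"))  # True
```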
Hi, doesn't OpenShift use CRI-O rather than runc?
Also, we see the following in the node's logs:
[14136.622417] cuda-EvtHandlr invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), order=0, oom_score_adj=-997
[14136.622588] CPU: 1 PID: 711806 Comm: cuda-EvtHandlr Tainted: P OE --------- - - 4.18.0-305.19.1.el8_4.x86_64 #1
[14136.622781] Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.17369862.B64.2012240522 12/24/2020
[14136.622987] Call Trace:
[14136.623038]  dump_stack+0x5c/0x80
[14136.623103]  dump_header+0x4a/0x1db
[14136.623168]  oom_kill_process.cold.32+0xb/0x10
[14136.623252]  out_of_memory+0x1ab/0x4a0
[14136.623322]  mem_cgroup_out_of_memory+0xe8/0x100
[14136.623406]  try_charge+0x65a/0x690
[14136.623470]  mem_cgroup_charge+0xca/0x220
[14136.623543]  __add_to_page_cache_locked+0x368/0x3d0
[14136.623632]  ? scan_shadow_nodes+0x30/0x30
[14136.623706]  add_to_page_cache_lru+0x4a/0xc0
[14136.623784]  iomap_readpages_actor+0x103/0x230
[14136.623865]  iomap_apply+0xfb/0x330
[14136.623930]  ? iomap_ioend_try_merge+0xe0/0xe0
[14136.624010]  ? __blk_mq_run_hw_queue+0x51/0xd0
[14136.624092]  ? iomap_ioend_try_merge+0xe0/0xe0
[14136.624172]  iomap_readpages+0xa8/0x1e0
[14136.624242]  ? iomap_ioend_try_merge+0xe0/0xe0
[14136.624322]  read_pages+0x6b/0x190
[14136.624385]  __do_page_cache_readahead+0x1c1/0x1e0
[14136.624470]  filemap_fault+0x783/0xa20
[14136.624538]  ? __mod_memcg_lruvec_state+0x21/0x100
[14136.624625]  ? page_add_file_rmap+0xef/0x130
[14136.624702]  ? alloc_set_pte+0x21c/0x440
[14136.624779]  ? _cond_resched+0x15/0x30
[14136.624885]  __xfs_filemap_fault+0x6d/0x200 [xfs]
[14136.624971]  __do_fault+0x36/0xd0
[14136.625033]  __handle_mm_fault+0xa7a/0xca0
[14136.625108]  handle_mm_fault+0xc2/0x1d0
[14136.625178]  __do_page_fault+0x1ed/0x4c0
[14136.625249]  do_page_fault+0x37/0x130
[14136.625316]  ? page_fault+0x8/0x30
[14136.625379]  page_fault+0x1e/0x30
[14136.625440] RIP: 0033:0x7fbd5b2b00e0
[14136.625508] Code: Unable to access opcode bytes at RIP 0x7fbd5b2b00b6.
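Note that the `mem_cgroup_out_of_memory`/`try_charge` frames in this trace indicate a cgroup-level (container memory limit) OOM rather than a node-wide one, which is consistent with the later observation that adding memory to the container helped. A minimal sketch of pulling the key facts out of such a dmesg line:

```python
import re

# Sample taken verbatim from the trace above.
LINE = ("[14136.622417] cuda-EvtHandlr invoked oom-killer: "
        "gfp_mask=0x600040(GFP_NOFS), order=0, oom_score_adj=-997")

def parse_oom_line(line):
    """Extract the invoking task and oom_score_adj from an oom-killer line."""
    m = re.search(r"\]\s+(\S+) invoked oom-killer:.*oom_score_adj=(-?\d+)", line)
    if not m:
        return None
    return {"task": m.group(1), "oom_score_adj": int(m.group(2))}

print(parse_oom_line(LINE))
# {'task': 'cuda-EvtHandlr', 'oom_score_adj': -997}
```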
I wonder whether 16 GB of memory is enough for the node serving the A100 card. It is a VM on VMware with direct passthrough; we are not using vGPU.
@sandrich did you try it with increased memory allocated to the VM?
@shivamerla I did, which did not change anything. What did help was adding more memory to the container.
@sandrich can you check whether the settings below are enabled on your VM:
pciPassthru.use64bitMMIO="TRUE"
pciPassthru.64bitMMIOSizeGB=128
Yes, these are set.
I run a rapidsai container with a Jupyter notebook. When I freshly start the container, all is fine and I can run some GPU workload inside the notebook.
Then, at random, the notebook kernel gets killed, and when I check with nvidia-smi, it crashes.
I am not sure how to debug this further or where the issue comes from.
Environment: OpenShift 4.7
GPU: NVIDIA A100, MIG mode using the MIG manager
Operator: 1.7.1
ClusterPolicy
Any idea how to debug where this issue comes from? Also, we need CUDA 11.2 support; I suppose we cannot go with a newer toolkit image?