smithbk opened this issue 2 years ago
I don't know what could be going wrong here. We installed the GPU Operator v1.7.1 together from OperatorHub, and things were smooth after we solved https://github.com/NVIDIA/gpu-operator/issues/330,
but I don't know why the operator is now crashing hard and silently like that :/
For reference, here is a valid log of the GPU Operator v1.7.1 on OCP 4.6: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-rh-ecosystem-edge-ci-artifacts-master-4.6-nvidia-gpu-operator-e2e-1-7-0/1511116673678053376/artifacts/nvidia-gpu-operator-e2e-1-7-0/nightly/artifacts/012__gpu_operator__capture_deployment_state/gpu_operator.log
@kpouget Kevin, do you know who might be able to help with this? Thanks
@smithbk can you describe the operator Pod? We didn't see that together.
gpu-operator-566644fc46-2znxj 0/1 OOMKilled 5 6m27s
but likely this is the reason why the operator is crashing without any error message
@shivamerla do you remember a memory issue on 1.7.1, with 4 GPU nodes?
I see this in the Pod spec:
resources:
  limits:
    cpu: 500m
    memory: 250Mi
  requests:
    cpu: 200m
    memory: 100Mi
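A quick way to confirm the container is actually hitting that 250Mi limit would be to watch its live usage (a sketch, assuming cluster metrics are available):

# show per-container CPU/memory usage for the operator pod
oc adm top pod -n openshift-operators -l name=gpu-operator --containers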
@kpouget @shivamerla Here is the pod description
$ oc describe pod gpu-operator-566644fc46-2znxj
Name: gpu-operator-566644fc46-2znxj
Namespace: openshift-operators
Priority: 2000001000
Priority Class Name: system-node-critical
Node: ip-10-111-61-177.ec2.internal/10.111.61.177
Start Time: Tue, 05 Apr 2022 10:08:24 -0400
Labels: app.kubernetes.io/component=gpu-operator
name=gpu-operator
pod-template-hash=566644fc46
Annotations: alm-examples:
[
{
"apiVersion": "nvidia.com/v1",
"kind": "ClusterPolicy",
"metadata": {
"name": "gpu-cluster-policy"
},
"spec": {
"dcgmExporter": {
"affinity": {},
"image": "dcgm-exporter",
"imagePullSecrets": [],
"nodeSelector": {
"nvidia.com/gpu.deploy.dcgm-exporter": "true"
},
"podSecurityContext": {},
"repository": "nvcr.io/nvidia/k8s",
"resources": {},
"securityContext": {},
"tolerations": [],
"priorityClassName": "system-node-critical",
"version": "sha256:8af02463a8b60b21202d0bf69bc1ee0bb12f684fa367f903d138df6cacc2d0ac"
},
"devicePlugin": {
"affinity": {},
"image": "k8s-device-plugin",
"imagePullSecrets": [],
"args": [],
"env": [
{
"name": "PASS_DEVICE_SPECS",
"value": "true"
},
{
"name": "FAIL_ON_INIT_ERROR",
"value": "true"
},
{
"name": "DEVICE_LIST_STRATEGY",
"value": "envvar"
},
{
"name": "DEVICE_ID_STRATEGY",
"value": "uuid"
},
{
"name": "NVIDIA_VISIBLE_DEVICES",
"value": "all"
},
{
"name": "NVIDIA_DRIVER_CAPABILITIES",
"value": "all"
}
],
"nodeSelector": {
"nvidia.com/gpu.deploy.device-plugin": "true"
},
"podSecurityContext": {},
"repository": "nvcr.io/nvidia",
"resources": {},
"securityContext": {},
"tolerations": [],
"priorityClassName": "system-node-critical",
"version": "sha256:85def0197f388e5e336b1ab0dbec350816c40108a58af946baa1315f4c96ee05"
},
"driver": {
"enabled": true,
"affinity": {},
"image": "driver",
"imagePullSecrets": [],
"nodeSelector": {
"nvidia.com/gpu.deploy.driver": "true"
},
"podSecurityContext": {},
"repository": "nvcr.io/nvidia",
"resources": {},
"securityContext": {},
"tolerations": [],
"priorityClassName": "system-node-critical",
"repoConfig": {
"configMapName": "",
"destinationDir": ""
},
"licensingConfig": {
"configMapName": ""
},
"version": "sha256:09ba3eca64a80fab010a9fcd647a2675260272a8c3eb515dfed6dc38a2d31ead"
},
"gfd": {
"affinity": {},
"image": "gpu-feature-discovery",
"imagePullSecrets": [],
"env": [
{
"name": "GFD_SLEEP_INTERVAL",
"value": "60s"
},
{
"name": "FAIL_ON_INIT_ERROR",
"value": "true"
}
],
"nodeSelector": {
"nvidia.com/gpu.deploy.gpu-feature-discovery": "true"
},
"podSecurityContext": {},
"repository": "nvcr.io/nvidia",
"resources": {},
"securityContext": {},
"tolerations": [],
"priorityClassName": "system-node-critical",
"version": "sha256:bfc39d23568458dfd50c0c5323b6d42bdcd038c420fb2a2becd513a3ed3be27f"
},
"migManager": {
"enabled": true,
"affinity": {},
"image": "k8s-mig-manager",
"imagePullSecrets": [],
"env": [
{
"name": "WITH_REBOOT",
"value": "false"
}
],
"nodeSelector": {
"nvidia.com/gpu.deploy.mig-manager": "true"
},
"podSecurityContext": {},
"repository": "nvcr.io/nvidia/cloud-native",
"resources": {},
"securityContext": {},
"tolerations": [],
"priorityClassName": "system-node-critical",
"version": "sha256:495ed3b42e0541590c537ab1b33bda772aad530d3ef6a4f9384d3741a59e2bf8"
},
"operator": {
"defaultRuntime": "crio",
"deployGFD": true,
"initContainer": {
"image": "cuda",
"repository": "nvcr.io/nvidia",
"version": "sha256:15674e5c45c97994bc92387bad03a0d52d7c1e983709c471c4fecc8e806dbdce",
"imagePullSecrets": []
}
},
"mig": {
"strategy": "single"
},
"toolkit": {
"enabled": true,
"affinity": {},
"image": "container-toolkit",
"imagePullSecrets": [],
"nodeSelector": {
"nvidia.com/gpu.deploy.container-toolkit": "true"
},
"podSecurityContext": {},
"repository": "nvcr.io/nvidia/k8s",
"resources": {},
"securityContext": {},
"tolerations": [],
"priorityClassName": "system-node-critical",
"version": "sha256:ffa284f1f359d70f0e1d6d8e7752d7c92ef7445b0d74965a8682775de37febf8"
},
"validator": {
"affinity": {},
"image": "gpu-operator-validator",
"imagePullSecrets": [],
"nodeSelector": {
"nvidia.com/gpu.deploy.operator-validator": "true"
},
"podSecurityContext": {},
"repository": "nvcr.io/nvidia/cloud-native",
"resources": {},
"securityContext": {},
"tolerations": [],
"priorityClassName": "system-node-critical",
"version": "sha256:aa1f7bd526ae132c46f3ebe6ecfabe675889e240776ccc2155e31e0c48cc659e",
"env": [
{
"name": "WITH_WORKLOAD",
"value": "true"
}
]
}
}
}
]
capabilities: Basic Install
categories: AI/Machine Learning, OpenShift Optional
certified: true
cni.projectcalico.org/containerID: aa562b5de68796f144d43e698477d85a889705ce4db6df7dff95e20f82194464
cni.projectcalico.org/podIP: 172.27.15.52/32
cni.projectcalico.org/podIPs: 172.27.15.52/32
containerImage: nvcr.io/nvidia/gpu-operator:v1.7.1
createdAt: Wed Jun 16 06:51:51 PDT 2021
description: Automate the management and monitoring of NVIDIA GPUs.
k8s.v1.cni.cncf.io/network-status:
[{
"name": "",
"interface": "eth0",
"ips": [
"172.27.15.52"
],
"mac": "86:f1:9f:e8:4f:fe",
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "",
"interface": "eth0",
"ips": [
"172.27.15.52"
],
"mac": "86:f1:9f:e8:4f:fe",
"default": true,
"dns": {}
}]
olm.operatorGroup: global-operators
olm.operatorNamespace: openshift-operators
olm.targetNamespaces:
openshift.io/scc: hostmount-anyuid
operatorframework.io/properties:
{"properties":[{"type":"olm.gvk","value":{"group":"nvidia.com","kind":"ClusterPolicy","version":"v1"}},{"type":"olm.package","value":{"pac...
operators.openshift.io/infrastructure-features: ["Disconnected"]
operators.operatorframework.io/builder: operator-sdk-v1.4.0
operators.operatorframework.io/project_layout: go.kubebuilder.io/v3
provider: NVIDIA
repository: http://github.com/NVIDIA/gpu-operator
support: NVIDIA
Status: Running
IP: 172.27.15.52
IPs:
IP: 172.27.15.52
Controlled By: ReplicaSet/gpu-operator-566644fc46
Containers:
gpu-operator:
Container ID: cri-o://8f8e24b1c06329b3a19a218408c2ed4787c2d19b7babde6d2d5aceace96324b3
Image: nvcr.io/nvidia/gpu-operator@sha256:3a812cf113f416baca9262fa8423f36141f35696eb6e7a51a7abb40f5ccd5f8c
Image ID: nvcr.io/nvidia/gpu-operator@sha256:3a812cf113f416baca9262fa8423f36141f35696eb6e7a51a7abb40f5ccd5f8c
Port: <none>
Host Port: <none>
Command:
gpu-operator
Args:
--leader-elect
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Wed, 06 Apr 2022 08:06:16 -0400
Finished: Wed, 06 Apr 2022 08:06:48 -0400
Ready: False
Restart Count: 239
Limits:
cpu: 500m
memory: 250Mi
Requests:
cpu: 200m
memory: 100Mi
Liveness: http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
Readiness: http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
Environment:
HTTP_PROXY: http://proxy-app.discoverfinancial.com:8080
HTTPS_PROXY: http://proxy-app.discoverfinancial.com:8080
NO_PROXY: .artifactory.prdops3-app.ocp.aws.discoverfinancial.com,.aws.discoverfinancial.com,.cluster.local,.discoverfinancial.com,.ec2.internal,.na.discoverfinancial.com,.ocp-dev.artifactory.prdops3-app.ocp.aws.discoverfinancial.com,.ocp.aws.discoverfinancial.com,.ocpdev.us-east-1.ac.discoverfinancial.com,.prdops3-app.ocp.aws.discoverfinancial.com,.rw.discoverfinancial.com,.svc,10.0.0.0/8,10.111.0.0/16,127.0.0.1,169.254.169.254,172.23.0.0/16,172.24.0.0/14,api-int.aws-useast1-apps-lab-1.ocpdev.us-east-1.ac.discoverfinancial.com,artifactory.prdops3-app.ocp.aws.discoverfinancial.com,aws.discoverfinancial.com,discoverfinancial.com,ec2.internal,etcd-0.aws-useast1-apps-lab-1.ocpdev.us-east-1.ac.discoverfinancial.com,etcd-1.aws-useast1-apps-lab-1.ocpdev.us-east-1.ac.discoverfinancial.com,etcd-2.aws-useast1-apps-lab-1.ocpdev.us-east-1.ac.discoverfinancial.com,localhost,na.discoverfinancial.com,ocp-dev.artifactory.prdops3-app.ocp.aws.discoverfinancial.com,ocp.aws.discoverfinancial.com,ocpdev.us-east-1.ac.discoverfinancial.com,prdops3-app.ocp.aws.discoverfinancial.com,rw.discoverfinancial.com
Mounts:
/host-etc/os-release from host-os-release (ro)
/var/run/secrets/kubernetes.io/serviceaccount from gpu-operator-token-2w6p4 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
host-os-release:
Type: HostPath (bare host directory volume)
Path: /etc/os-release
HostPathType:
gpu-operator-token-2w6p4:
Type: Secret (a volume populated by a Secret)
SecretName: gpu-operator-token-2w6p4
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal AddedInterface 132m multus Add eth0 [172.27.15.27/32]
Warning Unhealthy 120m kubelet Readiness probe failed: Get "http://172.27.15.27:8081/readyz": dial tcp 172.27.15.27:8081: connect: connection refused
Normal Pulled 70m (x227 over 21h) kubelet Container image "nvcr.io/nvidia/gpu-operator@sha256:3a812cf113f416baca9262fa8423f36141f35696eb6e7a51a7abb40f5ccd5f8c" already present on machine
Normal AddedInterface 69m multus Add eth0 [172.27.15.52/32]
Warning Unhealthy 30m kubelet Liveness probe failed: Get "http://172.27.15.52:8081/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning BackOff 5m18s (x5622 over 21h) kubelet Back-off restarting failed container
still this,
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
but I expected to see more things in the Events / logs ... :/
can you check whether your node ip-10-111-61-177.ec2.internal (10.111.61.177) isn't running out of memory?
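e.g. with something like (a sketch, if cluster metrics are available):

# live memory usage of the node hosting the operator pod
oc adm top node ip-10-111-61-177.ec2.internal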
@kpouget Looks OK to me. If there is some other way of checking, let me know.
$ oc describe node ip-10-111-61-177.ec2.internal
Name: ip-10-111-61-177.ec2.internal
Roles: infra,worker
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=m5a.2xlarge
beta.kubernetes.io/os=linux
contact=OCPEngineers
cost_center=458690
enterprise.discover.com/cluster-id=aws-useast1-apps-lab-r2jkd
enterprise.discover.com/cluster-name=aws-useast1-apps-lab-1
enterprise.discover.com/cost_center=458690
enterprise.discover.com/data-classification=na
enterprise.discover.com/environment=lab
enterprise.discover.com/freedom=false
enterprise.discover.com/gdpr=false
enterprise.discover.com/openshift=true
enterprise.discover.com/openshift-role=worker
enterprise.discover.com/pci=false
enterprise.discover.com/product=common
enterprise.discover.com/public=false
enterprise.discover.com/support-assignment-group=OCPEngineering
failure-domain.beta.kubernetes.io/region=us-east-1
failure-domain.beta.kubernetes.io/zone=us-east-1d
feature.node.kubernetes.io/cpu-cpuid.ADX=true
feature.node.kubernetes.io/cpu-cpuid.AESNI=true
feature.node.kubernetes.io/cpu-cpuid.AVX=true
feature.node.kubernetes.io/cpu-cpuid.AVX2=true
feature.node.kubernetes.io/cpu-cpuid.FMA3=true
feature.node.kubernetes.io/cpu-cpuid.SHA=true
feature.node.kubernetes.io/cpu-cpuid.SSE4A=true
feature.node.kubernetes.io/cpu-hardware_multithreading=true
feature.node.kubernetes.io/custom-rdma.available=true
feature.node.kubernetes.io/kernel-selinux.enabled=true
feature.node.kubernetes.io/kernel-version.full=4.18.0-193.47.1.el8_2.x86_64
feature.node.kubernetes.io/kernel-version.major=4
feature.node.kubernetes.io/kernel-version.minor=18
feature.node.kubernetes.io/kernel-version.revision=0
feature.node.kubernetes.io/pci-1d0f.present=true
feature.node.kubernetes.io/storage-nonrotationaldisk=true
feature.node.kubernetes.io/system-os_release.ID=rhcos
feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION=4.6
feature.node.kubernetes.io/system-os_release.RHEL_VERSION=8.2
feature.node.kubernetes.io/system-os_release.VERSION_ID=4.6
feature.node.kubernetes.io/system-os_release.VERSION_ID.major=4
feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=6
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-10-111-61-177
kubernetes.io/os=linux
machine.openshift.io/cluster-api-cluster=aws-useast1-apps-lab-1
machine.openshift.io/cluster-api-cluster-name=aws-useast1-apps-lab-1
machine.openshift.io/cluster-api-machine-role=worker
machine.openshift.io/cluster-api-machineset=infra-1d
machine.openshift.io/cluster-api-machineset-group=infra
machine.openshift.io/cluster-api-machineset-ha=1d
node-role.kubernetes.io/infra=
node-role.kubernetes.io/worker=
node.kubernetes.io/instance-type=m5a.2xlarge
node.openshift.io/os_id=rhcos
route-reflector=true
topology.ebs.csi.aws.com/zone=us-east-1d
topology.kubernetes.io/region=us-east-1
topology.kubernetes.io/zone=us-east-1d
Annotations: csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0fc5da74c55fd897c"}
machine.openshift.io/machine: openshift-machine-api/infra-1d-rvc9x
machineconfiguration.openshift.io/currentConfig: rendered-worker-3a01af8a0304107341810791e3b3ad99
machineconfiguration.openshift.io/desiredConfig: rendered-worker-3a01af8a0304107341810791e3b3ad99
machineconfiguration.openshift.io/reason:
machineconfiguration.openshift.io/state: Done
nfd.node.kubernetes.io/extended-resources:
nfd.node.kubernetes.io/feature-labels:
cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.FMA3,cpu-cpuid.SHA,cpu-cpuid.SSE4A,cpu-hardware_multithreading,custom...
nfd.node.kubernetes.io/worker.version: 1.15
projectcalico.org/IPv4Address: 10.111.61.177/20
projectcalico.org/IPv4IPIPTunnelAddr: 172.27.15.0
projectcalico.org/RouteReflectorClusterID: 1.0.0.1
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Mon, 24 Jan 2022 17:04:22 -0500
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: ip-10-111-61-177.ec2.internal
AcquireTime: <unset>
RenewTime: Wed, 06 Apr 2022 11:57:41 -0400
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Thu, 17 Feb 2022 15:17:07 -0500 Thu, 17 Feb 2022 15:17:07 -0500 CalicoIsUp Calico is running on this node
MemoryPressure False Wed, 06 Apr 2022 11:54:51 -0400 Mon, 24 Jan 2022 17:04:22 -0500 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 06 Apr 2022 11:54:51 -0400 Mon, 24 Jan 2022 17:04:22 -0500 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 06 Apr 2022 11:54:51 -0400 Mon, 24 Jan 2022 17:04:22 -0500 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Wed, 06 Apr 2022 11:54:51 -0400 Mon, 24 Jan 2022 17:05:32 -0500 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.111.61.177
Hostname: ip-10-111-61-177.ec2.internal
InternalDNS: ip-10-111-61-177.ec2.internal
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 8
ephemeral-storage: 125277164Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32288272Ki
pods: 250
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 7500m
ephemeral-storage: 120795883220
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 31034896Ki
pods: 250
System Info:
Machine ID: ec29f9293380ea1eceab3523cbbd2b2a
System UUID: ec29f929-3380-ea1e-ceab-3523cbbd2b2a
Boot ID: 89e3a344-ba71-4882-8b39-97738890d719
Kernel Version: 4.18.0-193.47.1.el8_2.x86_64
OS Image: Red Hat Enterprise Linux CoreOS 46.82.202104170019-0 (Ootpa)
Operating System: linux
Architecture: amd64
Container Runtime Version: cri-o://1.19.1-11.rhaos4.6.git050df4c.el8
Kubelet Version: v1.19.0+a5a0987
Kube-Proxy Version: v1.19.0+a5a0987
ProviderID: aws:///us-east-1d/i-0fc5da74c55fd897c
Non-terminated Pods: (33 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
calico-system calico-node-wfr7j 0 (0%) 0 (0%) 0 (0%) 0 (0%) 47d
eng-attempt48 eventbus-default-stan-0 200m (2%) 400m (5%) 262144k (0%) 2Gi (6%) 35h
gremlin gremlin-pgxb4 0 (0%) 0 (0%) 0 (0%) 0 (0%) 71d
instana-agent instana-agent-x4snr 600m (8%) 2 (26%) 2112Mi (6%) 2Gi (6%) 20m
kube-system istio-cni-node-vskdr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 71d
openshift-cluster-csi-drivers aws-ebs-csi-driver-node-2z9sj 30m (0%) 0 (0%) 150Mi (0%) 0 (0%) 71d
openshift-cluster-node-tuning-operator tuned-49mnk 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 71d
openshift-compliance dfs-ocp4-cis-node-worker-ip-10-111-61-177.ec2.internal-pod 20m (0%) 200m (2%) 70Mi (0%) 600Mi (1%) 19d
openshift-compliance ocp4-cis-node-worker-ip-10-111-61-177.ec2.internal-pod 20m (0%) 200m (2%) 70Mi (0%) 600Mi (1%) 19d
openshift-dns dns-default-b4q5z 65m (0%) 0 (0%) 110Mi (0%) 512Mi (1%) 19d
openshift-image-registry node-ca-rfx9v 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 71d
openshift-ingress router-default-55c779749d-5g9l5 200m (2%) 0 (0%) 512Mi (1%) 0 (0%) 71d
openshift-kube-proxy openshift-kube-proxy-8lr6h 100m (1%) 0 (0%) 200Mi (0%) 0 (0%) 19d
openshift-machine-config-operator machine-config-daemon-7kmlj 40m (0%) 0 (0%) 100Mi (0%) 0 (0%) 71d
openshift-marketplace opencloud-operators-p8vss 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 37h
openshift-monitoring node-exporter-ttdb5 9m (0%) 0 (0%) 210Mi (0%) 0 (0%) 71d
openshift-monitoring prometheus-adapter-6b47cfbf98-rvgnt 1m (0%) 0 (0%) 25Mi (0%) 0 (0%) 2d16h
openshift-monitoring prometheus-operator-68d689dccc-t6rzm 6m (0%) 0 (0%) 100Mi (0%) 0 (0%) 3d16h
openshift-multus multus-594h4 10m (0%) 0 (0%) 150Mi (0%) 0 (0%) 19d
openshift-multus network-metrics-daemon-5ngdr 20m (0%) 0 (0%) 120Mi (0%) 0 (0%) 19d
openshift-nfd nfd-worker-8r252 0 (0%) 0 (0%) 0 (0%) 0 (0%) 14d
openshift-node splunk-rjhk7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 71d
openshift-operators gpu-operator-566644fc46-2znxj 200m (2%) 500m (6%) 100Mi (0%) 250Mi (0%) 25h
openshift-operators nfd-worker-qcf7l 0 (0%) 0 (0%) 0 (0%) 0 (0%) 39d
postgresql-operator postgresql-operator-79f8644dd9-krcfb 0 (0%) 0 (0%) 0 (0%) 0 (0%) 45h
sample-project mongodb-1-2n98n 0 (0%) 0 (0%) 512Mi (1%) 512Mi (1%) 55d
skunkworks backstage-67fc9f9b45-cx4x8 350m (4%) 700m (9%) 576Mi (1%) 1152Mi (3%) 42h
sysdig-agent sysdig-agent-fw94l 1 (13%) 2 (26%) 512Mi (1%) 1536Mi (5%) 37s
sysdig-agent sysdig-image-analyzer-8xwvw 250m (3%) 500m (6%) 512Mi (1%) 1536Mi (5%) 38s
sysdig-agent sysdig-image-analyzer-xpt4q 250m (3%) 500m (6%) 512Mi (1%) 1536Mi (5%) 14h
tigera-compliance compliance-benchmarker-br5xl 0 (0%) 0 (0%) 0 (0%) 0 (0%) 47d
tigera-fluentd fluentd-node-qzpxf 0 (0%) 0 (0%) 0 (0%) 0 (0%) 47d
vault-secrets-operator vault-secrets-operator-controller-7598f4bd5f-4cfdc 2 (26%) 2 (26%) 2Gi (6%) 2Gi (6%) 26s
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 5401m (72%) 9 (120%)
memory 9501147136 (29%) 14378Mi (47%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events: <none>
@kpouget Any other ideas of what to check, or someone else who would know? Thanks
@smithbk @kpouget Yes, I do remember this happening, where the GPU Operator memory usage momentarily spikes on OCP. We have yet to identify the cause. We can edit the CSV/Operator Deployment spec to allow the following limits:
resources:
  limits:
    cpu: 500m
    memory: 1Gi
  requests:
    cpu: 200m
    memory: 200Mi
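Since the operator is installed through OLM, a sketch of applying that via the CSV so the change is not reconciled away (the CSV name and the deployment/container indexes below are assumptions; verify them first with oc get csv -n openshift-operators and adjust as needed):

# assumed CSV name for the 1.7.1 certified operator; only the memory fields change here
oc patch csv gpu-operator-certified.v1.7.1 -n openshift-operators --type json -p '[
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/memory","value":"1Gi"},
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/requests/memory","value":"200Mi"}
]'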
@kpouget The pod is running now but the cluster policy status is not progressing. Here is what I'm seeing now.
$ oc get pod -n openshift-operators | grep gpu-operator
gpu-operator-889b67578-r57p5 1/1 Running 0 18m
Note the "ClusterPolicy step wasn't ready" messages below.
$ oc logs gpu-operator-889b67578-r57p5 -n openshift-operators --tail 50
2022-04-07T01:11:23.642Z INFO controllers.ClusterPolicy Found Resource {"ClusterRoleBinding": "nvidia-operator-validator", "Namespace": ""}
2022-04-07T01:11:23.654Z INFO controllers.ClusterPolicy Found Resource {"SecurityContextConstraints": "nvidia-operator-validator", "Namespace": "default"}
2022-04-07T01:11:23.664Z INFO controllers.ClusterPolicy Found Resource {"DaemonSet": "nvidia-operator-validator", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.664Z INFO controllers.ClusterPolicy DEBUG: DaemonSet {"LabelSelector": "app=nvidia-operator-validator"}
2022-04-07T01:11:23.664Z INFO controllers.ClusterPolicy DEBUG: DaemonSet {"NumberOfDaemonSets": 1}
2022-04-07T01:11:23.664Z INFO controllers.ClusterPolicy DEBUG: DaemonSet {"NumberUnavailable": 4}
2022-04-07T01:11:23.664Z INFO controllers.ClusterPolicy ClusterPolicy step wasn't ready {"State:": "notReady"}
2022-04-07T01:11:23.672Z INFO controllers.ClusterPolicy Found Resource {"ServiceAccount": "nvidia-device-plugin", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.680Z INFO controllers.ClusterPolicy Found Resource {"Role": "nvidia-device-plugin", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.689Z INFO controllers.ClusterPolicy Found Resource {"RoleBinding": "nvidia-device-plugin", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.703Z INFO controllers.ClusterPolicy Found Resource {"DaemonSet": "nvidia-device-plugin-daemonset", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.703Z INFO controllers.ClusterPolicy DEBUG: DaemonSet {"LabelSelector": "app=nvidia-device-plugin-daemonset"}
2022-04-07T01:11:23.703Z INFO controllers.ClusterPolicy DEBUG: DaemonSet {"NumberOfDaemonSets": 1}
2022-04-07T01:11:23.703Z INFO controllers.ClusterPolicy DEBUG: DaemonSet {"NumberUnavailable": 4}
2022-04-07T01:11:23.703Z INFO controllers.ClusterPolicy ClusterPolicy step wasn't ready {"State:": "notReady"}
2022-04-07T01:11:23.712Z INFO controllers.ClusterPolicy Found Resource {"ServiceAccount": "nvidia-dcgm-exporter", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.724Z INFO controllers.ClusterPolicy Found Resource {"Role": "prometheus-k8s", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.737Z INFO controllers.ClusterPolicy Found Resource {"RoleBinding": "prometheus-k8s", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.744Z INFO controllers.ClusterPolicy Found Resource {"Role": "prometheus-k8s", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.756Z INFO controllers.ClusterPolicy Found Resource {"RoleBinding": "prometheus-k8s", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.775Z INFO controllers.ClusterPolicy Found Resource {"Service": "nvidia-dcgm-exporter", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.784Z INFO controllers.ClusterPolicy Found Resource {"ServiceMonitor": "nvidia-dcgm-exporter", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.793Z INFO controllers.ClusterPolicy Found Resource {"ConfigMap": "nvidia-dcgm-exporter", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.804Z INFO controllers.ClusterPolicy Found Resource {"SecurityContextConstraints": "nvidia-dcgm-exporter", "Namespace": "default"}
2022-04-07T01:11:23.804Z INFO controllers.ClusterPolicy 4.18.0-193.47.1.el8_2.x86_64 {"Request.Namespace": "default", "Request.Name": "Node"}
2022-04-07T01:11:23.814Z INFO controllers.ClusterPolicy Found Resource {"DaemonSet": "nvidia-dcgm-exporter", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.814Z INFO controllers.ClusterPolicy DEBUG: DaemonSet {"LabelSelector": "app=nvidia-dcgm-exporter"}
2022-04-07T01:11:23.814Z INFO controllers.ClusterPolicy DEBUG: DaemonSet {"NumberOfDaemonSets": 1}
2022-04-07T01:11:23.814Z INFO controllers.ClusterPolicy DEBUG: DaemonSet {"NumberUnavailable": 4}
2022-04-07T01:11:23.814Z INFO controllers.ClusterPolicy ClusterPolicy step wasn't ready {"State:": "notReady"}
2022-04-07T01:11:23.821Z INFO controllers.ClusterPolicy Found Resource {"ServiceAccount": "nvidia-gpu-feature-discovery", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.828Z INFO controllers.ClusterPolicy Found Resource {"Role": "nvidia-gpu-feature-discovery", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.839Z INFO controllers.ClusterPolicy Found Resource {"RoleBinding": "nvidia-gpu-feature-discovery", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.850Z INFO controllers.ClusterPolicy Found Resource {"SecurityContextConstraints": "nvidia-gpu-feature-discovery", "Namespace": "default"}
2022-04-07T01:11:23.858Z INFO controllers.ClusterPolicy Found Resource {"DaemonSet": "gpu-feature-discovery", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.858Z INFO controllers.ClusterPolicy DEBUG: DaemonSet {"LabelSelector": "app=gpu-feature-discovery"}
2022-04-07T01:11:23.858Z INFO controllers.ClusterPolicy DEBUG: DaemonSet {"NumberOfDaemonSets": 1}
2022-04-07T01:11:23.858Z INFO controllers.ClusterPolicy DEBUG: DaemonSet {"NumberUnavailable": 4}
2022-04-07T01:11:23.858Z INFO controllers.ClusterPolicy ClusterPolicy step wasn't ready {"State:": "notReady"}
2022-04-07T01:11:23.866Z INFO controllers.ClusterPolicy Found Resource {"ServiceAccount": "nvidia-mig-manager", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.873Z INFO controllers.ClusterPolicy Found Resource {"Role": "nvidia-mig-manager", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.881Z INFO controllers.ClusterPolicy Found Resource {"ClusterRole": "nvidia-mig-manager", "Namespace": ""}
2022-04-07T01:11:23.891Z INFO controllers.ClusterPolicy Found Resource {"RoleBinding": "nvidia-mig-manager", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.909Z INFO controllers.ClusterPolicy Found Resource {"ClusterRoleBinding": "nvidia-mig-manager", "Namespace": ""}
2022-04-07T01:11:23.918Z INFO controllers.ClusterPolicy Found Resource {"ConfigMap": "mig-parted-config", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.942Z INFO controllers.ClusterPolicy Found Resource {"SecurityContextConstraints": "nvidia-driver", "Namespace": "default"}
2022-04-07T01:11:23.952Z INFO controllers.ClusterPolicy Found Resource {"DaemonSet": "nvidia-mig-manager", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.952Z INFO controllers.ClusterPolicy DEBUG: DaemonSet {"LabelSelector": "app=nvidia-mig-manager"}
2022-04-07T01:11:23.952Z INFO controllers.ClusterPolicy DEBUG: DaemonSet {"NumberOfDaemonSets": 1}
2022-04-07T01:11:23.952Z INFO controllers.ClusterPolicy DEBUG: DaemonSet {"NumberUnavailable": 0}
The pods in gpu-operator-resources are failing.
$ oc get pod -n gpu-operator-resources
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-2k9fw 0/1 Init:0/1 0 15m
gpu-feature-discovery-7dwvv 0/1 Init:0/1 0 15m
gpu-feature-discovery-tgl5k 0/1 Init:0/1 0 15m
gpu-feature-discovery-vgwlp 0/1 Init:0/1 0 15m
nvidia-container-toolkit-daemonset-c5xck 0/1 Init:0/1 0 15m
nvidia-container-toolkit-daemonset-cc59r 0/1 Init:0/1 0 15m
nvidia-container-toolkit-daemonset-fppnr 0/1 Init:0/1 0 15m
nvidia-container-toolkit-daemonset-jc64m 0/1 Init:0/1 0 15m
nvidia-dcgm-exporter-gb7c4 0/1 Init:0/2 0 15m
nvidia-dcgm-exporter-hm66s 0/1 Init:0/2 0 15m
nvidia-dcgm-exporter-mqzzk 0/1 Init:0/2 0 15m
nvidia-dcgm-exporter-msz6r 0/1 Init:0/2 0 15m
nvidia-device-plugin-daemonset-cj6bs 0/1 Init:0/1 0 15m
nvidia-device-plugin-daemonset-kn6x6 0/1 Init:0/1 0 15m
nvidia-device-plugin-daemonset-lktnb 0/1 Init:0/1 0 15m
nvidia-device-plugin-daemonset-lv6hx 0/1 Init:0/1 0 15m
nvidia-driver-daemonset-f8g6d 0/1 CrashLoopBackOff 7 15m
nvidia-driver-daemonset-hjvgl 0/1 CrashLoopBackOff 7 15m
nvidia-driver-daemonset-vb85p 0/1 CrashLoopBackOff 7 15m
nvidia-driver-daemonset-xj4tk 0/1 CrashLoopBackOff 7 15m
nvidia-operator-validator-pzp8s 0/1 Init:0/4 0 15m
nvidia-operator-validator-rd6cq 0/1 Init:0/4 0 15m
nvidia-operator-validator-t7n5z 0/1 Init:0/4 0 15m
nvidia-operator-validator-wzgp9 0/1 Init:0/4 0 15m
$ oc logs nvidia-driver-daemonset-f8g6d -n gpu-operator-resources
+ set -eu
+ RUN_DIR=/run/nvidia
+ PID_FILE=/run/nvidia/nvidia-driver.pid
+ DRIVER_VERSION=460.73.01
+ KERNEL_UPDATE_HOOK=/run/kernel/postinst.d/update-nvidia-driver
+ NUM_VGPU_DEVICES=0
+ RESOLVE_OCP_VERSION=true
+ '[' 1 -eq 0 ']'
+ command=init
+ shift
+ case "${command}" in
++ getopt -l accept-license -o a --
+ options=' --'
+ '[' 0 -ne 0 ']'
+ eval set -- ' --'
++ set -- --
+ ACCEPT_LICENSE=
++ uname -r
+ KERNEL_VERSION=4.18.0-193.47.1.el8_2.x86_64
+ PRIVATE_KEY=
+ PACKAGE_TAG=
+ for opt in ${options}
+ case "$opt" in
+ shift
+ break
+ '[' 0 -ne 0 ']'
+ _resolve_rhel_version
+ '[' -f /host-etc/os-release ']'
+ echo 'Resolving RHEL version...'
Resolving RHEL version...
+ local version=
++ cat /host-etc/os-release
++ sed -e 's/^"//' -e 's/"$//'
++ awk -F= '{print $2}'
++ grep '^ID='
+ local id=rhcos
+ '[' rhcos = rhcos ']'
++ grep RHEL_VERSION
++ awk -F= '{print $2}'
++ sed -e 's/^"//' -e 's/"$//'
++ cat /host-etc/os-release
+ version=8.2
+ '[' -z 8.2 ']'
+ RHEL_VERSION=8.2
+ echo 'Proceeding with RHEL version 8.2'
Proceeding with RHEL version 8.2
+ return 0
+ _resolve_ocp_version
+ '[' true = true ']'
++ jq '.items[].status.desired.version'
++ sed -e 's/^"//' -e 's/"$//'
++ awk -F. '{printf("%d.%d\n", $1, $2)}'
++ kubectl get clusterversion -o json
Unable to connect to the server: Proxy Authentication Required
+ local version=
Resolving OpenShift version...
+ echo 'Resolving OpenShift version...'
+ '[' -z '' ']'
+ echo 'Could not resolve OpenShift version'
Could not resolve OpenShift version
+ return 1
+ exit 1
It seems that the root cause of this problem is the following, right?
++ kubectl get clusterversion -o json
Unable to connect to the server: Proxy Authentication Required
But this cluster is configured with a proxy.
$ oc get proxy
NAME AGE
cluster 455d
Any ideas? Should I delete the cluster policy, delete the gpu-operator-resources namespace, and then recreate the cluster policy? I'm not sure if creation of the cluster policy recreates the gpu-operator-resources namespace or not.
@kpouget It appears that kubectl does not recognize CIDR ranges in the no_proxy environment variable, so it tries to send the request through the proxy.
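For reference, roughly what that kind of fix looks like if it is applied on the cluster-wide proxy (a sketch with placeholder values; spec.noProxy is the complete comma-separated list, so the existing entries have to be kept):

# hypothetical API service IP shown; use the address the failing kubectl call actually targets
oc patch proxy/cluster --type=merge -p '{"spec":{"noProxy":"<existing entries>,172.30.0.1"}}'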
Perhaps adding a test case with a proxy would be good.
Anyway, I added the appropriate IP to no_proxy and it is getting further, but is now failing as follows:
========== NVIDIA Software Installer ==========
+ echo -e 'Starting installation of NVIDIA driver version 460.73.01 for Linux kernel version 4.18.0-193.47.1.el8_2.x86_64\n'
Starting installation of NVIDIA driver version 460.73.01 for Linux kernel version 4.18.0-193.47.1.el8_2.x86_64
+ exec
+ flock -n 3
+ echo 1946547
+ trap 'echo '\''Caught signal'\''; exit 1' HUP INT QUIT PIPE TERM
+ trap _shutdown EXIT
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
+ local nvidia_modeset_refs=0
+ echo 'Stopping NVIDIA persistence daemon...'
Stopping NVIDIA persistence daemon...
+ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
+ '[' -f /var/run/nvidia-gridd/nvidia-gridd.pid ']'
+ echo 'Unloading NVIDIA driver kernel modules...'
Unloading NVIDIA driver kernel modules...
+ '[' -f /sys/module/nvidia_modeset/refcnt ']'
+ '[' -f /sys/module/nvidia_uvm/refcnt ']'
+ '[' -f /sys/module/nvidia/refcnt ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ return 0
+ _unmount_rootfs
+ echo 'Unmounting NVIDIA driver rootfs...'
Unmounting NVIDIA driver rootfs...
+ findmnt -r -o TARGET
+ grep /run/nvidia/driver
+ _kernel_requires_package
+ local proc_mount_arg=
+ echo 'Checking NVIDIA driver packages...'
Checking NVIDIA driver packages...
+ [[ ! -d /usr/src/nvidia-460.73.01/kernel ]]
+ cd /usr/src/nvidia-460.73.01/kernel
+ proc_mount_arg='--proc-mount-point /lib/modules/4.18.0-193.47.1.el8_2.x86_64/proc'
++ ls -d -1 'precompiled/**'
+ return 0
+ _update_package_cache
+ '[' '' '!=' builtin ']'
+ echo 'Updating the package cache...'
Updating the package cache...
+ yum -q makecache
Error: Failed to download metadata for repo 'cuda': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
+ _shutdown
@smithbk looks like access to the CUDA repository is blocked through the proxy. Can you check whether developer.download.nvidia.com is blocked?
Also, to test whether the driver container can reach all the repositories, you can run:
cat <<EOF > test-ca-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: trusted-ca
  labels:
    config.openshift.io/inject-trusted-cabundle: "true"
EOF

cat <<EOF > test-entitlements-proxy.yaml
apiVersion: v1
kind: Pod
metadata:
  name: entitlements-proxy
spec:
  containers:
  - name: cluster-entitled-build
    image: registry.access.redhat.com/ubi8:latest
    command: [ "/bin/sh", "-c", "dnf -d 5 search kernel-devel --showduplicates" ]
    env:
    - name: HTTP_PROXY
      value: ${HTTP_PROXY}
    - name: HTTPS_PROXY
      value: ${HTTPS_PROXY}
    - name: NO_PROXY
      value: ${NO_PROXY}
    volumeMounts:
    - name: trusted-ca
      mountPath: "/etc/pki/ca-trust/extracted/pem/"
      readOnly: true
  volumes:
  - name: trusted-ca
    configMap:
      name: trusted-ca
      items:
      - key: ca-bundle.crt
        path: tls-ca-bundle.pem
  restartPolicy: Never
EOF
oc apply -f test-ca-configmap.yaml -f test-entitlements-proxy.yaml
You can get the HTTP_PROXY, HTTPS_PROXY, and NO_PROXY values from the cluster-wide proxy: oc describe proxy cluster
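If it helps, a small sketch for pulling those values straight from the cluster-wide proxy and exporting them before running the cat <<EOF snippets above (the unquoted heredocs expand ${HTTP_PROXY} and friends when the files are created):

# read the effective proxy settings from the cluster-wide proxy object
export HTTP_PROXY=$(oc get proxy/cluster -o jsonpath='{.status.httpProxy}')
export HTTPS_PROXY=$(oc get proxy/cluster -o jsonpath='{.status.httpsProxy}')
export NO_PROXY=$(oc get proxy/cluster -o jsonpath='{.status.noProxy}')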
We hit memory issues on OCP after upgrading the NVIDIA operator recently. We were running under 1 Gi previously, and since then the operator pod hits over 2.5 Gi on startup. In the past, as seen with other operators, this was usually because the operator was configured to list/watch objects at cluster scope; in large clusters with many objects that means more data being returned to the operator. I don't know if that's the case for this operator, but I see it has cluster role bindings. I did not dig into it further; we bumped up the memory again and it's working for now.
Thanks @ctrought, I will work with Red Hat to understand this behavior on OCP. We are not seeing this with K8s. The operator does fetch all node labels at startup, but it should not momentarily consume that much memory.
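If it helps narrow this down, a rough way to gauge how much data a cluster-scoped node list returns on the affected cluster (just a sanity check, not necessarily the operator's exact code path):

# approximate size in bytes of a full node list, labels included
oc get nodes -o json | wc -c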
1. Quick Debug Checklist
[ ] Are i2c_core and ipmi_msghandler loaded on the nodes?
[ ] Did you apply the CRD? (kubectl describe clusterpolicies --all-namespaces)
1. Issue or feature description
The gpu operator pod is in CrashLoopBackOff.
NOTE: This is a follow-on to https://github.com/NVIDIA/gpu-operator/issues/330.
2. Steps to reproduce the issue
I am on OpenShift version 4.6.26 and am trying to install the NVIDIA GPU Operator v1.7.1 via the console.
3. Information to attach (optional if deemed irrelevant)
[ ] kubernetes pods status: kubectl get pods --all-namespaces
[ ] kubernetes daemonset status: kubectl get ds --all-namespaces
[ ] If a pod/ds is in an error state or pending state: kubectl describe pod -n NAMESPACE POD_NAME
[ ] If a pod/ds is in an error state or pending state: kubectl logs -n NAMESPACE POD_NAME
[ ] Output of running a container on the GPU machine: docker run -it alpine echo foo
[ ] Docker configuration file: cat /etc/docker/daemon.json
[ ] Docker runtime configuration: docker info | grep runtime
[ ] NVIDIA shared directory: ls -la /run/nvidia
[ ] NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit
[ ] NVIDIA driver directory: ls -la /run/nvidia/driver
[ ] kubelet logs: journalctl -u kubelet > kubelet.logs
The following shows the state and logs of the GPU operator pod.