NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

gpu-operator pod in CrashLoopBackOff #331

Open · smithbk opened this issue 2 years ago

smithbk commented 2 years ago

1. Issue or feature description

The gpu operator pod is in CrashLoopBackOff.

NOTE: This is a follow-on to https://github.com/NVIDIA/gpu-operator/issues/330.

2. Steps to reproduce the issue

I am on OpenShift 4.6.26 and trying to install the NVIDIA GPU Operator v1.7.1 via the console.

3. Information to attach (optional if deemed irrelevant)

The following shows the state of the gpu-operator pod and its logs.

$ oc get pod gpu-operator-566644fc46-2znxj
NAME                            READY   STATUS    RESTARTS   AGE
gpu-operator-566644fc46-2znxj   1/1     Running   5          6m16s
$ oc get pod gpu-operator-566644fc46-2znxj
NAME                            READY   STATUS      RESTARTS   AGE
gpu-operator-566644fc46-2znxj   0/1     OOMKilled   5          6m27s
$ oc get pod gpu-operator-566644fc46-2znxj
NAME                            READY   STATUS             RESTARTS   AGE
gpu-operator-566644fc46-2znxj   0/1     CrashLoopBackOff   5          6m31s
$ oc logs gpu-operator-566644fc46-2znxj -f
I0405 14:09:13.490124       1 request.go:655] Throttling request took 1.043831213s, request: GET:https://172.23.0.1:443/apis/operator.ibm.com/v1?timeout=32s
2022-04-05T14:09:21.300Z    INFO    controller-runtime.metrics  metrics server is starting to listen    {"addr": ":8080"}
2022-04-05T14:09:21.301Z    INFO    controller-runtime.injectors-warning    Injectors are deprecated, and will be removed in v0.10.x
2022-04-05T14:09:21.301Z    INFO    controller-runtime.injectors-warning    Injectors are deprecated, and will be removed in v0.10.x
2022-04-05T14:09:21.301Z    INFO    controller-runtime.injectors-warning    Injectors are deprecated, and will be removed in v0.10.x
2022-04-05T14:09:21.301Z    INFO    controller-runtime.injectors-warning    Injectors are deprecated, and will be removed in v0.10.x
2022-04-05T14:09:21.301Z    INFO    controller-runtime.injectors-warning    Injectors are deprecated, and will be removed in v0.10.x
2022-04-05T14:09:21.301Z    INFO    controller-runtime.injectors-warning    Injectors are deprecated, and will be removed in v0.10.x
2022-04-05T14:09:21.301Z    INFO    controller-runtime.injectors-warning    Injectors are deprecated, and will be removed in v0.10.x
2022-04-05T14:09:21.301Z    INFO    setup   starting manager
I0405 14:09:21.301793       1 leaderelection.go:243] attempting to acquire leader lease openshift-operators/53822513.nvidia.com...
2022-04-05T14:09:21.301Z    INFO    controller-runtime.manager  starting metrics server {"path": "/metrics"}
I0405 14:09:38.742220       1 leaderelection.go:253] successfully acquired lease openshift-operators/53822513.nvidia.com
2022-04-05T14:09:38.742Z    INFO    controller-runtime.manager.controller.clusterpolicy-controller  Starting EventSource    {"source": "kind source: /, Kind="}
2022-04-05T14:09:38.742Z    DEBUG   controller-runtime.manager.events   Normal  {"object": {"kind":"ConfigMap","namespace":"openshift-operators","name":"53822513.nvidia.com","uid":"20e58758-fe21-40f9-80b7-3f7d24ecea7e","apiVersion":"v1","resourceVersion":"2508020210"}, "reason": "LeaderElection", "message": "gpu-operator-566644fc46-2znxj_81321000-656f-4d46-bb25-09f9cd573143 became leader"}
2022-04-05T14:09:38.742Z    DEBUG   controller-runtime.manager.events   Normal  {"object": {"kind":"Lease","namespace":"openshift-operators","name":"53822513.nvidia.com","uid":"1f1a391c-afac-4452-b669-3543e388e16f","apiVersion":"coordination.k8s.io/v1","resourceVersion":"2508020211"}, "reason": "LeaderElection", "message": "gpu-operator-566644fc46-2znxj_81321000-656f-4d46-bb25-09f9cd573143 became leader"}
2022-04-05T14:09:38.843Z    INFO    controller-runtime.manager.controller.clusterpolicy-controller  Starting EventSource    {"source": "kind source: /, Kind="}
2022-04-05T14:09:38.943Z    INFO    controller-runtime.manager.controller.clusterpolicy-controller  Starting EventSource    {"source": "kind source: /, Kind="}
2022-04-05T14:09:39.949Z    INFO    controller-runtime.manager.controller.clusterpolicy-controller  Starting Controller
2022-04-05T14:09:39.949Z    INFO    controller-runtime.manager.controller.clusterpolicy-controller  Starting workers    {"worker count": 1}
2022-04-05T14:09:39.955Z    INFO    controllers.ClusterPolicy   Getting assets from:    {"path:": "/opt/gpu-operator/pre-requisites"}
2022-04-05T14:09:39.956Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "Namespace", "in path:": "/opt/gpu-operator/pre-requisites"}
2022-04-05T14:09:39.956Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "RuntimeClass", "in path:": "/opt/gpu-operator/pre-requisites"}
2022-04-05T14:09:39.957Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "PodSecurityPolicy", "in path:": "/opt/gpu-operator/pre-requisites"}
2022-04-05T14:09:39.959Z    INFO    controllers.ClusterPolicy   Getting assets from:    {"path:": "/opt/gpu-operator/state-driver"}
2022-04-05T14:09:39.959Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "ServiceAccount", "in path:": "/opt/gpu-operator/state-driver"}
2022-04-05T14:09:39.960Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "Role", "in path:": "/opt/gpu-operator/state-driver"}
2022-04-05T14:09:39.961Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "ClusterRole", "in path:": "/opt/gpu-operator/state-driver"}
2022-04-05T14:09:39.961Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "RoleBinding", "in path:": "/opt/gpu-operator/state-driver"}
2022-04-05T14:09:39.962Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "ClusterRoleBinding", "in path:": "/opt/gpu-operator/state-driver"}
2022-04-05T14:09:39.963Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "ConfigMap", "in path:": "/opt/gpu-operator/state-driver"}
2022-04-05T14:09:39.963Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "SecurityContextConstraints", "in path:": "/opt/gpu-operator/state-driver"}
2022-04-05T14:09:39.964Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "DaemonSet", "in path:": "/opt/gpu-operator/state-driver"}
2022-04-05T14:09:39.971Z    INFO    controllers.ClusterPolicy   Getting assets from:    {"path:": "/opt/gpu-operator/state-container-toolkit"}
2022-04-05T14:09:39.971Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "ServiceAccount", "in path:": "/opt/gpu-operator/state-container-toolkit"}
2022-04-05T14:09:39.971Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "Role", "in path:": "/opt/gpu-operator/state-container-toolkit"}
2022-04-05T14:09:39.971Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "RoleBinding", "in path:": "/opt/gpu-operator/state-container-toolkit"}
2022-04-05T14:09:39.972Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "DaemonSet", "in path:": "/opt/gpu-operator/state-container-toolkit"}
2022-04-05T14:09:39.972Z    INFO    controllers.ClusterPolicy   Getting assets from:    {"path:": "/opt/gpu-operator/state-operator-validation"}
2022-04-05T14:09:39.973Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "ServiceAccount", "in path:": "/opt/gpu-operator/state-operator-validation"}
2022-04-05T14:09:39.973Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "Role", "in path:": "/opt/gpu-operator/state-operator-validation"}
2022-04-05T14:09:39.973Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "ClusterRole", "in path:": "/opt/gpu-operator/state-operator-validation"}
2022-04-05T14:09:39.973Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "RoleBinding", "in path:": "/opt/gpu-operator/state-operator-validation"}
2022-04-05T14:09:39.973Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "ClusterRoleBinding", "in path:": "/opt/gpu-operator/state-operator-validation"}
2022-04-05T14:09:39.974Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "SecurityContextConstraints", "in path:": "/opt/gpu-operator/state-operator-validation"}
2022-04-05T14:09:39.974Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "DaemonSet", "in path:": "/opt/gpu-operator/state-operator-validation"}
2022-04-05T14:09:39.975Z    INFO    controllers.ClusterPolicy   Getting assets from:    {"path:": "/opt/gpu-operator/state-device-plugin"}
2022-04-05T14:09:39.975Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "ServiceAccount", "in path:": "/opt/gpu-operator/state-device-plugin"}
2022-04-05T14:09:39.976Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "Role", "in path:": "/opt/gpu-operator/state-device-plugin"}
2022-04-05T14:09:39.976Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "RoleBinding", "in path:": "/opt/gpu-operator/state-device-plugin"}
2022-04-05T14:09:39.976Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "DaemonSet", "in path:": "/opt/gpu-operator/state-device-plugin"}
2022-04-05T14:09:39.977Z    INFO    controllers.ClusterPolicy   Getting assets from:    {"path:": "/opt/gpu-operator/state-monitoring"}
2022-04-05T14:09:39.977Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "ServiceAccount", "in path:": "/opt/gpu-operator/state-monitoring"}
2022-04-05T14:09:39.978Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "Role", "in path:": "/opt/gpu-operator/state-monitoring"}
2022-04-05T14:09:39.978Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "RoleBinding", "in path:": "/opt/gpu-operator/state-monitoring"}
2022-04-05T14:09:39.978Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "Role", "in path:": "/opt/gpu-operator/state-monitoring"}
2022-04-05T14:09:39.978Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "RoleBinding", "in path:": "/opt/gpu-operator/state-monitoring"}
2022-04-05T14:09:39.979Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "Service", "in path:": "/opt/gpu-operator/state-monitoring"}
2022-04-05T14:09:39.980Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "ServiceMonitor", "in path:": "/opt/gpu-operator/state-monitoring"}
2022-04-05T14:09:39.982Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "ConfigMap", "in path:": "/opt/gpu-operator/state-monitoring"}
2022-04-05T14:09:39.982Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "SecurityContextConstraints", "in path:": "/opt/gpu-operator/state-monitoring"}
2022-04-05T14:09:39.983Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "DaemonSet", "in path:": "/opt/gpu-operator/state-monitoring"}
2022-04-05T14:09:39.983Z    INFO    controllers.ClusterPolicy   Getting assets from:    {"path:": "/opt/gpu-operator/gpu-feature-discovery"}
2022-04-05T14:09:39.984Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "ServiceAccount", "in path:": "/opt/gpu-operator/gpu-feature-discovery"}
2022-04-05T14:09:39.984Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "Role", "in path:": "/opt/gpu-operator/gpu-feature-discovery"}
2022-04-05T14:09:39.984Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "RoleBinding", "in path:": "/opt/gpu-operator/gpu-feature-discovery"}
2022-04-05T14:09:39.985Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "SecurityContextConstraints", "in path:": "/opt/gpu-operator/gpu-feature-discovery"}
2022-04-05T14:09:39.985Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "DaemonSet", "in path:": "/opt/gpu-operator/gpu-feature-discovery"}
2022-04-05T14:09:39.986Z    INFO    controllers.ClusterPolicy   Getting assets from:    {"path:": "/opt/gpu-operator/state-mig-manager"}
2022-04-05T14:09:39.986Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "ServiceAccount", "in path:": "/opt/gpu-operator/state-mig-manager"}
2022-04-05T14:09:39.987Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "Role", "in path:": "/opt/gpu-operator/state-mig-manager"}
2022-04-05T14:09:39.987Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "ClusterRole", "in path:": "/opt/gpu-operator/state-mig-manager"}
2022-04-05T14:09:39.987Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "RoleBinding", "in path:": "/opt/gpu-operator/state-mig-manager"}
2022-04-05T14:09:39.987Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "ClusterRoleBinding", "in path:": "/opt/gpu-operator/state-mig-manager"}
2022-04-05T14:09:39.987Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "ConfigMap", "in path:": "/opt/gpu-operator/state-mig-manager"}
2022-04-05T14:09:39.988Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "SecurityContextConstraints", "in path:": "/opt/gpu-operator/state-mig-manager"}
2022-04-05T14:09:39.988Z    INFO    controllers.ClusterPolicy   DEBUG: Looking for  {"Kind": "DaemonSet", "in path:": "/opt/gpu-operator/state-mig-manager"}
time="2022-04-05T14:09:39Z" level=info msg="Checking GPU state labels on the nodeNodeNameip-10-111-2-106.ec2.internal" source="state_manager.go:236"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.drivervaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.gpu-feature-discoveryvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.container-toolkitvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.device-pluginvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.dcgm-exportervaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.operator-validatorvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg="Checking GPU state labels on the nodeNodeNameip-10-111-35-191.ec2.internal" source="state_manager.go:236"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.dcgm-exportervaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.operator-validatorvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.drivervaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.gpu-feature-discoveryvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.container-toolkitvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.device-pluginvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg="Checking GPU state labels on the nodeNodeNameip-10-111-6-113.ec2.internal" source="state_manager.go:236"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.drivervaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.gpu-feature-discoveryvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.container-toolkitvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.device-pluginvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.dcgm-exportervaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.operator-validatorvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg="Checking GPU state labels on the nodeNodeNameip-10-111-5-155.ec2.internal" source="state_manager.go:236"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.drivervaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.gpu-feature-discoveryvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.container-toolkitvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.device-pluginvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.dcgm-exportervaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.operator-validatorvaluetrue" source="state_manager.go:137"
2022-04-05T14:09:40.020Z    INFO    controllers.ClusterPolicy   Found Resource  {"Namespace": "gpu-operator-resources"}
2022-04-05T14:09:40.029Z    INFO    controllers.ClusterPolicy   Found Resource  {"RuntimeClass": "nvidia"}
2022-04-05T14:09:40.041Z    INFO    controllers.ClusterPolicy   Found Resource  {"ServiceAccount": "nvidia-driver", "Namespace": "gpu-operator-resources"}
2022-04-05T14:09:40.050Z    INFO    controllers.ClusterPolicy   Found Resource  {"Role": "nvidia-driver", "Namespace": "gpu-operator-resources"}
2022-04-05T14:09:40.059Z    INFO    controllers.ClusterPolicy   Found Resource  {"ClusterRole": "nvidia-driver", "Namespace": ""}
2022-04-05T14:09:40.069Z    INFO    controllers.ClusterPolicy   Found Resource  {"RoleBinding": "nvidia-driver", "Namespace": "gpu-operator-resources"}
2022-04-05T14:09:40.079Z    INFO    controllers.ClusterPolicy   Found Resource  {"ClusterRoleBinding": "nvidia-driver", "Namespace": ""}
2022-04-05T14:09:40.088Z    INFO    controllers.ClusterPolicy   Found Resource  {"ConfigMap": "nvidia-driver", "Namespace": "gpu-operator-resources"}
2022-04-05T14:09:40.099Z    INFO    controllers.ClusterPolicy   Found Resource  {"SecurityContextConstraints": "nvidia-driver", "Namespace": "default"}
2022-04-05T14:09:40.099Z    INFO    controllers.ClusterPolicy   4.18.0-193.47.1.el8_2.x86_64    {"Request.Namespace": "default", "Request.Name": "Node"}
kpouget commented 2 years ago

I don't know what could be going wrong here. We installed the GPU Operator v1.7.1 together from OperatorHub, and things went smoothly after we solved https://github.com/NVIDIA/gpu-operator/issues/330,

but I don't know why the operator is crashing hard and silently like that :/

for reference, here is a valid log of the GPU Operator v1.7.1 on OCP 4.6: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-rh-ecosystem-edge-ci-artifacts-master-4.6-nvidia-gpu-operator-e2e-1-7-0/1511116673678053376/artifacts/nvidia-gpu-operator-e2e-1-7-0/nightly/artifacts/012__gpu_operator__capture_deployment_state/gpu_operator.log

smithbk commented 2 years ago

@kpouget Kevin, do you know who might be able to help with this? Thanks

kpouget commented 2 years ago

@smithbk can you describe the operator Pod?

we didn't see that when we debugged it together

gpu-operator-566644fc46-2znxj   0/1     OOMKilled   5          6m27s

but likely this is the reason why the operator is crashing without any error message

@shivamerla do you remember a memory issue on 1.7.1, with 4 GPU nodes?

I see this in the Pod spec:

                resources:
                  limits:
                    cpu: 500m
                    memory: 250Mi
                  requests:
                    cpu: 200m
                    memory: 100Mi
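
To confirm it is the container hitting that 250Mi limit (rather than the node itself running out of memory), something like this should show the pod's live usage, assuming the cluster metrics API is available:

$ oc adm top pod gpu-operator-566644fc46-2znxj -n openshift-operators --containers
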
smithbk commented 2 years ago

@kpouget @shivamerla Here is the pod description

$ oc describe pod gpu-operator-566644fc46-2znxj
Name:                 gpu-operator-566644fc46-2znxj
Namespace:            openshift-operators
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ip-10-111-61-177.ec2.internal/10.111.61.177
Start Time:           Tue, 05 Apr 2022 10:08:24 -0400
Labels:               app.kubernetes.io/component=gpu-operator
                      name=gpu-operator
                      pod-template-hash=566644fc46
Annotations:          alm-examples:
                        [
                          {
                            "apiVersion": "nvidia.com/v1",
                            "kind": "ClusterPolicy",
                            "metadata": {
                              "name": "gpu-cluster-policy"
                            },
                            "spec": {
                              "dcgmExporter": {
                                "affinity": {},
                                "image": "dcgm-exporter",
                                "imagePullSecrets": [],
                                "nodeSelector": {
                                  "nvidia.com/gpu.deploy.dcgm-exporter": "true"
                                },
                                "podSecurityContext": {},
                                "repository": "nvcr.io/nvidia/k8s",
                                "resources": {},
                                "securityContext": {},
                                "tolerations": [],
                                "priorityClassName": "system-node-critical",
                                "version": "sha256:8af02463a8b60b21202d0bf69bc1ee0bb12f684fa367f903d138df6cacc2d0ac"
                              },
                              "devicePlugin": {
                                "affinity": {},
                                "image": "k8s-device-plugin",
                                "imagePullSecrets": [],
                                "args": [],
                                "env": [
                                  {
                                    "name": "PASS_DEVICE_SPECS",
                                    "value": "true"
                                  },
                                  {
                                    "name": "FAIL_ON_INIT_ERROR",
                                    "value": "true"
                                  },
                                  {
                                    "name": "DEVICE_LIST_STRATEGY",
                                    "value": "envvar"
                                  },
                                  {
                                    "name": "DEVICE_ID_STRATEGY",
                                    "value": "uuid"
                                  },
                                  {
                                    "name": "NVIDIA_VISIBLE_DEVICES",
                                    "value": "all"
                                  },
                                  {
                                    "name": "NVIDIA_DRIVER_CAPABILITIES",
                                    "value": "all"
                                  }
                                ],
                                "nodeSelector": {
                                  "nvidia.com/gpu.deploy.device-plugin": "true"
                                },
                                "podSecurityContext": {},
                                "repository": "nvcr.io/nvidia",
                                "resources": {},
                                "securityContext": {},
                                "tolerations": [],
                                "priorityClassName": "system-node-critical",
                                "version": "sha256:85def0197f388e5e336b1ab0dbec350816c40108a58af946baa1315f4c96ee05"
                              },
                              "driver": {
                                "enabled": true,
                                "affinity": {},
                                "image": "driver",
                                "imagePullSecrets": [],
                                "nodeSelector": {
                                  "nvidia.com/gpu.deploy.driver": "true"
                                },
                                "podSecurityContext": {},
                                "repository": "nvcr.io/nvidia",
                                "resources": {},
                                "securityContext": {},
                                "tolerations": [],
                                "priorityClassName": "system-node-critical",
                                "repoConfig": {
                                  "configMapName": "",
                                  "destinationDir": ""
                                },
                                "licensingConfig": {
                                  "configMapName": ""
                                },
                                "version": "sha256:09ba3eca64a80fab010a9fcd647a2675260272a8c3eb515dfed6dc38a2d31ead"
                              },
                              "gfd": {
                                "affinity": {},
                                "image": "gpu-feature-discovery",
                                "imagePullSecrets": [],
                                "env": [
                                  {
                                    "name": "GFD_SLEEP_INTERVAL",
                                    "value": "60s"
                                  },
                                  {
                                    "name": "FAIL_ON_INIT_ERROR",
                                    "value": "true"
                                  }
                                ],
                                "nodeSelector": {
                                  "nvidia.com/gpu.deploy.gpu-feature-discovery": "true"
                                },
                                "podSecurityContext": {},
                                "repository": "nvcr.io/nvidia",
                                "resources": {},
                                "securityContext": {},
                                "tolerations": [],
                                "priorityClassName": "system-node-critical",
                                "version": "sha256:bfc39d23568458dfd50c0c5323b6d42bdcd038c420fb2a2becd513a3ed3be27f"
                              },
                              "migManager": {
                                "enabled": true,
                                "affinity": {},
                                "image": "k8s-mig-manager",
                                "imagePullSecrets": [],
                                "env": [
                                  {
                                    "name": "WITH_REBOOT",
                                    "value": "false"
                                  }
                                ],
                                "nodeSelector": {
                                  "nvidia.com/gpu.deploy.mig-manager": "true"
                                },
                                "podSecurityContext": {},
                                "repository": "nvcr.io/nvidia/cloud-native",
                                "resources": {},
                                "securityContext": {},
                                "tolerations": [],
                                "priorityClassName": "system-node-critical",
                                "version": "sha256:495ed3b42e0541590c537ab1b33bda772aad530d3ef6a4f9384d3741a59e2bf8"
                              },
                              "operator": {
                                "defaultRuntime": "crio",
                                "deployGFD": true,
                                "initContainer": {
                                  "image": "cuda",
                                  "repository": "nvcr.io/nvidia",
                                  "version": "sha256:15674e5c45c97994bc92387bad03a0d52d7c1e983709c471c4fecc8e806dbdce",
                                  "imagePullSecrets": []
                                }
                              },
                              "mig": {
                                "strategy": "single"
                              },
                              "toolkit": {
                                "enabled": true,
                                "affinity": {},
                                "image": "container-toolkit",
                                "imagePullSecrets": [],
                                "nodeSelector": {
                                  "nvidia.com/gpu.deploy.container-toolkit": "true"
                                },
                                "podSecurityContext": {},
                                "repository": "nvcr.io/nvidia/k8s",
                                "resources": {},
                                "securityContext": {},
                                "tolerations": [],
                                "priorityClassName": "system-node-critical",
                                "version": "sha256:ffa284f1f359d70f0e1d6d8e7752d7c92ef7445b0d74965a8682775de37febf8"
                              },
                              "validator": {
                                "affinity": {},
                                "image": "gpu-operator-validator",
                                "imagePullSecrets": [],
                                "nodeSelector": {
                                  "nvidia.com/gpu.deploy.operator-validator": "true"
                                },
                                "podSecurityContext": {},
                                "repository": "nvcr.io/nvidia/cloud-native",
                                "resources": {},
                                "securityContext": {},
                                "tolerations": [],
                                "priorityClassName": "system-node-critical",
                                "version": "sha256:aa1f7bd526ae132c46f3ebe6ecfabe675889e240776ccc2155e31e0c48cc659e",
                                "env": [
                                  {
                                    "name": "WITH_WORKLOAD",
                                    "value": "true"
                                  }
                                ]
                              }
                            }
                          }
                        ]
                      capabilities: Basic Install
                      categories: AI/Machine Learning, OpenShift Optional
                      certified: true
                      cni.projectcalico.org/containerID: aa562b5de68796f144d43e698477d85a889705ce4db6df7dff95e20f82194464
                      cni.projectcalico.org/podIP: 172.27.15.52/32
                      cni.projectcalico.org/podIPs: 172.27.15.52/32
                      containerImage: nvcr.io/nvidia/gpu-operator:v1.7.1
                      createdAt: Wed Jun 16 06:51:51 PDT 2021
                      description: Automate the management and monitoring of NVIDIA GPUs.
                      k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "",
                            "interface": "eth0",
                            "ips": [
                                "172.27.15.52"
                            ],
                            "mac": "86:f1:9f:e8:4f:fe",
                            "default": true,
                            "dns": {}
                        }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "",
                            "interface": "eth0",
                            "ips": [
                                "172.27.15.52"
                            ],
                            "mac": "86:f1:9f:e8:4f:fe",
                            "default": true,
                            "dns": {}
                        }]
                      olm.operatorGroup: global-operators
                      olm.operatorNamespace: openshift-operators
                      olm.targetNamespaces: 
                      openshift.io/scc: hostmount-anyuid
                      operatorframework.io/properties:
                        {"properties":[{"type":"olm.gvk","value":{"group":"nvidia.com","kind":"ClusterPolicy","version":"v1"}},{"type":"olm.package","value":{"pac...
                      operators.openshift.io/infrastructure-features: ["Disconnected"]
                      operators.operatorframework.io/builder: operator-sdk-v1.4.0
                      operators.operatorframework.io/project_layout: go.kubebuilder.io/v3
                      provider: NVIDIA
                      repository: http://github.com/NVIDIA/gpu-operator
                      support: NVIDIA
Status:               Running
IP:                   172.27.15.52
IPs:
  IP:           172.27.15.52
Controlled By:  ReplicaSet/gpu-operator-566644fc46
Containers:
  gpu-operator:
    Container ID:  cri-o://8f8e24b1c06329b3a19a218408c2ed4787c2d19b7babde6d2d5aceace96324b3
    Image:         nvcr.io/nvidia/gpu-operator@sha256:3a812cf113f416baca9262fa8423f36141f35696eb6e7a51a7abb40f5ccd5f8c
    Image ID:      nvcr.io/nvidia/gpu-operator@sha256:3a812cf113f416baca9262fa8423f36141f35696eb6e7a51a7abb40f5ccd5f8c
    Port:          <none>
    Host Port:     <none>
    Command:
      gpu-operator
    Args:
      --leader-elect
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 06 Apr 2022 08:06:16 -0400
      Finished:     Wed, 06 Apr 2022 08:06:48 -0400
    Ready:          False
    Restart Count:  239
    Limits:
      cpu:     500m
      memory:  250Mi
    Requests:
      cpu:      200m
      memory:   100Mi
    Liveness:   http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:  http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
      HTTP_PROXY:   http://proxy-app.discoverfinancial.com:8080
      HTTPS_PROXY:  http://proxy-app.discoverfinancial.com:8080
      NO_PROXY:     .artifactory.prdops3-app.ocp.aws.discoverfinancial.com,.aws.discoverfinancial.com,.cluster.local,.discoverfinancial.com,.ec2.internal,.na.discoverfinancial.com,.ocp-dev.artifactory.prdops3-app.ocp.aws.discoverfinancial.com,.ocp.aws.discoverfinancial.com,.ocpdev.us-east-1.ac.discoverfinancial.com,.prdops3-app.ocp.aws.discoverfinancial.com,.rw.discoverfinancial.com,.svc,10.0.0.0/8,10.111.0.0/16,127.0.0.1,169.254.169.254,172.23.0.0/16,172.24.0.0/14,api-int.aws-useast1-apps-lab-1.ocpdev.us-east-1.ac.discoverfinancial.com,artifactory.prdops3-app.ocp.aws.discoverfinancial.com,aws.discoverfinancial.com,discoverfinancial.com,ec2.internal,etcd-0.aws-useast1-apps-lab-1.ocpdev.us-east-1.ac.discoverfinancial.com,etcd-1.aws-useast1-apps-lab-1.ocpdev.us-east-1.ac.discoverfinancial.com,etcd-2.aws-useast1-apps-lab-1.ocpdev.us-east-1.ac.discoverfinancial.com,localhost,na.discoverfinancial.com,ocp-dev.artifactory.prdops3-app.ocp.aws.discoverfinancial.com,ocp.aws.discoverfinancial.com,ocpdev.us-east-1.ac.discoverfinancial.com,prdops3-app.ocp.aws.discoverfinancial.com,rw.discoverfinancial.com
    Mounts:
      /host-etc/os-release from host-os-release (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from gpu-operator-token-2w6p4 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:  
  gpu-operator-token-2w6p4:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  gpu-operator-token-2w6p4
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason          Age                     From     Message
  ----     ------          ----                    ----     -------
  Normal   AddedInterface  132m                    multus   Add eth0 [172.27.15.27/32]
  Warning  Unhealthy       120m                    kubelet  Readiness probe failed: Get "http://172.27.15.27:8081/readyz": dial tcp 172.27.15.27:8081: connect: connection refused
  Normal   Pulled          70m (x227 over 21h)     kubelet  Container image "nvcr.io/nvidia/gpu-operator@sha256:3a812cf113f416baca9262fa8423f36141f35696eb6e7a51a7abb40f5ccd5f8c" already present on machine
  Normal   AddedInterface  69m                     multus   Add eth0 [172.27.15.52/32]
  Warning  Unhealthy       30m                     kubelet  Liveness probe failed: Get "http://172.27.15.52:8081/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  BackOff         5m18s (x5622 over 21h)  kubelet  Back-off restarting failed container
kpouget commented 2 years ago

still this,

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled

but I expected to see more in the Event logs ... :/

can you check whether your node ip-10-111-61-177.ec2.internal/10.111.61.177 isn't running out of memory?
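
for example, something along these lines (assuming the metrics API is reachable):

$ oc adm top node ip-10-111-61-177.ec2.internal
$ oc describe node ip-10-111-61-177.ec2.internal | grep -A 10 "Allocated resources"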

smithbk commented 2 years ago

@kpouget Looks OK to me. If there is some other way of checking, let me know.

$ oc describe node ip-10-111-61-177.ec2.internal
Name:               ip-10-111-61-177.ec2.internal
Roles:              infra,worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m5a.2xlarge
                    beta.kubernetes.io/os=linux
                    contact=OCPEngineers
                    cost_center=458690
                    enterprise.discover.com/cluster-id=aws-useast1-apps-lab-r2jkd
                    enterprise.discover.com/cluster-name=aws-useast1-apps-lab-1
                    enterprise.discover.com/cost_center=458690
                    enterprise.discover.com/data-classification=na
                    enterprise.discover.com/environment=lab
                    enterprise.discover.com/freedom=false
                    enterprise.discover.com/gdpr=false
                    enterprise.discover.com/openshift=true
                    enterprise.discover.com/openshift-role=worker
                    enterprise.discover.com/pci=false
                    enterprise.discover.com/product=common
                    enterprise.discover.com/public=false
                    enterprise.discover.com/support-assignment-group=OCPEngineering
                    failure-domain.beta.kubernetes.io/region=us-east-1
                    failure-domain.beta.kubernetes.io/zone=us-east-1d
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.SHA=true
                    feature.node.kubernetes.io/cpu-cpuid.SSE4A=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=true
                    feature.node.kubernetes.io/custom-rdma.available=true
                    feature.node.kubernetes.io/kernel-selinux.enabled=true
                    feature.node.kubernetes.io/kernel-version.full=4.18.0-193.47.1.el8_2.x86_64
                    feature.node.kubernetes.io/kernel-version.major=4
                    feature.node.kubernetes.io/kernel-version.minor=18
                    feature.node.kubernetes.io/kernel-version.revision=0
                    feature.node.kubernetes.io/pci-1d0f.present=true
                    feature.node.kubernetes.io/storage-nonrotationaldisk=true
                    feature.node.kubernetes.io/system-os_release.ID=rhcos
                    feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION=4.6
                    feature.node.kubernetes.io/system-os_release.RHEL_VERSION=8.2
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=4.6
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=4
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=6
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-111-61-177
                    kubernetes.io/os=linux
                    machine.openshift.io/cluster-api-cluster=aws-useast1-apps-lab-1
                    machine.openshift.io/cluster-api-cluster-name=aws-useast1-apps-lab-1
                    machine.openshift.io/cluster-api-machine-role=worker
                    machine.openshift.io/cluster-api-machineset=infra-1d
                    machine.openshift.io/cluster-api-machineset-group=infra
                    machine.openshift.io/cluster-api-machineset-ha=1d
                    node-role.kubernetes.io/infra=
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m5a.2xlarge
                    node.openshift.io/os_id=rhcos
                    route-reflector=true
                    topology.ebs.csi.aws.com/zone=us-east-1d
                    topology.kubernetes.io/region=us-east-1
                    topology.kubernetes.io/zone=us-east-1d
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0fc5da74c55fd897c"}
                    machine.openshift.io/machine: openshift-machine-api/infra-1d-rvc9x
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-3a01af8a0304107341810791e3b3ad99
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-3a01af8a0304107341810791e3b3ad99
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    nfd.node.kubernetes.io/extended-resources: 
                    nfd.node.kubernetes.io/feature-labels:
                      cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.FMA3,cpu-cpuid.SHA,cpu-cpuid.SSE4A,cpu-hardware_multithreading,custom...
                    nfd.node.kubernetes.io/worker.version: 1.15
                    projectcalico.org/IPv4Address: 10.111.61.177/20
                    projectcalico.org/IPv4IPIPTunnelAddr: 172.27.15.0
                    projectcalico.org/RouteReflectorClusterID: 1.0.0.1
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 24 Jan 2022 17:04:22 -0500
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-111-61-177.ec2.internal
  AcquireTime:     <unset>
  RenewTime:       Wed, 06 Apr 2022 11:57:41 -0400
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Thu, 17 Feb 2022 15:17:07 -0500   Thu, 17 Feb 2022 15:17:07 -0500   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Wed, 06 Apr 2022 11:54:51 -0400   Mon, 24 Jan 2022 17:04:22 -0500   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Wed, 06 Apr 2022 11:54:51 -0400   Mon, 24 Jan 2022 17:04:22 -0500   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Wed, 06 Apr 2022 11:54:51 -0400   Mon, 24 Jan 2022 17:04:22 -0500   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Wed, 06 Apr 2022 11:54:51 -0400   Mon, 24 Jan 2022 17:05:32 -0500   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.111.61.177
  Hostname:     ip-10-111-61-177.ec2.internal
  InternalDNS:  ip-10-111-61-177.ec2.internal
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         8
  ephemeral-storage:           125277164Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      32288272Ki
  pods:                        250
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         7500m
  ephemeral-storage:           120795883220
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      31034896Ki
  pods:                        250
System Info:
  Machine ID:                             ec29f9293380ea1eceab3523cbbd2b2a
  System UUID:                            ec29f929-3380-ea1e-ceab-3523cbbd2b2a
  Boot ID:                                89e3a344-ba71-4882-8b39-97738890d719
  Kernel Version:                         4.18.0-193.47.1.el8_2.x86_64
  OS Image:                               Red Hat Enterprise Linux CoreOS 46.82.202104170019-0 (Ootpa)
  Operating System:                       linux
  Architecture:                           amd64
  Container Runtime Version:              cri-o://1.19.1-11.rhaos4.6.git050df4c.el8
  Kubelet Version:                        v1.19.0+a5a0987
  Kube-Proxy Version:                     v1.19.0+a5a0987
ProviderID:                               aws:///us-east-1d/i-0fc5da74c55fd897c
Non-terminated Pods:                      (33 in total)
  Namespace                               Name                                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                               ----                                                          ------------  ----------  ---------------  -------------  ---
  calico-system                           calico-node-wfr7j                                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         47d
  eng-attempt48                           eventbus-default-stan-0                                       200m (2%)     400m (5%)   262144k (0%)     2Gi (6%)       35h
  gremlin                                 gremlin-pgxb4                                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         71d
  instana-agent                           instana-agent-x4snr                                           600m (8%)     2 (26%)     2112Mi (6%)      2Gi (6%)       20m
  kube-system                             istio-cni-node-vskdr                                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         71d
  openshift-cluster-csi-drivers           aws-ebs-csi-driver-node-2z9sj                                 30m (0%)      0 (0%)      150Mi (0%)       0 (0%)         71d
  openshift-cluster-node-tuning-operator  tuned-49mnk                                                   10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         71d
  openshift-compliance                    dfs-ocp4-cis-node-worker-ip-10-111-61-177.ec2.internal-pod    20m (0%)      200m (2%)   70Mi (0%)        600Mi (1%)     19d
  openshift-compliance                    ocp4-cis-node-worker-ip-10-111-61-177.ec2.internal-pod        20m (0%)      200m (2%)   70Mi (0%)        600Mi (1%)     19d
  openshift-dns                           dns-default-b4q5z                                             65m (0%)      0 (0%)      110Mi (0%)       512Mi (1%)     19d
  openshift-image-registry                node-ca-rfx9v                                                 10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         71d
  openshift-ingress                       router-default-55c779749d-5g9l5                               200m (2%)     0 (0%)      512Mi (1%)       0 (0%)         71d
  openshift-kube-proxy                    openshift-kube-proxy-8lr6h                                    100m (1%)     0 (0%)      200Mi (0%)       0 (0%)         19d
  openshift-machine-config-operator       machine-config-daemon-7kmlj                                   40m (0%)      0 (0%)      100Mi (0%)       0 (0%)         71d
  openshift-marketplace                   opencloud-operators-p8vss                                     10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         37h
  openshift-monitoring                    node-exporter-ttdb5                                           9m (0%)       0 (0%)      210Mi (0%)       0 (0%)         71d
  openshift-monitoring                    prometheus-adapter-6b47cfbf98-rvgnt                           1m (0%)       0 (0%)      25Mi (0%)        0 (0%)         2d16h
  openshift-monitoring                    prometheus-operator-68d689dccc-t6rzm                          6m (0%)       0 (0%)      100Mi (0%)       0 (0%)         3d16h
  openshift-multus                        multus-594h4                                                  10m (0%)      0 (0%)      150Mi (0%)       0 (0%)         19d
  openshift-multus                        network-metrics-daemon-5ngdr                                  20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         19d
  openshift-nfd                           nfd-worker-8r252                                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         14d
  openshift-node                          splunk-rjhk7                                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         71d
  openshift-operators                     gpu-operator-566644fc46-2znxj                                 200m (2%)     500m (6%)   100Mi (0%)       250Mi (0%)     25h
  openshift-operators                     nfd-worker-qcf7l                                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         39d
  postgresql-operator                     postgresql-operator-79f8644dd9-krcfb                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         45h
  sample-project                          mongodb-1-2n98n                                               0 (0%)        0 (0%)      512Mi (1%)       512Mi (1%)     55d
  skunkworks                              backstage-67fc9f9b45-cx4x8                                    350m (4%)     700m (9%)   576Mi (1%)       1152Mi (3%)    42h
  sysdig-agent                            sysdig-agent-fw94l                                            1 (13%)       2 (26%)     512Mi (1%)       1536Mi (5%)    37s
  sysdig-agent                            sysdig-image-analyzer-8xwvw                                   250m (3%)     500m (6%)   512Mi (1%)       1536Mi (5%)    38s
  sysdig-agent                            sysdig-image-analyzer-xpt4q                                   250m (3%)     500m (6%)   512Mi (1%)       1536Mi (5%)    14h
  tigera-compliance                       compliance-benchmarker-br5xl                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         47d
  tigera-fluentd                          fluentd-node-qzpxf                                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         47d
  vault-secrets-operator                  vault-secrets-operator-controller-7598f4bd5f-4cfdc            2 (26%)       2 (26%)     2Gi (6%)         2Gi (6%)       26s
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests          Limits
  --------                    --------          ------
  cpu                         5401m (72%)       9 (120%)
  memory                      9501147136 (29%)  14378Mi (47%)
  ephemeral-storage           0 (0%)            0 (0%)
  hugepages-1Gi               0 (0%)            0 (0%)
  hugepages-2Mi               0 (0%)            0 (0%)
  attachable-volumes-aws-ebs  0                 0
Events:                       <none>
smithbk commented 2 years ago

@kpouget Any other ideas of what to check, or someone else who would know? Thanks

shivamerla commented 2 years ago

@smithbk @kpouget Yes, I do remember this happening, where the GPU Operator's memory usage momentarily spikes on OCP. We have yet to identify the cause. We can edit the CSV/Operator Deployment spec to allow the following limits:

                resources:
                  limits:
                    cpu: 500m
                    memory: 1Gi
                  requests:
                    cpu: 200m
                    memory: 200Mi
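
Note that OLM reconciles the operator Deployment from the CSV, so the change needs to go into the ClusterServiceVersion rather than the Deployment directly, or it may be reverted. As a rough sketch (the CSV name below is only a guess; check the actual name with the first command):

$ oc get csv -n openshift-operators
$ oc edit csv gpu-operator-certified.v1.7.1 -n openshift-operators
# then raise the memory limit under
# spec.install.spec.deployments[0].spec.template.spec.containers[0].resources.limits
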
smithbk commented 2 years ago

@kpouget The pod is running now but the cluster policy status is not progressing. Here is what I'm seeing now.

$ oc get pod -n openshift-operators | grep gpu-operator
gpu-operator-889b67578-r57p5                   1/1     Running       0          18m

Note the "ClusterPolicy step wasn't ready" messages below.

$ oc logs gpu-operator-889b67578-r57p5 -n openshift-operators --tail 50
2022-04-07T01:11:23.642Z    INFO    controllers.ClusterPolicy   Found Resource  {"ClusterRoleBinding": "nvidia-operator-validator", "Namespace": ""}
2022-04-07T01:11:23.654Z    INFO    controllers.ClusterPolicy   Found Resource  {"SecurityContextConstraints": "nvidia-operator-validator", "Namespace": "default"}
2022-04-07T01:11:23.664Z    INFO    controllers.ClusterPolicy   Found Resource  {"DaemonSet": "nvidia-operator-validator", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.664Z    INFO    controllers.ClusterPolicy   DEBUG: DaemonSet    {"LabelSelector": "app=nvidia-operator-validator"}
2022-04-07T01:11:23.664Z    INFO    controllers.ClusterPolicy   DEBUG: DaemonSet    {"NumberOfDaemonSets": 1}
2022-04-07T01:11:23.664Z    INFO    controllers.ClusterPolicy   DEBUG: DaemonSet    {"NumberUnavailable": 4}
2022-04-07T01:11:23.664Z    INFO    controllers.ClusterPolicy   ClusterPolicy step wasn't ready {"State:": "notReady"}
2022-04-07T01:11:23.672Z    INFO    controllers.ClusterPolicy   Found Resource  {"ServiceAccount": "nvidia-device-plugin", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.680Z    INFO    controllers.ClusterPolicy   Found Resource  {"Role": "nvidia-device-plugin", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.689Z    INFO    controllers.ClusterPolicy   Found Resource  {"RoleBinding": "nvidia-device-plugin", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.703Z    INFO    controllers.ClusterPolicy   Found Resource  {"DaemonSet": "nvidia-device-plugin-daemonset", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.703Z    INFO    controllers.ClusterPolicy   DEBUG: DaemonSet    {"LabelSelector": "app=nvidia-device-plugin-daemonset"}
2022-04-07T01:11:23.703Z    INFO    controllers.ClusterPolicy   DEBUG: DaemonSet    {"NumberOfDaemonSets": 1}
2022-04-07T01:11:23.703Z    INFO    controllers.ClusterPolicy   DEBUG: DaemonSet    {"NumberUnavailable": 4}
2022-04-07T01:11:23.703Z    INFO    controllers.ClusterPolicy   ClusterPolicy step wasn't ready {"State:": "notReady"}
2022-04-07T01:11:23.712Z    INFO    controllers.ClusterPolicy   Found Resource  {"ServiceAccount": "nvidia-dcgm-exporter", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.724Z    INFO    controllers.ClusterPolicy   Found Resource  {"Role": "prometheus-k8s", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.737Z    INFO    controllers.ClusterPolicy   Found Resource  {"RoleBinding": "prometheus-k8s", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.744Z    INFO    controllers.ClusterPolicy   Found Resource  {"Role": "prometheus-k8s", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.756Z    INFO    controllers.ClusterPolicy   Found Resource  {"RoleBinding": "prometheus-k8s", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.775Z    INFO    controllers.ClusterPolicy   Found Resource  {"Service": "nvidia-dcgm-exporter", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.784Z    INFO    controllers.ClusterPolicy   Found Resource  {"ServiceMonitor": "nvidia-dcgm-exporter", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.793Z    INFO    controllers.ClusterPolicy   Found Resource  {"ConfigMap": "nvidia-dcgm-exporter", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.804Z    INFO    controllers.ClusterPolicy   Found Resource  {"SecurityContextConstraints": "nvidia-dcgm-exporter", "Namespace": "default"}
2022-04-07T01:11:23.804Z    INFO    controllers.ClusterPolicy   4.18.0-193.47.1.el8_2.x86_64    {"Request.Namespace": "default", "Request.Name": "Node"}
2022-04-07T01:11:23.814Z    INFO    controllers.ClusterPolicy   Found Resource  {"DaemonSet": "nvidia-dcgm-exporter", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.814Z    INFO    controllers.ClusterPolicy   DEBUG: DaemonSet    {"LabelSelector": "app=nvidia-dcgm-exporter"}
2022-04-07T01:11:23.814Z    INFO    controllers.ClusterPolicy   DEBUG: DaemonSet    {"NumberOfDaemonSets": 1}
2022-04-07T01:11:23.814Z    INFO    controllers.ClusterPolicy   DEBUG: DaemonSet    {"NumberUnavailable": 4}
2022-04-07T01:11:23.814Z    INFO    controllers.ClusterPolicy   ClusterPolicy step wasn't ready {"State:": "notReady"}
2022-04-07T01:11:23.821Z    INFO    controllers.ClusterPolicy   Found Resource  {"ServiceAccount": "nvidia-gpu-feature-discovery", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.828Z    INFO    controllers.ClusterPolicy   Found Resource  {"Role": "nvidia-gpu-feature-discovery", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.839Z    INFO    controllers.ClusterPolicy   Found Resource  {"RoleBinding": "nvidia-gpu-feature-discovery", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.850Z    INFO    controllers.ClusterPolicy   Found Resource  {"SecurityContextConstraints": "nvidia-gpu-feature-discovery", "Namespace": "default"}
2022-04-07T01:11:23.858Z    INFO    controllers.ClusterPolicy   Found Resource  {"DaemonSet": "gpu-feature-discovery", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.858Z    INFO    controllers.ClusterPolicy   DEBUG: DaemonSet    {"LabelSelector": "app=gpu-feature-discovery"}
2022-04-07T01:11:23.858Z    INFO    controllers.ClusterPolicy   DEBUG: DaemonSet    {"NumberOfDaemonSets": 1}
2022-04-07T01:11:23.858Z    INFO    controllers.ClusterPolicy   DEBUG: DaemonSet    {"NumberUnavailable": 4}
2022-04-07T01:11:23.858Z    INFO    controllers.ClusterPolicy   ClusterPolicy step wasn't ready {"State:": "notReady"}
2022-04-07T01:11:23.866Z    INFO    controllers.ClusterPolicy   Found Resource  {"ServiceAccount": "nvidia-mig-manager", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.873Z    INFO    controllers.ClusterPolicy   Found Resource  {"Role": "nvidia-mig-manager", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.881Z    INFO    controllers.ClusterPolicy   Found Resource  {"ClusterRole": "nvidia-mig-manager", "Namespace": ""}
2022-04-07T01:11:23.891Z    INFO    controllers.ClusterPolicy   Found Resource  {"RoleBinding": "nvidia-mig-manager", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.909Z    INFO    controllers.ClusterPolicy   Found Resource  {"ClusterRoleBinding": "nvidia-mig-manager", "Namespace": ""}
2022-04-07T01:11:23.918Z    INFO    controllers.ClusterPolicy   Found Resource  {"ConfigMap": "mig-parted-config", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.942Z    INFO    controllers.ClusterPolicy   Found Resource  {"SecurityContextConstraints": "nvidia-driver", "Namespace": "default"}
2022-04-07T01:11:23.952Z    INFO    controllers.ClusterPolicy   Found Resource  {"DaemonSet": "nvidia-mig-manager", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.952Z    INFO    controllers.ClusterPolicy   DEBUG: DaemonSet    {"LabelSelector": "app=nvidia-mig-manager"}
2022-04-07T01:11:23.952Z    INFO    controllers.ClusterPolicy   DEBUG: DaemonSet    {"NumberOfDaemonSets": 1}
2022-04-07T01:11:23.952Z    INFO    controllers.ClusterPolicy   DEBUG: DaemonSet    {"NumberUnavailable": 0}

The pods in the gpu-operator-resources namespace are failing.

$ oc get pod -n gpu-operator-resources
NAME                                       READY   STATUS             RESTARTS   AGE
gpu-feature-discovery-2k9fw                0/1     Init:0/1           0          15m
gpu-feature-discovery-7dwvv                0/1     Init:0/1           0          15m
gpu-feature-discovery-tgl5k                0/1     Init:0/1           0          15m
gpu-feature-discovery-vgwlp                0/1     Init:0/1           0          15m
nvidia-container-toolkit-daemonset-c5xck   0/1     Init:0/1           0          15m
nvidia-container-toolkit-daemonset-cc59r   0/1     Init:0/1           0          15m
nvidia-container-toolkit-daemonset-fppnr   0/1     Init:0/1           0          15m
nvidia-container-toolkit-daemonset-jc64m   0/1     Init:0/1           0          15m
nvidia-dcgm-exporter-gb7c4                 0/1     Init:0/2           0          15m
nvidia-dcgm-exporter-hm66s                 0/1     Init:0/2           0          15m
nvidia-dcgm-exporter-mqzzk                 0/1     Init:0/2           0          15m
nvidia-dcgm-exporter-msz6r                 0/1     Init:0/2           0          15m
nvidia-device-plugin-daemonset-cj6bs       0/1     Init:0/1           0          15m
nvidia-device-plugin-daemonset-kn6x6       0/1     Init:0/1           0          15m
nvidia-device-plugin-daemonset-lktnb       0/1     Init:0/1           0          15m
nvidia-device-plugin-daemonset-lv6hx       0/1     Init:0/1           0          15m
nvidia-driver-daemonset-f8g6d              0/1     CrashLoopBackOff   7          15m
nvidia-driver-daemonset-hjvgl              0/1     CrashLoopBackOff   7          15m
nvidia-driver-daemonset-vb85p              0/1     CrashLoopBackOff   7          15m
nvidia-driver-daemonset-xj4tk              0/1     CrashLoopBackOff   7          15m
nvidia-operator-validator-pzp8s            0/1     Init:0/4           0          15m
nvidia-operator-validator-rd6cq            0/1     Init:0/4           0          15m
nvidia-operator-validator-t7n5z            0/1     Init:0/4           0          15m
nvidia-operator-validator-wzgp9            0/1     Init:0/4           0          15m
$ oc logs nvidia-driver-daemonset-f8g6d -n gpu-operator-resources
+ set -eu
+ RUN_DIR=/run/nvidia
+ PID_FILE=/run/nvidia/nvidia-driver.pid
+ DRIVER_VERSION=460.73.01
+ KERNEL_UPDATE_HOOK=/run/kernel/postinst.d/update-nvidia-driver
+ NUM_VGPU_DEVICES=0
+ RESOLVE_OCP_VERSION=true
+ '[' 1 -eq 0 ']'
+ command=init
+ shift
+ case "${command}" in
++ getopt -l accept-license -o a --
+ options=' --'
+ '[' 0 -ne 0 ']'
+ eval set -- ' --'
++ set -- --
+ ACCEPT_LICENSE=
++ uname -r
+ KERNEL_VERSION=4.18.0-193.47.1.el8_2.x86_64
+ PRIVATE_KEY=
+ PACKAGE_TAG=
+ for opt in ${options}
+ case "$opt" in
+ shift
+ break
+ '[' 0 -ne 0 ']'
+ _resolve_rhel_version
+ '[' -f /host-etc/os-release ']'
+ echo 'Resolving RHEL version...'
Resolving RHEL version...
+ local version=
++ cat /host-etc/os-release
++ sed -e 's/^"//' -e 's/"$//'
++ awk -F= '{print $2}'
++ grep '^ID='
+ local id=rhcos
+ '[' rhcos = rhcos ']'
++ grep RHEL_VERSION
++ awk -F= '{print $2}'
++ sed -e 's/^"//' -e 's/"$//'
++ cat /host-etc/os-release
+ version=8.2
+ '[' -z 8.2 ']'
+ RHEL_VERSION=8.2
+ echo 'Proceeding with RHEL version 8.2'
Proceeding with RHEL version 8.2
+ return 0
+ _resolve_ocp_version
+ '[' true = true ']'
++ jq '.items[].status.desired.version'
++ sed -e 's/^"//' -e 's/"$//'
++ awk -F. '{printf("%d.%d\n", $1, $2)}'
++ kubectl get clusterversion -o json
Unable to connect to the server: Proxy Authentication Required
+ local version=
Resolving OpenShift version...
+ echo 'Resolving OpenShift version...'
+ '[' -z '' ']'
+ echo 'Could not resolve OpenShift version'
Could not resolve OpenShift version
+ return 1
+ exit 1

It seems that the root cause of this problem is the following, right?

++ kubectl get clusterversion -o json
Unable to connect to the server: Proxy Authentication Required

But this cluster is configured with a proxy.

$ oc get proxy
NAME      AGE
cluster   455d
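
One way to double-check what the driver pod actually sees (a sketch only; the pod name comes from the listing above, and the jsonpath is just a rough way to dump the injected env):

$ oc get proxy/cluster -o yaml
$ oc get pod nvidia-driver-daemonset-f8g6d -n gpu-operator-resources -o jsonpath='{.spec.containers[0].env}'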

Any ideas? Should I delete the cluster policy, delete the gpu-operator-resources namespace, and then recreate the cluster policy? I'm not sure whether creating the cluster policy recreates the gpu-operator-resources namespace.

smithbk commented 2 years ago

@kpouget It appears that kubectl does not recognize CIDR ranges in the no_proxy environment variable, so it tries to send the request through the proxy. Perhaps adding a test case with a proxy would be good.
Anyway, I added the appropriate IP to no_proxy (one way of doing this is sketched below) and it gets further, but it now fails with the log shown after the sketch:
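
For illustration only, a minimal sketch of one way to add such an address on OpenShift, assuming the driver pod inherits its proxy settings from the cluster-wide Proxy resource; the IP is a placeholder, and spec.noProxy must be given the full desired list because the merge patch replaces the existing value:

$ oc get proxy/cluster -o jsonpath='{.spec.noProxy}'   # note the current exclusions first
$ oc patch proxy/cluster --type=merge -p '{"spec":{"noProxy":"<existing-entries>,<apiserver-service-ip>"}}'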

========== NVIDIA Software Installer ==========

+ echo -e 'Starting installation of NVIDIA driver version 460.73.01 for Linux kernel version 4.18.0-193.47.1.el8_2.x86_64\n'
Starting installation of NVIDIA driver version 460.73.01 for Linux kernel version 4.18.0-193.47.1.el8_2.x86_64

+ exec
+ flock -n 3
+ echo 1946547
+ trap 'echo '\''Caught signal'\''; exit 1' HUP INT QUIT PIPE TERM
+ trap _shutdown EXIT
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
+ local nvidia_modeset_refs=0
+ echo 'Stopping NVIDIA persistence daemon...'
Stopping NVIDIA persistence daemon...
+ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
+ '[' -f /var/run/nvidia-gridd/nvidia-gridd.pid ']'
+ echo 'Unloading NVIDIA driver kernel modules...'
Unloading NVIDIA driver kernel modules...
+ '[' -f /sys/module/nvidia_modeset/refcnt ']'
+ '[' -f /sys/module/nvidia_uvm/refcnt ']'
+ '[' -f /sys/module/nvidia/refcnt ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ return 0
+ _unmount_rootfs
+ echo 'Unmounting NVIDIA driver rootfs...'
Unmounting NVIDIA driver rootfs...
+ findmnt -r -o TARGET
+ grep /run/nvidia/driver
+ _kernel_requires_package
+ local proc_mount_arg=
+ echo 'Checking NVIDIA driver packages...'
Checking NVIDIA driver packages...
+ [[ ! -d /usr/src/nvidia-460.73.01/kernel ]]
+ cd /usr/src/nvidia-460.73.01/kernel
+ proc_mount_arg='--proc-mount-point /lib/modules/4.18.0-193.47.1.el8_2.x86_64/proc'
++ ls -d -1 'precompiled/**'
+ return 0
+ _update_package_cache
+ '[' '' '!=' builtin ']'
+ echo 'Updating the package cache...'
Updating the package cache...
+ yum -q makecache
Error: Failed to download metadata for repo 'cuda': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
+ _shutdown
shivamerla commented 2 years ago

@smithbk It looks like access to the CUDA repository is blocked by the proxy. Can you check whether developer.download.nvidia.com is blocked?
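
One quick way to check from a node or debug pod inside the cluster (a sketch only; the repo path is illustrative and the HTTPS_PROXY value is assumed to be set in that shell):

curl -sSI https://developer.download.nvidia.com/compute/cuda/repos/
curl -sSI -x "${HTTPS_PROXY}" https://developer.download.nvidia.com/compute/cuda/repos/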

shivamerla commented 2 years ago

Also, to test whether all the repositories the driver needs can be pulled from within a container, you can run:

cat <<EOF > test-ca-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: trusted-ca
  labels:
    config.openshift.io/inject-trusted-cabundle: "true"
EOF
cat <<EOF > test-entitlements-proxy.yaml
apiVersion: v1
kind: Pod
metadata:
  name: entitlements-proxy
spec:
  containers:
    - name: cluster-entitled-build
      image: registry.access.redhat.com/ubi8:latest
      command: [ "/bin/sh", "-c", "dnf -d 5 search kernel-devel --showduplicates" ]
      env:
        - name: HTTP_PROXY
          value: ${HTTP_PROXY}
        - name: HTTPS_PROXY
          value: ${HTTPS_PROXY}
        - name: NO_PROXY
          value: ${NO_PROXY}
      volumeMounts:
        - name: trusted-ca
          mountPath: "/etc/pki/ca-trust/extracted/pem/"
          readOnly: true
  volumes:
    - name: trusted-ca
      configMap:
        name: trusted-ca
        items:
          - key: ca-bundle.crt
            path: tls-ca-bundle.pem
  restartPolicy: Never
EOF
oc apply -f test-ca-configmap.yaml  -f test-entitlements-proxy.yaml

You can get the HTTP_PROXY, HTTPS_PROXY, and NO_PROXY values from the cluster-wide proxy with oc describe proxy cluster.
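
For example, something like this could export those values before running the heredocs above (field paths assumed from the config.openshift.io/v1 Proxy resource; status.noProxy holds the fully computed exclusion list if that is preferred):

export HTTP_PROXY=$(oc get proxy/cluster -o jsonpath='{.spec.httpProxy}')
export HTTPS_PROXY=$(oc get proxy/cluster -o jsonpath='{.spec.httpsProxy}')
export NO_PROXY=$(oc get proxy/cluster -o jsonpath='{.spec.noProxy}')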

ctrought commented 2 years ago

@smithbk @kpouget Yes, I do remember this happening where the GPU operator's memory usage momentarily spikes on OCP. We have yet to identify the cause for that. We can edit the CSV/Operator Deployment spec to allow the following limits:

                resources:
                  limits:
                    cpu: 500m
                    memory: 1Gi
                  requests:
                    cpu: 200m
                    memory: 200Mi
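
For instance, on a plain Deployment this could be set with something like the following (the deployment name and namespace are assumptions; when the operator is installed through OLM, the same resources block generally needs to go into the CSV so OLM does not revert it):

oc -n openshift-operators set resources deployment/gpu-operator --limits=cpu=500m,memory=1Gi --requests=cpu=200m,memory=200Mi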

We hit memory issues on OCP after upgrading the NVIDIA operator recently. We were running under 1 Gi previously, and since then the operator pod hits over 2.5 Gi on startup. In the past, as seen with other operators, this was usually a symptom of the operator listing/watching objects at cluster scope: in large clusters with many objects, that means more data being returned to the operator. I don't know if that's the case for this operator, but I see it has clusterrolebindings. I did not dig into it further; we bumped the memory up again and it's working for now.
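
To confirm whether the spike really happens at startup, something like the following can help, assuming the metrics API is available and the operator pod keeps the gpu-operator name prefix:

watch -n 5 'oc adm top pod -n openshift-operators | grep gpu-operator'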

shivamerla commented 2 years ago

Thanks @ctrought, I will work with Red Hat to understand this behavior on OCP. We are not seeing this with K8s. The operator does fetch all node labels at startup, but that should not momentarily consume this much memory.