NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

About the behavior of GPU-Operator when updating EUS #454

Open kousui-dev opened 1 year ago

kousui-dev commented 1 year ago

1. Issue or feature description

Hello everyone,

A cluster running OCP 4.8.41 with GPU Operator 1.8.2 installed was upgraded to OCP 4.10.26 via an EUS-to-EUS update. The OCP cluster is composed as follows:

・Master nodes: 3
・Infra nodes: 3
・Worker nodes: 88

When updating from OCP 4.8.41 to OCP 4.9.45, the MachineConfig settings were configured so that the worker nodes would not reboot.

Reference: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.8/html-single/updating_clusters/index#updating-eus-to-eus-upgrade_eus-to-eus-upgrade
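
For context, that procedure keeps worker nodes from rebooting at the intermediate version by pausing the worker MachineConfigPool, roughly like this (a sketch following the referenced documentation, not the exact commands we ran):

oc patch mcp/worker --type merge --patch '{"spec":{"paused":true}}'
# ...run the intermediate and target upgrades...
oc patch mcp/worker --type merge --patch '{"spec":{"paused":false}}'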

Some time after running the upgrade command in that state, the Pods managed by the GPU Operator (such as nvidia-device-plugin-daemonset) were restarted.

I performed the same operation on another cluster, and no Pod restarts occurred there.

Which is the correct behavior? If no restart is the correct behavior, what caused the restarts here?

thanks.

shivamerla commented 1 year ago

@kousui-dev Can you get the history of the daemonset rollouts that happened as part of the EUS-to-EUS upgrade? If the driver daemonset was also restarted, please get its rollout history as well so we can compare the spec change. The driver pod restarts all other pods, as they depend on it.

kubectl rollout history --revision=0 ds/nvidia-driver-daemonset  -n nvidia-gpu-operator -o yaml
kubectl rollout history --revision=1 ds/nvidia-driver-daemonset  -n nvidia-gpu-operator -o yaml

kousui-dev commented 1 year ago

@shivamerla Since the cluster has already been updated to GPU Operator 1.11, I can no longer obtain that rollout history.

I did obtain the logs of the gpu-operator-resources containers that were running on the GPU servers at that time, so I will share them below. It appears that a restart occurred after some kind of signal was received. In addition, the containers were restarted on all 88 GPU servers, and the same log can be seen on every one of them.

Does anything come to mind that could explain this?

kubernetes.container_name   message   
nvidia-driver-ctr   Caught signal
nvidia-driver-ctr   ++ echo 'Caught signal'
nvidia-driver-ctr   ++ _shutdown
nvidia-driver-ctr   ++ local rmmod_args
nvidia-driver-ctr   ++ local nvidia_deps=0
nvidia-driver-ctr   ++ local nvidia_refs=0
nvidia-driver-ctr   Stopping NVIDIA persistence daemon...
nvidia-driver-ctr   ++ rmmod_args=()
nvidia-driver-ctr   ++ local nvidia_uvm_refs=0
nvidia-driver-ctr   ++ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
nvidia-driver-ctr   ++ _unload_driver
nvidia-driver-ctr   ++ local nvidia_modeset_refs=0
nvidia-driver-ctr   ++ echo 'Stopping NVIDIA persistence daemon...'
nvidia-driver-ctr   ++ local pid=28102
nvidia-driver-ctr   +++ seq 1 50
nvidia-driver-ctr   ++ kill -SIGTERM 28102
nvidia-driver-ctr   ++ kill -0 28102
nvidia-driver-ctr   ++ sleep 0.1
nvidia-driver-ctr   ++ for i in $(seq 1 50)
nvidia-driver-ctr   ++ kill -0 28102
nvidia-driver-ctr   ++ '[' 2 -eq 50 ']'
nvidia-driver-ctr   ++ echo 'Unloading NVIDIA driver kernel modules...'
nvidia-driver-ctr   ++ nvidia_modeset_refs=0
nvidia-driver-ctr   ++ '[' -f /var/run/nvidia-gridd/nvidia-gridd.pid ']'
nvidia-driver-ctr   Unloading NVIDIA driver kernel modules...
nvidia-driver-ctr   ++ for i in $(seq 1 50)
nvidia-driver-ctr   ++ break
nvidia-driver-ctr   ++ '[' -f /var/run/nvidia-fabricmanager/nv-fabricmanager.pid ']'
nvidia-driver-ctr   ++ '[' -f /sys/module/nvidia_modeset/refcnt ']'
nvidia-driver-ctr   ++ rmmod_args+=("nvidia-modeset")
nvidia-driver-ctr   ++ (( ++nvidia_deps ))
nvidia-driver-ctr   ++ '[' -f /sys/module/nvidia_uvm/refcnt ']'
nvidia-driver-ctr   ++ nvidia_uvm_refs=2
nvidia-driver-ctr   ++ rmmod_args+=("nvidia-uvm")
nvidia-driver-ctr   ++ (( ++nvidia_deps ))
nvidia-driver-ctr   ++ nvidia_refs=455
nvidia-driver-ctr   ++ rmmod_args+=("nvidia")
nvidia-driver-ctr   ++ '[' 455 -gt 2 ']'
nvidia-driver-ctr   ++ return 1
nvidia-driver-ctr   ++ '[' -f /sys/module/nvidia/refcnt ']'
nvidia-driver-ctr   ++ echo 'Could not unload NVIDIA driver kernel modules, driver is in use'
nvidia-driver-ctr   Could not unload NVIDIA driver kernel modules, driver is in use
nvidia-driver-ctr   ++ return 1
nvidia-driver-ctr   + continue
nvidia-driver-ctr   + true
nvidia-driver-ctr   + wait 28127
k8s-driver-manager  Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
k8s-driver-manager  nvidia driver module is already loaded with refcount 455
k8s-driver-manager  node/[hostname] labeled
k8s-driver-manager  Waiting for the operator-validator to shutdown
nvidia-device-plugin-ctr    2022/11/28 18:49:53 Received signal "terminated", shutting down.
nvidia-device-plugin-ctr    2022/11/28 18:49:53 Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
nvidia-dcgm-ctr Started host engine version 2.2.3 using port number: 5555 
nvidia-container-toolkit-ctr    time="2022-11-28T18:49:53Z" level=info msg="Cleaning up Runtime"
nvidia-container-toolkit-ctr    time="2022-11-28T18:49:53Z" level=info msg="Shutting Down"
nvidia-container-toolkit-ctr    time="2022-11-28T18:49:53Z" level=info msg="Completed nvidia-toolkit"
nvidia-container-toolkit-ctr    time="2022-11-28T18:49:53Z" level=info msg="Starting 'cleanup' for crio"
nvidia-device-plugin-ctr    2022/11/28 18:49:57 Shutdown of NVML returned: <nil>
nvidia-node-status-exporter time="2022-11-28T18:50:19Z" level=info msg="metrics: StatusFile: 'toolkit-ready' is not ready"
nvidia-node-status-exporter time="2022-11-28T18:50:19Z" level=info msg="metrics: StatusFile: 'driver-ready' is not ready"
nvidia-node-status-exporter time="2022-11-28T18:50:19Z" level=info msg="metrics: StatusFile: 'plugin-ready' is not ready"
nvidia-node-status-exporter time="2022-11-28T18:50:19Z" level=info msg="metrics: StatusFile: 'cuda-ready' is not ready"
nvidia-node-status-exporter time="2022-11-28T18:50:19Z" level=info msg="metrics: StatusFile: 'plugin-ready' is not ready"
nvidia-node-status-exporter time="2022-11-28T18:50:19Z" level=info msg="metrics: StatusFile: 'cuda-ready' is not ready"
nvidia-node-status-exporter time="2022-11-28T18:50:19Z" level=info msg="metrics: StatusFile: 'driver-ready' is not ready"
nvidia-node-status-exporter time="2022-11-28T18:50:19Z" level=info msg="metrics: StatusFile: 'toolkit-ready' is not ready"
k8s-driver-manager  pod/nvidia-operator-validator-6lgb8 condition met
k8s-driver-manager  Waiting for the container-toolkit to shutdown
k8s-driver-manager  Waiting for the device-plugin to shutdown
k8s-driver-manager  Waiting for gpu-feature-discovery to shutdown
k8s-driver-manager  Waiting for dcgm-exporter to shutdown
k8s-driver-manager  Unloading NVIDIA driver kernel modules...
k8s-driver-manager  Unable to cleanup driver modules, attempting again with node drain...
k8s-driver-manager  nvidia_modeset       1196032  0
k8s-driver-manager  nvidia_uvm           1163264  2
k8s-driver-manager  nvidia              35266560  410 nvidia_uvm,nvidia_modeset
k8s-driver-manager  drm                   569344  5 vmwgfx,drm_kms_helper,nvidia,ttm
k8s-driver-manager  Could not unload NVIDIA driver kernel modules, driver is in use
k8s-driver-manager  Draining node [hostname]...
k8s-driver-manager  node/[hostname] cordoned
k8s-driver-manager  error: unable to drain node "[hostname]", aborting command...
k8s-driver-manager  
k8s-driver-manager  error: cannot delete Pods with local storage (use --delete-emptydir-data to override): hogehoge-5f965449c6-hlfcw, hogehoge2-5b4b45f78c-7257c
k8s-driver-manager  The new behavior will make the drain command go through all nodes even if one or more nodes failed during the drain.
k8s-driver-manager  For now, users can try such experience via: --ignore-errors
k8s-driver-manager  There are pending nodes to be drained:
k8s-driver-manager  DEPRECATED WARNING: Aborting the drain command in a list of nodes will be deprecated in v1.23.
k8s-driver-manager   [hostname]
k8s-driver-manager  Uncordoning node [hostname]...
k8s-driver-manager  node/[hostname] uncordoned
k8s-driver-manager  Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
k8s-driver-manager  node/[hostname] labeled
k8s-driver-manager  nvidia driver module is already loaded with refcount 410
k8s-driver-manager  Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
k8s-driver-manager  node/[hostname] labeled
k8s-driver-manager  Waiting for the operator-validator to shutdown
toolkit-validation  waiting for nvidia container stack to be setup
toolkit-validation  waiting for nvidia container stack to be setup
toolkit-validation  waiting for nvidia container stack to be setup
toolkit-validation  waiting for nvidia container stack to be setup
k8s-driver-manager  pod/nvidia-operator-validator-s58sw condition met
k8s-driver-manager  Waiting for the container-toolkit to shutdown
toolkit-validation  waiting for nvidia container stack to be setup
toolkit-validation  waiting for nvidia container stack to be setup
nvidia-node-status-exporter time="2022-11-28T18:50:49Z" level=info msg="metrics: StatusFile: 'driver-ready' is ready"
nvidia-node-status-exporter time="2022-11-28T18:50:49Z" level=info msg="metrics: StatusFile: 'driver-ready' is ready"
toolkit-validation  waiting for nvidia container stack to be setup
toolkit-validation  waiting for nvidia container stack to be setup
k8s-driver-manager  pod/nvidia-container-toolkit-daemonset-dvxwm condition met
k8s-driver-manager  Waiting for the device-plugin to shutdown
toolkit-validation  waiting for nvidia container stack to be setup
toolkit-validation  waiting for nvidia container stack to be setup
toolkit-validation  waiting for nvidia container stack to be setup
toolkit-validation  waiting for nvidia container stack to be setup
toolkit-validation  waiting for nvidia container stack to be setup
toolkit-validation  waiting for nvidia container stack to be setup

kousui-dev commented 1 year ago

@shivamerla This URL had the following information:

"Fixed an issue where Driver Daemonset was spuriously updated on RedHat OpenShift causing repeated restarts in Proxy environments."

What was the cause of that issue? Is it possible that the same kind of spurious update happened in this event as well?

Although the OpenShift EUS update was in progress, the configuration of the GPU Operator pods was not changed, so it is very puzzling that the pods restarted on all GPU servers.

Also, this URL had the following information:

Could there be a problem with the entitlement? Does the entitlement have an expiration date?

kousui-dev commented 1 year ago

I already know the cause of and the workaround for the error message below, so I do not need help with that part. What I want to know is why the restart happened in the first place.

k8s-driver-manager  error: unable to drain node "[hostname]", aborting command...
k8s-driver-manager  
k8s-driver-manager  error: cannot delete Pods with local storage (use --delete-emptydir-data to override): hogehoge-5f965449c6-hlfcw, hogehoge2-5b4b45f78c-7257c
k8s-driver-manager  The new behavior will make the drain command go through all nodes even if one or more nodes failed during the drain.
k8s-driver-manager  For now, users can try such experience via: --ignore-errors
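
For reference, the workaround the drain error itself points to is allowing the drain to delete pods that use emptyDir volumes. A minimal manual equivalent would be something like the following, assuming deleting that local data is acceptable for those workloads:

kubectl drain [hostname] --ignore-daemonsets --delete-emptydir-data
kubectl uncordon [hostname]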

shivamerla commented 1 year ago

@kousui-dev The GPU Operator passes the following environment variables to the driver Daemonset, parsed from the /etc/os-release file on the node it is running on.

RHEL_VERSION
OPENSHIFT_VERSION

But if worker nodes are not restarted, then this information should not change and should not cause a spec update for the driver Daemonset. Entitlements do expire quite often, but they are not automatically renewed and require the user to apply them again. The proxy configuration is injected into the driver container as well, so any change to HTTP_PROXY, HTTPS_PROXY, etc. might cause driver restarts.
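
A quick way to compare is to dump those env values from the driver Daemonset spec and from the node itself; the namespace and daemonset name below are assumed from the commands earlier in this thread and may differ in your install:

kubectl -n nvidia-gpu-operator get ds nvidia-driver-daemonset -o yaml \
  | grep -A1 -E 'RHEL_VERSION|OPENSHIFT_VERSION|HTTPS?_PROXY'
# on the node itself (e.g. from a debug shell after chroot /host):
grep -E 'RHEL_VERSION|OPENSHIFT_VERSION' /etc/os-release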

kousui-dev commented 1 year ago

@shivamerla Thanks.

I referred to this URL:

imagePullPolicy: IfNotPresent
env:
- name: ENABLE_AUTO_DRAIN
  value: "true"
- name: DRAIN_USE_FORCE
  value: "false"
- name: DRAIN_POD_SELECTOR_LABEL
  value: ""
- name: DRAIN_TIMEOUT_SECONDS
  value: "0s"
- name: DRAIN_DELETE_EMPTYDIR_DATA
  value: "false"

I had ENABLE_AUTO_DRAIN set to "true". Is it possible that this is what caused the GPU pods (such as nvidia-device-plugin-daemonset) to restart?

Also, should I set it to "false" to prevent the GPU pods (such as nvidia-device-plugin-daemonset) from restarting unexpectedly?

shivamerla commented 1 year ago

@kousui-dev We are releasing v22.9.1 of the operator next week, which will support the OnDelete upgradeStrategy for all Daemonsets. With that version these restarts can be avoided, and upgrades will be triggered only when you manually delete pods on the node. I think that feature will be beneficial for cases like this. ENABLE_AUTO_DRAIN is also disabled by default from next release onwards and only GPU pods will be deleted for driver upgrades instead of node drain.

kousui-dev commented 1 year ago

@shivamerla Thanks.

ENABLE_AUTO_DRAIN is also disabled by default from next release onwards and only GPU pods will be deleted for driver upgrades instead of node drain.

In the current version a node drain is performed, so is it correct that it also tries to drain Pods other than the GPU pods?

Also, is my understanding below correct? GPU pods may restart unexpectedly with both gpu-operator 1.8.2 and the latest version.

shivamerla commented 1 year ago

@kousui-dev Correct, it is not required to drain other pods during driver container updates; only GPU clients need to be evicted. That logic is being updated with release next week. With current versions, yes, GPU pods may restart whenever the driver container is being updated. With the "OnDelete" support I mentioned, this won't happen until a user/admin intervenes to delete the old driver pods manually.

kousui-dev commented 1 year ago

@shivamerla In what cases does the driver container get updated?

kousui-dev commented 1 year ago

@shivamerla Do these unexpected restarts occur only in OpenShift environments, or can the same phenomenon occur with upstream Kubernetes as well?

kousui-dev commented 1 year ago

@shivamerla

That logic is being updated with release next week.

Has a fixed version been released?

shivamerla commented 1 year ago

@kousui-dev Yes, with v22.9.1 you can now set daemonsets.updateStrategy=OnDelete in the ClusterPolicy instance (CR). The default setting is still "RollingUpdate". This ensures that the driver pod will not automatically restart on any spec update; manual pod restarts are required to apply changes. Please note that all operand pods (daemonsets) will use this setting, so for any spec change they would have to be manually restarted. This applies to both OCP and K8s environments.
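
For anyone else hitting this, here is a minimal sketch of applying that setting with kubectl. The ClusterPolicy instance name is an assumption (it is often gpu-cluster-policy when installed from OperatorHub), so check it with the first command:

kubectl get clusterpolicy
kubectl patch clusterpolicy/gpu-cluster-policy --type merge \
  -p '{"spec":{"daemonsets":{"updateStrategy":"OnDelete"}}}'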

The commands below will restart all operands at once when you want to apply any changes (e.g. version upgrades).

kubectl label node <node-name> nvidia.com/gpu.deploy.operands=false --overwrite
kubectl label node <node-name> nvidia.com/gpu.deploy.operands=true --overwrite