kousui-dev opened 1 year ago
@kousui-dev Can you get the history of the daemonset rollouts that happened as part of the EUS-to-EUS upgrades? If the driver daemonset was also restarted, please get its revisions so we can compare the spec change. The driver pod will restart all other pods, as they have a dependency on it.
kubectl rollout history --revision=0 ds/nvidia-driver-daemonset -n nvidia-gpu-operator -o yaml
kubectl rollout history --revision=1 ds/nvidia-driver-daemonset -n nvidia-gpu-operator -o yaml
@shivamerla Since the cluster has already been updated to GPU Operator 1.11, I can no longer obtain those logs.
I have obtained the logs of the gpu-operator-resources containers on the GPU servers from that time, so I will share them. It looks like a restart occurred after receiving some kind of signal. The gpu-operator-resources containers were restarted on all 88 GPU servers, and the same log can be confirmed on every one of them.
Does anything in them stand out to you as a possible cause?
kubernetes.container_name message
nvidia-driver-ctr Caught signal
nvidia-driver-ctr ++ echo 'Caught signal'
nvidia-driver-ctr ++ _shutdown
nvidia-driver-ctr ++ local rmmod_args
nvidia-driver-ctr ++ local nvidia_deps=0
nvidia-driver-ctr ++ local nvidia_refs=0
nvidia-driver-ctr Stopping NVIDIA persistence daemon...
nvidia-driver-ctr ++ rmmod_args=()
nvidia-driver-ctr ++ local nvidia_uvm_refs=0
nvidia-driver-ctr ++ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
nvidia-driver-ctr ++ _unload_driver
nvidia-driver-ctr ++ local nvidia_modeset_refs=0
nvidia-driver-ctr ++ echo 'Stopping NVIDIA persistence daemon...'
nvidia-driver-ctr ++ local pid=28102
nvidia-driver-ctr +++ seq 1 50
nvidia-driver-ctr ++ kill -SIGTERM 28102
nvidia-driver-ctr ++ kill -0 28102
nvidia-driver-ctr ++ sleep 0.1
nvidia-driver-ctr ++ for i in $(seq 1 50)
nvidia-driver-ctr ++ kill -0 28102
nvidia-driver-ctr ++ '[' 2 -eq 50 ']'
nvidia-driver-ctr ++ echo 'Unloading NVIDIA driver kernel modules...'
nvidia-driver-ctr ++ nvidia_modeset_refs=0
nvidia-driver-ctr ++ '[' -f /var/run/nvidia-gridd/nvidia-gridd.pid ']'
nvidia-driver-ctr Unloading NVIDIA driver kernel modules...
nvidia-driver-ctr ++ for i in $(seq 1 50)
nvidia-driver-ctr ++ break
nvidia-driver-ctr ++ '[' -f /var/run/nvidia-fabricmanager/nv-fabricmanager.pid ']'
nvidia-driver-ctr ++ '[' -f /sys/module/nvidia_modeset/refcnt ']'
nvidia-driver-ctr ++ rmmod_args+=("nvidia-modeset")
nvidia-driver-ctr ++ (( ++nvidia_deps ))
nvidia-driver-ctr ++ '[' -f /sys/module/nvidia_uvm/refcnt ']'
nvidia-driver-ctr ++ nvidia_uvm_refs=2
nvidia-driver-ctr ++ rmmod_args+=("nvidia-uvm")
nvidia-driver-ctr ++ (( ++nvidia_deps ))
nvidia-driver-ctr ++ nvidia_refs=455
nvidia-driver-ctr ++ rmmod_args+=("nvidia")
nvidia-driver-ctr ++ '[' 455 -gt 2 ']'
nvidia-driver-ctr ++ return 1
nvidia-driver-ctr ++ '[' -f /sys/module/nvidia/refcnt ']'
nvidia-driver-ctr ++ echo 'Could not unload NVIDIA driver kernel modules, driver is in use'
nvidia-driver-ctr Could not unload NVIDIA driver kernel modules, driver is in use
nvidia-driver-ctr ++ return 1
nvidia-driver-ctr + continue
nvidia-driver-ctr + true
nvidia-driver-ctr + wait 28127
k8s-driver-manager Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
k8s-driver-manager nvidia driver module is already loaded with refcount 455
k8s-driver-manager node/[hostname] labeled
k8s-driver-manager Waiting for the operator-validator to shutdown
nvidia-device-plugin-ctr 2022/11/28 18:49:53 Received signal "terminated", shutting down.
nvidia-device-plugin-ctr 2022/11/28 18:49:53 Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
nvidia-dcgm-ctr Started host engine version 2.2.3 using port number: 5555
nvidia-container-toolkit-ctr time="2022-11-28T18:49:53Z" level=info msg="Cleaning up Runtime"
nvidia-container-toolkit-ctr time="2022-11-28T18:49:53Z" level=info msg="Shutting Down"
nvidia-container-toolkit-ctr time="2022-11-28T18:49:53Z" level=info msg="Completed nvidia-toolkit"
nvidia-container-toolkit-ctr time="2022-11-28T18:49:53Z" level=info msg="Starting 'cleanup' for crio"
nvidia-device-plugin-ctr 2022/11/28 18:49:57 Shutdown of NVML returned: <nil>
nvidia-node-status-exporter time="2022-11-28T18:50:19Z" level=info msg="metrics: StatusFile: 'toolkit-ready' is not ready"
nvidia-node-status-exporter time="2022-11-28T18:50:19Z" level=info msg="metrics: StatusFile: 'driver-ready' is not ready"
nvidia-node-status-exporter time="2022-11-28T18:50:19Z" level=info msg="metrics: StatusFile: 'plugin-ready' is not ready"
nvidia-node-status-exporter time="2022-11-28T18:50:19Z" level=info msg="metrics: StatusFile: 'cuda-ready' is not ready"
nvidia-node-status-exporter time="2022-11-28T18:50:19Z" level=info msg="metrics: StatusFile: 'plugin-ready' is not ready"
nvidia-node-status-exporter time="2022-11-28T18:50:19Z" level=info msg="metrics: StatusFile: 'cuda-ready' is not ready"
nvidia-node-status-exporter time="2022-11-28T18:50:19Z" level=info msg="metrics: StatusFile: 'driver-ready' is not ready"
nvidia-node-status-exporter time="2022-11-28T18:50:19Z" level=info msg="metrics: StatusFile: 'toolkit-ready' is not ready"
k8s-driver-manager pod/nvidia-operator-validator-6lgb8 condition met
k8s-driver-manager Waiting for the container-toolkit to shutdown
k8s-driver-manager Waiting for the device-plugin to shutdown
k8s-driver-manager Waiting for gpu-feature-discovery to shutdown
k8s-driver-manager Waiting for dcgm-exporter to shutdown
k8s-driver-manager Unloading NVIDIA driver kernel modules...
k8s-driver-manager Unable to cleanup driver modules, attempting again with node drain...
k8s-driver-manager nvidia_modeset 1196032 0
k8s-driver-manager nvidia_uvm 1163264 2
k8s-driver-manager nvidia 35266560 410 nvidia_uvm,nvidia_modeset
k8s-driver-manager drm 569344 5 vmwgfx,drm_kms_helper,nvidia,ttm
k8s-driver-manager Could not unload NVIDIA driver kernel modules, driver is in use
k8s-driver-manager Draining node [hostname]...
k8s-driver-manager node/[hostname] cordoned
k8s-driver-manager error: unable to drain node "[hostname]", aborting command...
k8s-driver-manager
k8s-driver-manager error: cannot delete Pods with local storage (use --delete-emptydir-data to override): hogehoge-5f965449c6-hlfcw, hogehoge2-5b4b45f78c-7257c
k8s-driver-manager The new behavior will make the drain command go through all nodes even if one or more nodes failed during the drain.
k8s-driver-manager For now, users can try such experience via: --ignore-errors
k8s-driver-manager There are pending nodes to be drained:
k8s-driver-manager DEPRECATED WARNING: Aborting the drain command in a list of nodes will be deprecated in v1.23.
k8s-driver-manager [hostname]
k8s-driver-manager Uncordoning node [hostname]...
k8s-driver-manager node/[hostname] uncordoned
k8s-driver-manager Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
k8s-driver-manager node/[hostname] labeled
k8s-driver-manager nvidia driver module is already loaded with refcount 410
k8s-driver-manager Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
k8s-driver-manager node/[hostname] labeled
k8s-driver-manager Waiting for the operator-validator to shutdown
toolkit-validation waiting for nvidia container stack to be setup
toolkit-validation waiting for nvidia container stack to be setup
toolkit-validation waiting for nvidia container stack to be setup
toolkit-validation waiting for nvidia container stack to be setup
k8s-driver-manager pod/nvidia-operator-validator-s58sw condition met
k8s-driver-manager Waiting for the container-toolkit to shutdown
toolkit-validation waiting for nvidia container stack to be setup
toolkit-validation waiting for nvidia container stack to be setup
nvidia-node-status-exporter time="2022-11-28T18:50:49Z" level=info msg="metrics: StatusFile: 'driver-ready' is ready"
nvidia-node-status-exporter time="2022-11-28T18:50:49Z" level=info msg="metrics: StatusFile: 'driver-ready' is ready"
toolkit-validation waiting for nvidia container stack to be setup
toolkit-validation waiting for nvidia container stack to be setup
k8s-driver-manager pod/nvidia-container-toolkit-daemonset-dvxwm condition met
k8s-driver-manager Waiting for the device-plugin to shutdown
toolkit-validation waiting for nvidia container stack to be setup
toolkit-validation waiting for nvidia container stack to be setup
toolkit-validation waiting for nvidia container stack to be setup
toolkit-validation waiting for nvidia container stack to be setup
toolkit-validation waiting for nvidia container stack to be setup
toolkit-validation waiting for nvidia container stack to be setup
@shivamerla
This URL had the following information:
Fixed an issue where Driver Daemonset was spuriously updated on RedHat OpenShift causing repeated restarts in Proxy environments.
What was the cause of that issue? Is it possible that the same spurious restart happened on OpenShift in this event as well?
Although the OpenShift EUS update was in progress, the settings of the GPU Operator pods had not been changed, so it is very puzzling that the pods restarted on all GPU servers.
Also, this URL had the following information:
Maybe there is a problem with entitlement? Does the entitlement have an expiration date?
I know the cause of and the workaround for this error message, so I don't need to address it here. What I want to know is why the restart happened.
k8s-driver-manager error: unable to drain node "[hostname]", aborting command...
k8s-driver-manager
k8s-driver-manager error: cannot delete Pods with local storage (use --delete-emptydir-data to override): hogehoge-5f965449c6-hlfcw, hogehoge2-5b4b45f78c-7257c
k8s-driver-manager The new behavior will make the drain command go through all nodes even if one or more nodes failed during the drain.
k8s-driver-manager For now, users can try such experience via: --ignore-errors
@kousui-dev GPU Operator passes the following env vars to the driver Daemonset, parsed from the /etc/os-release file on the node where it is running:
RHEL_VERSION
OPENSHIFT_VERSION
But if worker nodes are not restarted, this information should not change and should not cause a spec update for the driver Daemonset. Entitlements do expire quite often, but they are not automatically renewed and need the user to apply them again. Proxy config is injected into the driver container as well, so any change to HTTP_PROXY, HTTPS_PROXY, etc. might cause driver restarts.
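A minimal sketch of the mechanism described above, as I understand it: values are read from the node's /etc/os-release and injected into the driver Daemonset's pod template, so any change to them (or to the proxy variables) alters the template and triggers a rollout.

```shell
# Sketch only: read the node's /etc/os-release the way a script might,
# and print the fields that version-specific env vars are derived from.
# shellcheck disable=SC1091
. /etc/os-release                  # defines ID, VERSION_ID, PRETTY_NAME, ...
echo "node OS: ${ID:-unknown} ${VERSION_ID:-unknown}"
# If any injected value (or HTTP_PROXY/HTTPS_PROXY) changes, the pod
# template hash changes and the DaemonSet controller rolls the driver pods.
```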
@shivamerla Thanks.
I referred to this URL.
imagePullPolicy: IfNotPresent
env:
- name: ENABLE_AUTO_DRAIN
value: "true"
- name: DRAIN_USE_FORCE
value: "false"
- name: DRAIN_POD_SELECTOR_LABEL
value: ""
- name: DRAIN_TIMEOUT_SECONDS
value: "0s"
- name: DRAIN_DELETE_EMPTYDIR_DATA
value: "false"
I had ENABLE_AUTO_DRAIN set to "true". Is it possible that this is what caused the GPU pods (such as nvidia-device-plugin-daemonset) to restart?
Also, should I set it to "false" to prevent the GPU pods (such as nvidia-device-plugin-daemonset) from restarting unexpectedly?
@kousui-dev we are releasing the v22.9.1 version of the operator next week, which will support the OnDelete upgradeStrategy for all Daemonsets. With that version these restarts can be avoided, and upgrades will be triggered only when you manually delete pods on the node. I think that feature will be beneficial for these cases. ENABLE_AUTO_DRAIN is also disabled by default from the next release onwards, and only GPU pods will be deleted for driver upgrades instead of a node drain.
@shivamerla Thanks.
ENABLE_AUTO_DRAIN is also disabled by default from next release onwards and only GPU pods will be deleted for driver upgrades instead of node drain.
In the current version, a node drain is performed, so is it correct that it also tries to drain pods other than the GPU pods?
Also, is my understanding below correct? GPU pods may restart unexpectedly with gpu-operator 1.8.2 and with the latest version.
@kousui-dev Correct, it is not required to drain other pods during driver container updates; only GPU clients need to be evicted. That logic is being updated with the release next week. With current versions, yes, GPU pods may restart whenever the driver container is getting updated. With the "OnDelete" support I mentioned, this won't happen until a user/admin intervenes to delete the old driver pods manually.
@shivamerla In what cases does the driver container get updated?
@shivamerla Do the unexpected restarts occur only in OpenShift environments? Or can the same phenomenon occur on upstream Kubernetes?
@shivamerla
That logic is being updated with release next week.
Has a fixed version been released?
@kousui-dev Yes, with v22.9.1 you can now set daemonsets.updateStrategy=OnDelete in the ClusterPolicy instance (CR). The default setting is still "RollingUpgrade". This will ensure that the driver pod will not automatically restart on any spec updates; manual pod restarts are required for applying any changes. Please note that all operand pods (daemonsets) will have this setting, so for any spec change they would have to be manually restarted. This applies to both OCP and K8s environments.
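For reference, a ClusterPolicy fragment with this setting might look like the following (a sketch; the CR name `cluster-policy` is the common default but may differ in your cluster, so check with `kubectl get clusterpolicy`):

```yaml
# Hypothetical fragment -- only the field discussed above is shown.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy        # assumption: the default CR name
spec:
  daemonsets:
    updateStrategy: "OnDelete"   # default per the comment above is "RollingUpgrade"
```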
The commands below will restart all operands at once when you want to apply any changes (e.g. version upgrades).
kubectl label node <node-name> nvidia.com/gpu.deploy.operands=false --overwrite
kubectl label node <node-name> nvidia.com/gpu.deploy.operands=true --overwrite
1. Quick Debug Checklist
2. Issue or feature description
Hello everyone,
A cluster running OCP 4.8.41 with GPU Operator 1.8.2 installed has been upgraded to OCP 4.10.26 by an EUS-to-EUS update. The composition of the OCP cluster is as follows: 3 master nodes, 3 infra nodes, and 88 worker nodes.
When updating from OCP 4.8.41 to OCP 4.9.45, the machineconfig settings (the worker machine config pool is paused) prevent worker nodes from rebooting.
Reference: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.8/html-single/updating_clusters/index#updating-eus-to-eus-upgrade_eus-to-eus-upgrade
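For context, the referenced procedure keeps workers from rebooting by pausing the worker machine config pool; a sketch of the relevant field (assuming the standard pool name `worker`):

```yaml
# Sketch of the MachineConfigPool field toggled by the EUS-to-EUS procedure
# referenced above; pausing prevents worker nodes from rebooting mid-upgrade.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker
spec:
  paused: true   # set back to false once the upgrade steps are complete
```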
A while after executing the version upgrade command in that state, the pods managed by GPU Operator (such as nvidia-device-plugin-daemonset) were restarted.
I performed the same operation in another cluster, and no pod restarts occurred there.
Which is the correct behavior? If not restarting is the correct behavior, what caused the restarts?
thanks.