Open · andy108369 opened this issue 3 months ago
Testing this on the Cato provider, which has had 0 leases since yesterday. Currently hitting this issue: https://github.com/NVIDIA/k8s-device-plugin/issues/856
Figured out the issue: the new nvidia-device-plugin 0.16.x helm charts (0.16.0-rc1, 0.16.0, 0.16.1) drop the `SYS_ADMIN` capability, leading to the `unable to create plugin manager: nvml init failed: ERROR_LIBRARY_NOT_FOUND` error.
Let's keep using nvidia-device-plugin 0.15.1 until https://github.com/NVIDIA/k8s-device-plugin/issues/856 is fixed or a better workaround is found, instead of modifying/customizing the helm chart manually.
For the record: restarting `nvidia-device-plugin` (nvdp), or even uninstalling it, does not impact already existing & active GPU workloads. It impacts them only once their pod gets restarted: the pod will go into `Pending` state until it finds a worker node with a GPU. If the nvdp plugin is not running, the pod will stay in `Pending` state forever.
And, as expected, it does not change the CUDA version reported by `nvidia-smi | grep Version`, since changing that requires the `cuda-compat-<ver>` packages plus the `LD_LIBRARY_PATH` method of loading them.
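As a sketch of that `LD_LIBRARY_PATH` mechanism (the `/usr/local/cuda/compat` path is an assumption for illustration; the actual path depends on where the installed `cuda-compat-<ver>` package places its libraries):

```shell
# Sketch (assumed path): prepend the cuda-compat user-space driver libraries
# to the loader path so they are picked up ahead of the host-injected ones,
# which is what makes nvidia-smi report the newer CUDA version.
# /usr/local/cuda/compat is an assumption; check where cuda-compat-<ver>
# actually installed its libraries.
export LD_LIBRARY_PATH=/usr/local/cuda/compat${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
echo "$LD_LIBRARY_PATH"
```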
The quick workaround is to pass `securityContext.capabilities.add[0]=SYS_ADMIN` to the chart, e.g.:
```shell
helm upgrade --install nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.16.1 \
  --set runtimeClassName="nvidia" \
  --set deviceListStrategy=volume-mounts \
  --set securityContext.capabilities.add[0]=SYS_ADMIN
```
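Equivalently, the same overrides can be kept in a values file instead of repeated `--set` flags (a sketch; the key names are taken directly from the `--set` paths above):

```yaml
# values.yaml for the nvdp/nvidia-device-plugin chart — sketch of the same
# overrides that the --set flags above apply on the command line
runtimeClassName: nvidia
deviceListStrategy: volume-mounts
securityContext:
  capabilities:
    add:
      - SYS_ADMIN
```

applied with `helm upgrade --install nvdp nvdp/nvidia-device-plugin --namespace nvidia-device-plugin --create-namespace --version 0.16.1 -f values.yaml`.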
Going to update our docs once a better fix for issue 856 is released.
`k8s-device-plugin` v0.16.1 was released 3 days ago: it updates the CUDA base image version to `12.5.1`, among other changes (https://github.com/NVIDIA/k8s-device-plugin/releases).

Need to test the following:

- upgrading the `nvidia-device-plugin` helm chart up to `0.16.1` without impacting existing GPU deployments (can probably pick a provider with the least-used GPUs; the sandbox will probably do best)
- `nvidia-smi | grep Version` (probably this isn't related, but still worth checking)
- updating `0.15.1` to `0.16.1` in the docs https://akash.network/docs/providers/build-a-cloud-provider/gpu-resource-enablement/nvidia-device-plugin across all the GPU providers