akash-network / support

Akash Support and Issue Tracking
Apache License 2.0

bump `nvidia-device-plugin` to `v0.16.1` #242

Open andy108369 opened 3 months ago

andy108369 commented 3 months ago

k8s-device-plugin v0.16.1 was released 3 days ago. Among other changes, the CUDA base image was updated to 12.5.1: https://github.com/NVIDIA/k8s-device-plugin/releases

Need to test the following:

andy108369 commented 3 months ago

Testing this on the Cato provider, which has had 0 leases since yesterday. Currently hitting this issue: https://github.com/NVIDIA/k8s-device-plugin/issues/856

andy108369 commented 3 months ago

Figured out the cause: the new nvidia-device-plugin 0.16.x helm charts (0.16.0 rc1, 0.16.0, 0.16.1) drop the SYS_ADMIN capability, which leads to the error `unable to create plugin manager: nvml init failed: ERROR_LIBRARY_NOT_FOUND`.
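One way to confirm this without touching the cluster is to render the chart locally and look for the capability in the output. The sketch below is a minimal check, not an official tool: `has_sys_admin` is a hypothetical helper name, and it only looks for `SYS_ADMIN` as a YAML list item, so it cannot tell an `add` list from a `drop` list.

```shell
# Reads a rendered Kubernetes manifest on stdin and reports whether
# SYS_ADMIN appears as a YAML list item anywhere in it.
# Hypothetical helper; it does not distinguish capabilities.add from
# capabilities.drop, so inspect the manifest by hand if in doubt.
has_sys_admin() {
  if grep -Eq '^[[:space:]]*-[[:space:]]*"?SYS_ADMIN"?[[:space:]]*$'; then
    echo "SYS_ADMIN present"
  else
    echo "SYS_ADMIN missing"
    return 1
  fi
}

# Usage (assumes the nvdp helm repo alias used in this thread is configured):
#   helm template nvdp nvdp/nvidia-device-plugin --version 0.16.1 \
#     --set deviceListStrategy=volume-mounts | has_sys_admin
```

Running the same pipeline with `--version 0.15.1` gives a side-by-side comparison of the two chart versions.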

Let's keep using nvidia-device-plugin 0.15.1 until https://github.com/NVIDIA/k8s-device-plugin/issues/856 is fixed or a better workaround is found, rather than manually customizing the helm chart.
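For reference, staying on 0.15.1 could look roughly like the sketch below. The function name `install_nvdp_pinned` is made up for illustration, and every flag other than `--version` is copied from the 0.16.1 command posted later in this thread, so it is an assumption that these match a given provider's existing install.

```shell
# Installs or upgrades the device plugin pinned to a chart version
# (default 0.15.1). Hypothetical wrapper; flags other than --version are
# carried over from the 0.16.1 command in this thread and may need adjusting.
install_nvdp_pinned() {
  helm upgrade --install nvdp nvdp/nvidia-device-plugin \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --version "${1:-0.15.1}" \
    --set runtimeClassName="nvidia" \
    --set deviceListStrategy=volume-mounts
}

# Usage:
#   install_nvdp_pinned          # pins to 0.15.1
#   install_nvdp_pinned 0.15.0   # or pass another chart version explicitly
```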

andy108369 commented 3 months ago

For the record: restarting nvidia-device-plugin (nvdp), or even uninstalling it, does not affect existing, active GPU workloads. They are affected only if their pod gets restarted: the pod then stays in Pending until it finds a worker node with a GPU. If the nvdp plugin is not running, the pod stays in Pending indefinitely.
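A rough way to observe this behavior is to check what the nodes advertise and which pods are stuck. Both helper names below are hypothetical, and the sketch assumes `kubectl` is already pointed at the provider's cluster.

```shell
# Prints each node with its allocatable nvidia.com/gpu count. A node that
# shows no value is not advertising GPUs, e.g. because nvdp is not running.
gpu_allocatable() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Lists pods in the Pending phase across all namespaces.
pending_pods() {
  kubectl get pods --all-namespaces --field-selector=status.phase=Pending
}

# Usage:
#   gpu_allocatable
#   pending_pods
```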

As expected, it also does not change the CUDA version reported by `nvidia-smi | grep Version` (changing that takes the cuda-compat-<ver> packages plus the LD_LIBRARY_PATH method of loading them).
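To compare the reported version before and after a plugin change, the number can be pulled out of the `nvidia-smi` header. This is a small sketch: `cuda_version` is a hypothetical helper, and it assumes the header contains a `CUDA Version: X.Y` field.

```shell
# Extracts the CUDA version from nvidia-smi header text on stdin.
# Prints nothing if no "CUDA Version:" field is found.
cuda_version() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p'
}

# Usage:
#   nvidia-smi | cuda_version
```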

andy108369 commented 3 months ago

Workaround

The quick workaround is to pass securityContext.capabilities.add[0]=SYS_ADMIN to the chart, e.g.:

```shell
# Quote the --set value so shells such as zsh do not treat [0] as a glob.
helm upgrade --install nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.16.1 \
  --set runtimeClassName="nvidia" \
  --set deviceListStrategy=volume-mounts \
  --set 'securityContext.capabilities.add[0]=SYS_ADMIN'
```

andy108369 commented 3 months ago

Will update our docs once a better fix for issue 856 is released.