Bumped to the latest GPU Operator and supporting tools.
Even though the supported path forward is to use only the GPU Operator, I am continuing to maintain and test the paths that use the Device Plugin without the operator.
GPU Operator -> v22.9.2 (NGC hosted)
Device Plugin -> v0.13.0
GFD -> v0.7.0
Also pushed some changes in here to address the SSL issues introduced by recent updates to the fedoraproject website.
Testing:
The existing tests already cover GFD + device plugin and GPU Operator installs. This release appears to add a few features and upgrade paths that could eventually be included in our testing harness, but I have not done so for this PR.
Upgrade steps:
For existing clusters an upgrade can be done by:
Tainting all GPU nodes as NoSchedule and evacuating all running GPU workloads or waiting for them to complete, then running:
helm delete -n gpu-operator-resources nvidia-gpu-operator
kubectl delete crd clusterpolicies.nvidia.com
ansible-playbook playbooks/k8s-cluster/nvidia-gpu-operator.yml
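
For convenience, the upgrade steps above could be wrapped in a small script along these lines. This is only a sketch, not part of the PR: the node label `nvidia.com/gpu.present=true` and the taint key `gpu-upgrade` are assumptions chosen for illustration, so adjust them to match the cluster. It prints the commands by default; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumptions (not from this PR): GPU nodes carry the label below, and the
# taint key is a hypothetical name picked for this sketch.
GPU_SELECTOR="${GPU_SELECTOR:-nvidia.com/gpu.present=true}"
TAINT_KEY="${TAINT_KEY:-gpu-upgrade}"
# Dry run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

upgrade_gpu_operator() {
  # 1. Keep new workloads off the GPU nodes.
  run kubectl taint nodes -l "$GPU_SELECTOR" "${TAINT_KEY}=true:NoSchedule" --overwrite
  # Evacuating running GPU workloads (or waiting for them to finish) is
  # left as a manual step here.

  # 2. Remove the old operator release and its CRD.
  run helm delete -n gpu-operator-resources nvidia-gpu-operator
  run kubectl delete crd clusterpolicies.nvidia.com

  # 3. Reinstall via the playbook.
  run ansible-playbook playbooks/k8s-cluster/nvidia-gpu-operator.yml
}

upgrade_gpu_operator
```

The sketch leaves the taint in place afterwards, since removing it is not described in the steps above.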