Open abravalheri opened 3 months ago
To investigate this issue, I had a look at the microk8s code:
https://github.com/canonical/microk8s/blob/v1.30/scripts/wrappers/common/utils.py#L566-L572
There, the `is_enabled` function is defined as a simple substring check.
The false positive seems to come from this check (maybe it is a bit fragile?), but maybe also from the fact that `microk8s disable nvidia` does not delete any of the cluster roles created during `microk8s enable nvidia`?
$ kubectl get clusterroles --show-kind --no-headers
clusterrole.rbac.authorization.k8s.io/coredns 2024-07-31T18:38:00Z
clusterrole.rbac.authorization.k8s.io/gpu-operator 2024-08-02T14:48:45Z
clusterrole.rbac.authorization.k8s.io/gpu-operator-node-feature-discovery 2024-08-02T14:48:45Z
clusterrole.rbac.authorization.k8s.io/gpu-operator-node-feature-discovery-gc 2024-08-02T14:48:45Z
clusterrole.rbac.authorization.k8s.io/microk8s-hostpath 2024-07-31T15:59:53Z
clusterrole.rbac.authorization.k8s.io/multus 2024-07-31T18:52:46Z
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader 2024-08-01T16:23:09Z
clusterrole.rbac.authorization.k8s.io/system:kube-ovn-app 2024-07-31T18:08:42Z
clusterrole.rbac.authorization.k8s.io/system:kube-ovn-cni 2024-07-31T18:08:42Z
clusterrole.rbac.authorization.k8s.io/system:metrics-server 2024-08-01T16:23:09Z
clusterrole.rbac.authorization.k8s.io/system:ovn 2024-07-31T18:08:42Z
clusterrole.rbac.authorization.k8s.io/system:ovn-ovs 2024-07-31T18:08:42Z
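To illustrate why a substring check can report a false positive here, the sketch below runs two checks against a few lines of the listing above. It is not the actual microk8s implementation: the function names, the `gpu-operator` marker string, and the `deployment.apps/gpu-operator` resource name are all assumptions made for illustration only.

```python
# Lines copied from the `kubectl get clusterroles` output above
# (these cluster roles remain after `microk8s disable nvidia`).
KUBECTL_OUTPUT = """\
clusterrole.rbac.authorization.k8s.io/coredns 2024-07-31T18:38:00Z
clusterrole.rbac.authorization.k8s.io/gpu-operator 2024-08-02T14:48:45Z
clusterrole.rbac.authorization.k8s.io/gpu-operator-node-feature-discovery 2024-08-02T14:48:45Z
clusterrole.rbac.authorization.k8s.io/microk8s-hostpath 2024-07-31T15:59:53Z
"""


def substring_is_enabled(marker: str, kubectl_output: str) -> bool:
    """Simplified stand-in for a substring-based check (hypothetical)."""
    return marker in kubectl_output


def exact_is_enabled(resource: str, kubectl_output: str) -> bool:
    """Stricter variant: require an exact kind/name match in the first column."""
    names = {line.split()[0] for line in kubectl_output.splitlines() if line.strip()}
    return resource in names


# Assumed marker string; the real one used for the nvidia addon may differ.
print(substring_is_enabled("gpu-operator", KUBECTL_OUTPUT))

# Assumed workload name; no such resource exists once the addon is disabled.
print(exact_is_enabled("deployment.apps/gpu-operator", KUBECTL_OUTPUT))
```

With these inputs, the substring check matches the leftover cluster roles and reports the addon as enabled, while the exact check on a workload resource does not. Whether the real check is fed this kind of output is something a maintainer would need to confirm.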
Summary
When trying to enable the NVIDIA addon in MicroK8s, I encountered an error message. After attempting to disable and re-enable the addon, MicroK8s incorrectly states that the addon is already enabled, despite no related pods running in any namespace.
What Should Happen Instead?
Enabling the operator should work without errors. Disabling the operator should actually disable it, and MicroK8s should not incorrectly state that the operator is enabled when it isn't (and microk8s should not refuse to re-enable an addon based on incorrect assumptions).
Detailed Story
When I first tried to enable the addon, it replied with a weird message:
I checked to see if any pod related to the NVIDIA GPU operator was running in any namespace, but there was nothing (no new namespace was created and I only had pods in the `kube-system` namespace).
Then I decided to disable and enable it again:
This error was a bit disturbing, but I tried it again:
However, when I tried to enable the addon again, I got a message that it is already enabled:
This seems to be a bug, and it leaves me stuck in limbo: disabling the addon does not seem to do anything, and enabling it again does not work because microk8s seems to believe it is already enabled.
Can anyone help me solve this problem and get the `nvidia` addon working?
Reproduction Steps
Hardware utilised:
microk8s enable nvidia --no-network-operator --gpu-operator --gpu-operator-driver=host --gpu-operator-version=v24.6.0
microk8s disable nvidia
microk8s enable nvidia --no-network-operator --gpu-operator --gpu-operator-driver=host --gpu-operator-version=v24.6.0
Introspection Report
inspection-report-20240802_151308.tar.gz
Can you suggest a fix?
I don't have any ideas on how to solve this problem; I am really just looking for help.
Are you interested in contributing with a fix?
No, I am not able to contribute a fix at this time.