canonical / microk8s-core-addons

Core MicroK8s addons

Bug: `nvidia` addon stuck in enabled state despite unsuccessful deployment followed by explicit `microk8s disable nvidia` #286

Open abravalheri opened 1 month ago

abravalheri commented 1 month ago

Summary

When trying to enable the NVIDIA addon in MicroK8s, I encountered an error message. After attempting to disable and re-enable the addon, MicroK8s incorrectly states that the addon is already enabled, despite no related pods running in any namespace.

What Should Happen Instead?

Enabling the operator should work without errors. Disabling the operator should actually disable it, and MicroK8s should not incorrectly state that the operator is enabled when it isn't (and microk8s should not refuse to re-enable an addon based on incorrect assumptions).

Detailed Story

When I first tried to enable the addon, it produced an odd error message:

$ microk8s enable nvidia --no-network-operator --gpu-operator --gpu-operator-driver=host --gpu-operator-version=v24.6.0
Infer repository core for addon nvidia
Addon core/dns is already enabled
Addon core/helm3 is already enabled
"nvidia" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
Deploy NVIDIA GPU operator
Using host GPU driver
Error: INSTALLATION FAILED: Post "https://127.0.0.1:16443/apis/rbac.authorization.k8s.io/v1/namespaces/gpu-operator/roles?fieldManager=helm": unexpected EOF
Deployed NVIDIA GPU operator

I checked to see if any pod related to the NVIDIA GPU operator was running in any namespace, but there was nothing (no new namespace was created and I had only pods in the kube-system namespace).
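I checked with commands along the lines of the following; neither showed anything related to the GPU operator:

$ microk8s kubectl get pods --all-namespaces
$ microk8s kubectl get namespaces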

Then I decided to disable and enable it again:

$ microk8s disable nvidia
Traceback (most recent call last):
  File "/snap/microk8s/7040/scripts/wrappers/disable.py", line 44, in <module>
    disable(prog_name="microk8s disable")
  File "/snap/microk8s/7040/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/snap/microk8s/7040/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/snap/microk8s/7040/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/snap/microk8s/7040/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/snap/microk8s/7040/scripts/wrappers/disable.py", line 40, in disable
    xable("disable", addons)
  File "/snap/microk8s/7040/scripts/wrappers/common/utils.py", line 470, in xable
    protected_xable(action, addon_args)
  File "/snap/microk8s/7040/scripts/wrappers/common/utils.py", line 498, in protected_xable
    unprotected_xable(action, addon_args)
  File "/snap/microk8s/7040/scripts/wrappers/common/utils.py", line 514, in unprotected_xable
    enabled_addons_info, disabled_addons_info = get_status(available_addons_info, True)
  File "/snap/microk8s/7040/scripts/wrappers/common/utils.py", line 566, in get_status
    kube_output = kubectl_get("all,ingress")
  File "/snap/microk8s/7040/scripts/wrappers/common/utils.py", line 248, in kubectl_get
    return run(KUBECTL, "get", cmd, "--all-namespaces", die=False)
  File "/snap/microk8s/7040/scripts/wrappers/common/utils.py", line 69, in run
    result.check_returncode()
  File "/snap/microk8s/7040/usr/lib/python3.8/subprocess.py", line 448, in check_returncode
    raise CalledProcessError(self.returncode, self.args, self.stdout,
subprocess.CalledProcessError: Command '('/snap/microk8s/7040/microk8s-kubectl.wrapper', 'get', 'all,ingress', '--all-namespaces')' returned non-zero exit status 1.

This traceback was a bit unsettling, but I tried the same command again:

$ microk8s disable nvidia
Infer repository core for addon nvidia
Disabling NVIDIA support
NVIDIA support disabled

However, when I tried to enable the addon again, I got a message that it is already enabled:

$ microk8s enable nvidia --no-network-operator --gpu-operator --gpu-operator-driver=host --gpu-operator-version=v24.6.0 --force
Infer repository core for addon nvidia
Addon core/nvidia is already enabled

This seems to be a bug... and it leaves me in a kind of limbo: disabling the addon does not appear to do anything, and enabling it again does not work because microk8s believes it is already enabled.

Can anyone help me solve this problem and get the nvidia addon working?
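For instance, would it be safe to try a manual clean-up along these lines (assuming the addon installs a helm release named gpu-operator in the gpu-operator namespace, which I have not verified)?

$ microk8s helm3 list --all-namespaces
$ microk8s helm3 uninstall gpu-operator -n gpu-operator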

Reproduction Steps

Environment and hardware used:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy

$ nvidia-smi
Fri Aug  2 15:09:43 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A2                      Off | 00000000:98:00.0 Off |                    0 |
|  0%   39C    P0              20W /  60W |      0MiB / 15356MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

  1. Run microk8s enable nvidia --no-network-operator --gpu-operator --gpu-operator-driver=host --gpu-operator-version=v24.6.0
  2. Check for running pods related to the NVIDIA GPU operator in any namespace.
  3. Run microk8s disable nvidia
  4. Run microk8s enable nvidia --no-network-operator --gpu-operator --gpu-operator-driver=host --gpu-operator-version=v24.6.0

Introspection Report

inspection-report-20240802_151308.tar.gz

Can you suggest a fix?

I don't have any ideas for solving this problem; I am really just looking for help.

Are you interested in contributing with a fix?

No, I am not able to contribute a fix at this time.

abravalheri commented 1 month ago

To investigate this issue, I had a look at the microk8s code:

https://github.com/canonical/microk8s/blob/v1.30/scripts/wrappers/common/utils.py#L566-L572

There, the is_enabled function is defined as a simple substring check.

The false positive seems to originate from this check (maybe it is a bit fragile?), but possibly also from the fact that microk8s disable nvidia does not delete any of the cluster roles created during microk8s enable nvidia? (A small sketch of what I mean by "fragile" follows the listing below.)

$ kubectl get clusterroles --show-kind --no-headers
clusterrole.rbac.authorization.k8s.io/coredns                                  2024-07-31T18:38:00Z
clusterrole.rbac.authorization.k8s.io/gpu-operator                             2024-08-02T14:48:45Z
clusterrole.rbac.authorization.k8s.io/gpu-operator-node-feature-discovery      2024-08-02T14:48:45Z
clusterrole.rbac.authorization.k8s.io/gpu-operator-node-feature-discovery-gc   2024-08-02T14:48:45Z
clusterrole.rbac.authorization.k8s.io/microk8s-hostpath                        2024-07-31T15:59:53Z
clusterrole.rbac.authorization.k8s.io/multus                                   2024-07-31T18:52:46Z
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader         2024-08-01T16:23:09Z
clusterrole.rbac.authorization.k8s.io/system:kube-ovn-app                      2024-07-31T18:08:42Z
clusterrole.rbac.authorization.k8s.io/system:kube-ovn-cni                      2024-07-31T18:08:42Z
clusterrole.rbac.authorization.k8s.io/system:metrics-server                    2024-08-01T16:23:09Z
clusterrole.rbac.authorization.k8s.io/system:ovn                               2024-07-31T18:08:42Z
clusterrole.rbac.authorization.k8s.io/system:ovn-ovs                           2024-07-31T18:08:42Z
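
To make concrete what I mean by fragile, here is a minimal sketch of that kind of substring test. It is not the actual microk8s implementation, and the check string "gpu-operator" for the nvidia addon is just my assumption for illustration:

import subprocess

def kubectl_get(resources: str) -> str:
    # One `kubectl get` over all namespaces; the status check later
    # searches through this raw text output.
    result = subprocess.run(
        ["microk8s", "kubectl", "get", resources, "--all-namespaces"],
        capture_output=True,
        text=True,
    )
    return result.stdout

def is_enabled(check_string: str, kube_output: str) -> bool:
    # Plain substring test: any leftover object whose name happens to
    # contain check_string makes the addon look enabled.
    return check_string in kube_output

# Hypothetical usage: if the addon's check string were "gpu-operator",
# any surviving resource with that string in its name would be enough
# for the addon to be reported as enabled after a failed install.
if is_enabled("gpu-operator", kubectl_get("all,ingress")):
    print("addon reported as enabled")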