flant / addon-operator

A system to manage additional components for Kubernetes cluster in a simple, consistent and automated way.
https://flant.github.io/addon-operator/
Apache License 2.0

feat: applyCRDs on start main queue #497

Closed juev closed 1 month ago

juev commented 1 month ago

Overview

We want to get rid of CRD modules, like operator-prometheus-crd. Instead, addon-operator has to discover CRDs and install them automatically on startup.
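
For context, a minimal sketch of what discovering and applying a module's CRDs on startup could look like, using the apiextensions clientset. The directory layout, helper name, and error handling here are assumptions for illustration, not the PR's actual code:

```go
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"strings"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apiextensionsclientset "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"
	"sigs.k8s.io/yaml"
)

// ensureModuleCRDs walks a module's crds directory and creates every CRD
// that is not yet present in the cluster. Hypothetical helper, not the
// PR's actual implementation.
func ensureModuleCRDs(ctx context.Context, client apiextensionsclientset.Interface, crdsDir string) error {
	return filepath.WalkDir(crdsDir, func(path string, d os.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() || !strings.HasSuffix(path, ".yaml") {
			return nil
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		var crd apiextensionsv1.CustomResourceDefinition
		if err := yaml.Unmarshal(data, &crd); err != nil {
			return fmt.Errorf("parse %s: %w", path, err)
		}
		_, err = client.ApiextensionsV1().CustomResourceDefinitions().Create(ctx, &crd, metav1.CreateOptions{})
		if apierrors.IsAlreadyExists(err) {
			return nil // already installed, nothing to do
		}
		return err
	})
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := apiextensionsclientset.NewForConfigOrDie(cfg)
	// The module path below is illustrative only.
	if err := ensureModuleCRDs(context.Background(), client, "/modules/000-example/crds"); err != nil {
		panic(err)
	}
}
```

In the real operator this would run once per enabled module rather than against a single hard-coded directory.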

What this PR does / why we need it

Fixed: #9387

Special notes for your reviewer

When an error occurs during the CRD application stage, we record it, but the task is retried endlessly.

juev commented 1 month ago

How I tested

Correct module

I created a separate branch in the DH project, where I used this version of addon-operator. In this branch, I removed the call to the ensure_crds hook from a number of modules. All tests were performed with the working multitenancy-manager module.

Then I deleted the module-related CRDs from the dev cluster and restarted the cluster.

After the restart completed, I saw that the module CRDs were restored.

Then I disabled the module, removed its CRD from the dev cluster, and restarted the controller.

After the restart, the module remained disabled and its CRD was still missing, i.e. it was not processed.

Failed module

I then made changes to the DH branch that resulted in an incorrect CRD file for the module I was using.

After removing the CRD from the dev cluster and restarting the controller, the task processing queue stops and reports the error that occurred.

[Screenshots: 2024-08-26 at 15:39, 16:48, and 17:20]

The task queue stays stopped until the error in the CRD file is corrected. Alternatively, we can manually disable the module, which restarts the task queue.

juev commented 1 month ago

I could be wrong, but if my memory serves me well, the idea was to install all modules' CRDs in one step before starting the modules converge process. In this PR, we apply CRDs gradually along with processing HandleModuleRun tasks. This can still lead to a problem when a module with weight (order) 900 has custom CRDs to apply to the cluster and we have a validating webhook for the module's CRDs to catch. The webhook tries to prepare a handler, but the CRDs aren't installed in the cluster yet, causing the handler to fail.

Good remark. I missed it. I’ll correct it.
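
To make the intended ordering concrete, here is a minimal sketch of applying every enabled module's CRDs in one step before any module run begins. The Module type and its EnsureCRDs/Run fields are illustrative stand-ins, not addon-operator's real types:

```go
package main

import (
	"context"
	"fmt"
)

// Module is an illustrative stand-in for addon-operator's module type.
type Module struct {
	Name       string
	EnsureCRDs func(ctx context.Context) error // install this module's CRDs
	Run        func(ctx context.Context) error // the module's converge step
}

// startup applies every enabled module's CRDs up front, and only then
// starts the modules converge process, matching the ordering suggested
// in the review.
func startup(ctx context.Context, modules []Module) error {
	for _, m := range modules {
		if err := m.EnsureCRDs(ctx); err != nil {
			// Stop here: running modules against missing CRDs would only
			// produce cascading failures (e.g. webhook handlers failing).
			return fmt.Errorf("ensure CRDs for %s: %w", m.Name, err)
		}
	}
	for _, m := range modules {
		if err := m.Run(ctx); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	_ = startup(context.Background(), nil)
}
```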

juev commented 1 month ago

How I tested

Ensure CRDs

First, I checked that EnsureCRDs runs before any module starts. To do this, I looked at the DH logs; the corresponding lines appeared before the first module started.

[Screenshot: 2024-08-28 at 18:34]

Then I enabled the module that had the ensure_crds hook removed. After the queue had passed, all the necessary CRDs were in the cluster.

Failed CRDs

Then I checked how the queue works in case of broken CRDs in the modules.

To do this, I created broken CRDs in the DH branch and enabled this module.

[Screenshot: 2024-08-28 at 18:38]

In this case the queue stops at the stage of applying the broken CRDs; we never reach the start of the modules.

The same thing happens when you restart the controller.

If we then disable the module, the queue passes, but an active task for applying the broken CRD files remains, which makes the queue move very slowly.

[Screenshot: 2024-08-28 at 18:44]

Restarting the controller removes this task, and the queue then completes quickly and without problems.

juev commented 1 month ago

Minor update to log output (error example):

[Screenshot: 2024-08-29 at 12:05]

juev commented 1 month ago

After switching to Walk to find CRD files, testing failed.

I removed the CRDs for multitenancy-manager and enabled it. Installation stopped with a ClusterRoleBinding issue:

[Screenshot: 2024-08-31 at 20:26]
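
One plausible explanation (an assumption, not a confirmed diagnosis of this failure) is that a directory walk also picks up non-CRD manifests such as ClusterRoleBindings. A defensive filter on the parsed kind could look like this hypothetical helper:

```go
package main

import (
	"fmt"
	"strings"

	"sigs.k8s.io/yaml"
)

// isCRDManifest reports whether a YAML document declares a
// CustomResourceDefinition, so files of other kinds picked up by a
// directory walk (ClusterRoleBinding, RBAC, etc.) can be skipped.
// Hypothetical helper, not the PR's actual fix.
func isCRDManifest(data []byte) bool {
	var meta struct {
		APIVersion string `json:"apiVersion"`
		Kind       string `json:"kind"`
	}
	if err := yaml.Unmarshal(data, &meta); err != nil {
		return false
	}
	return meta.Kind == "CustomResourceDefinition" &&
		strings.HasPrefix(meta.APIVersion, "apiextensions.k8s.io/")
}

func main() {
	doc := []byte("apiVersion: apiextensions.k8s.io/v1\nkind: CustomResourceDefinition\n")
	fmt.Println(isCRDManifest(doc)) // true
}
```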