digitalocean / Kubernetes-Starter-Kit-Developers

Hands-on tutorial and Automation stack for an operations-ready DigitalOcean Kubernetes (DOKS) cluster.

[TVK] Restoring Prometheus from a full backup renders the kube-prome-operator unusable #88

Open v-ctiutiu opened 2 years ago

v-ctiutiu commented 2 years ago

Problem Description

When trying to restore a full backup that includes Prometheus as one of the backed up components, the kube-prome-operator component fails to start.

Impacted Areas

TrilioVault for Kubernetes Namespaced or Multi-Namespaced restore operations.

Prerequisites

Prometheus must be deployed in your DOKS cluster as per the Starter Kit guide.
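
For reference, below is a minimal install sketch (the chart and release name match what the Starter Kit uses elsewhere in this issue; the Starter Kit's own values file is omitted here and should be preferred):

# Add the upstream Helm repository and install kube-prometheus-stack under
# the release name used throughout the Starter Kit (kube-prom-stack)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prom-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace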

Steps to Reproduce

  1. First, please follow the main guide for Installing the Prometheus Stack to have a Prometheus instance running in your DOKS cluster.
  2. Then, have TrilioVault for Kubernetes installed and configured, as described in Installing TrilioVault for Kubernetes chapter.
  3. Activate a Clustered license type (the license is generated based on the kube-system namespace UID, which you can fetch via: kubectl get ns kube-system -o jsonpath='{.metadata.uid}').
  4. Next, make sure to configure and create a TVK Target for backups storage.
  5. Then, create a TVK Namespaced backup for Prometheus (default namespace is monitoring as per Starter Kit).
  6. Wait for the backup to complete successfully, then delete the Prometheus Helm release: helm delete kube-prom-stack -n monitoring (the CLI portion of these steps is consolidated in the sketch after this list).
  7. Initiate a restore directly from the S3 Target using the TVK web management console.
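
The CLI portion of the steps above boils down to roughly the following (the backups resource name used with kubectl is an assumption and may differ per TVK version; the restore in step 7 is done from the web console and is not scripted here):

# Step 3: fetch the kube-system namespace UID required for the Clustered license
kubectl get ns kube-system -o jsonpath='{.metadata.uid}'

# Steps 5-6: check that the TVK backup reports a successful status,
# then remove the Prometheus Helm release from the cluster
kubectl get backups -n monitoring
helm delete kube-prom-stack -n monitoring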

Expected Results

The backup and restore process for the monitoring namespace applications (including Prometheus) should complete without any issues. All Prometheus stack components should be up and running (Pods, Services, etc.).

Actual Results

The restore process completes successfully, but the Prometheus Operator (kube-prome-operator) refuses to start. Running kubectl get pods -n monitoring yields:

NAME                                                   READY   STATUS              RESTARTS   AGE
kube-prom-szubu-grafana-5754d5b7b7-v97v2               2/2     Running             0          16m
kube-prom-szubu-kube-prome-operator-8649bb7b47-9qs8j   0/1     ContainerCreating   0          16m
kube-prom-szubu-kube-state-metrics-7f6f67d67f-8zfkh    1/1     Running             0          16m
kube-prom-szubu-prometheus-node-exporter-dlb44         1/1     Running             0          16m
kube-prom-szubu-prometheus-node-exporter-wktv7         1/1     Running             0          16m

Going further, issuing kubectl describe pod/kube-prom-szubu-kube-prome-operator-8649bb7b47-9qs8j -n monitoring yields:

Events:
  Type     Reason       Age                   From               Message
  ----     ------       ----                  ----               -------
  Normal   Scheduled    6m6s                  default-scheduler  Successfully assigned monitoring/kube-prom-szubu-kube-prome-operator-8649bb7b47-9qs8j to flux-test-mt-pool-ug7di
  Warning  FailedMount  116s (x10 over 6m6s)  kubelet            MountVolume.SetUp failed for volume "tls-secret" : secret "kube-prom-szubu-kube-prome-admission" not found
  Warning  FailedMount  106s (x2 over 4m3s)   kubelet            Unable to attach or mount volumes: unmounted volumes=[tls-secret], unattached volumes=[tls-secret kube-api-access-bngnb]: timed out waiting for the condition

It seems that kube-prome-operator fails to find the secret named kube-prom-szubu-kube-prome-admission. Listing all the secrets from the monitoring namespace via kubectl get secrets -n monitoring yields the following (notice that there's a secret named kube-prom-stack-kube-prome-admission, which seems to be the right one):

NAME                                                   TYPE                                  DATA   AGE
alertmanager-kube-prom-szubu-kube-prome-alertmanager   Opaque                                1      19m
default-token-tsjk5                                    kubernetes.io/service-account-token   3      98m
kube-prom-stack-kube-prome-admission                   Opaque                                3      97m
kube-prom-szubu-grafana                                Opaque                                3      19m
...

Looking at the Prometheus Operator deployment via kubectl get deployment kube-prom-szubu-kube-prome-operator -n monitoring -o yaml, you can see that the secret name was changed to kube-prom-szubu-kube-prome-admission (TVK replaced stack with szubu):

...
      volumes:
      - name: tls-secret
        secret:
          defaultMode: 420
          secretName: kube-prom-szubu-kube-prome-admission
...
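
A quick way to pull just the referenced secret name out of the restored Deployment (the jsonpath filter below is written against the volume layout shown above):

# Print the secretName that the tls-secret volume points at
kubectl get deployment kube-prom-szubu-kube-prome-operator -n monitoring \
  -o jsonpath='{.spec.template.spec.volumes[?(@.name=="tls-secret")].secret.secretName}'

For the restored Deployment above, this prints kube-prom-szubu-kube-prome-admission, a secret which does not exist in the monitoring namespace.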

Next, after editing the deployment via kubectl edit deployment kube-prom-szubu-kube-prome-operator -n monitoring and replacing the secret name with the proper one (kube-prom-stack-kube-prome-admission), the Prometheus Operator starts successfully:

kubectl get pods -n monitoring

The output looks like below:

NAME                                                     READY   STATUS    RESTARTS   AGE
alertmanager-kube-prom-szubu-kube-prome-alertmanager-0   2/2     Running   0          3m42s
kube-prom-szubu-grafana-5754d5b7b7-v97v2                 2/2     Running   0          33m
kube-prom-szubu-kube-prome-operator-bdb6bc8d-4rn9m       1/1     Running   0          3m44s
kube-prom-szubu-kube-state-metrics-7f6f67d67f-8zfkh      1/1     Running   0          33m
kube-prom-szubu-prometheus-node-exporter-dlb44           1/1     Running   0          33m
kube-prom-szubu-prometheus-node-exporter-wktv7           1/1     Running   0          33m
prometheus-kube-prom-szubu-kube-prome-prometheus-0       2/2     Running   0          3m42s

Everything seems back to normal now, as seen above.
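
As a non-interactive alternative to kubectl edit, the same manual workaround can be applied with a strategic merge patch (a sketch of the fix applied above, not something TVK performs automatically):

# Point the tls-secret volume back at the original admission secret
kubectl patch deployment kube-prom-szubu-kube-prome-operator -n monitoring \
  -p '{"spec":{"template":{"spec":{"volumes":[{"name":"tls-secret","secret":{"secretName":"kube-prom-stack-kube-prome-admission"}}]}}}}'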

After analysing everything that happened so far, it seems that TVK renames Kubernetes resources during the backup/restore process using some internal logic or naming convention, but the renaming is not applied consistently on restore: the restored Deployment references a renamed secret (kube-prom-szubu-kube-prome-admission) that is never created, while only the original kube-prom-stack-kube-prome-admission secret exists in the namespace.
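
To surface any other dangling references left behind by the rename, a small check like the one below can be run against the restored namespace (a diagnostic sketch, not part of TVK; it only inspects secret-backed volumes on Deployments):

# List every secret referenced by Deployment volumes in the monitoring
# namespace and flag the ones that do not actually exist
kubectl get deployments -n monitoring -o json \
  | grep -o '"secretName": *"[^"]*"' | awk -F'"' '{print $4}' | sort -u \
  | while read -r s; do
      kubectl get secret "$s" -n monitoring > /dev/null 2>&1 || echo "missing secret: $s"
    done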

trilio-bhagirath commented 2 years ago

@v-ctiutiu Thank you for raising this issue. The TVK Engineering team is looking into it. We will reproduce it in-house with kube-prom-stack and keep you updated with the progress.

trilio-bhagirath commented 2 years ago

We have found the root cause of the issue. It seems to be a logic error in our restore hooks for native Helm chart support. We will fix this in the upcoming patch release v2.6.4, and as of today we will update our release notes to list this as a known issue.

bhagirathhapse commented 2 years ago

@v-ctiutiu Thank you for your patience on this one. We have fixed this issue end-to-end: the Helm application is now restored the Helm way and is ready to use after the restore completes. This fix is released as part of the TVK 2.7.1 release. Here are the release notes. Let me know if you face any issues.