v-ctiutiu opened this issue 2 years ago
@v-ctiutiu Thank you for raising this issue. The TVK engineering team is looking into it. We will reproduce it in-house with `kube-prom-stack` and keep you updated on the progress.
We have found the root cause of the issue. It appears to be a logic error in our restore hooks for native Helm chart support. We will fix this in the upcoming patch release v2.6.4, and as of today we are updating our release notes to list this as a known issue.
@v-ctiutiu Thank you for your patience on this one. We have fixed this issue end-to-end: the Helm application is now restored the Helm way and is ready to use as soon as the restore completes. The fix is released as part of TVK 2.7.1. Here are the release notes. Let me know if you face any issues.
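As a quick way to confirm the Helm-native restore after upgrading, one can check that the restored application is tracked by Helm again. This is a hedged sketch: the release and namespace names are the ones used later in this issue and may differ in your setup.

```shell
#!/usr/bin/env bash
# Hedged sketch: after a restore with TVK >= 2.7.1, a Helm-native restore
# should leave the release visible to Helm itself.
NAMESPACE="monitoring"      # namespace from this issue; adjust as needed
RELEASE="kube-prom-stack"   # release name from this issue; adjust as needed

if command -v helm >/dev/null 2>&1; then
  # The restored release should be listed here with a 'deployed' status.
  helm list -n "$NAMESPACE"
  helm status "$RELEASE" -n "$NAMESPACE"
else
  echo "helm not found; run this against the cluster holding the restore"
fi
```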
### Problem Description

When trying to restore a full backup that includes `Prometheus` as one of the backed-up components, the `kube-prome-operator` component fails to start.

### Impacted Areas

`TrilioVault for Kubernetes` `Namespaced` or `Multi-Namespaced` restore operations.

### Prerequisites

Prometheus must be deployed in your DOKS cluster as per the Starter Kit guide.
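Before reproducing, it may help to confirm the prerequisite state. A minimal sketch, assuming the `monitoring` namespace and the `kube-prom-stack` release name from the Starter Kit (adjust if yours differ):

```shell
#!/usr/bin/env bash
# Hedged prerequisite check: is the kube-prom-stack release deployed and
# are its Pods running? Names follow the Starter Kit and may differ locally.
NAMESPACE="monitoring"
RELEASE="kube-prom-stack"

if command -v kubectl >/dev/null 2>&1; then
  # All Prometheus stack Pods should be in 'Running' state before backup.
  kubectl get pods -n "$NAMESPACE"
  # The stack should be deployed as a Helm release.
  helm status "$RELEASE" -n "$NAMESPACE" 2>/dev/null || \
    echo "release $RELEASE not found in namespace $NAMESPACE"
else
  echo "kubectl not found; run this against your DOKS cluster"
fi
```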
### Steps to Reproduce

1. Have a `Prometheus` instance running in your `DOKS` cluster.
2. Have `TrilioVault for Kubernetes` installed and configured, as described in the Installing TrilioVault for Kubernetes chapter.
3. Take a full backup of the namespace where the Prometheus stack is deployed (`monitoring` as per Starter Kit); the cluster UID can be fetched via `kubectl get ns kube-system -o jsonpath='{.metadata.uid}'`.
4. Delete the Helm release: `helm delete kube-prom-stack -n monitoring`.
5. Restore the full backup from the `S3 Target` using the TVK web management console.

### Expected Results
The `monitoring` namespace applications (including Prometheus) backup and restore process should go smoothly, without any issues. All Prometheus stack components should be up and running (`Pods`, `Services`, etc.).

### Actual Results
The restore process completes successfully, but the `Prometheus Operator` (or `kube-prome-operator`) refuses to start. Running `kubectl get pods -n monitoring` shows the operator Pod failing to come up.

Going further, issuing `kubectl describe pod/kube-prom-szubu-kube-prome-operator-8649bb7b47-9qs8j -n monitoring` reveals the cause: `kube-prome-operator` fails to find the secret named `kube-prom-szubu-kube-prome-admission`. Listing all the secrets from the `monitoring` namespace via `kubectl get secrets -n monitoring` shows that there is a secret named `kube-prom-stack-kube-prome-admission`, which seems to be the right one.

Looking at the `Prometheus Operator` deployment via `kubectl get deployment kube-prom-szubu-kube-prome-operator -o yaml`, you can notice that the secret name was changed to `kube-prom-szubu-kube-prome-admission` (TVK replaced `stack` with `szubu`).

Next, after editing the deployment via `kubectl edit deployment kube-prom-szubu-kube-prome-operator -n monitoring` and replacing the secret name with the proper one, `kube-prom-stack-kube-prome-admission`, the `Prometheus Operator` starts successfully and everything is back to normal.

After analysing everything that happened so far, it seems that `TVK` is renaming Kubernetes resources during the backup/restore process using some internal logic or naming convention, but the renaming is not applied consistently across all resources on restore.
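The inconsistency can be illustrated offline. A minimal sketch, assuming (from the observations above) that TVK substitutes the release suffix `stack` with a generated suffix such as `szubu` in the deployment spec while the restored secret keeps its original name; in a live cluster the two names would come from `kubectl get deployment ... -o yaml` and `kubectl get secrets -n monitoring`:

```shell
#!/usr/bin/env bash
# Simulates the observed mismatch: the restored deployment references a
# renamed secret, while the secret itself was restored under its original name.

existing="kube-prom-stack-kube-prome-admission"   # secret actually present
referenced="kube-prom-szubu-kube-prome-admission" # name in the deployment spec

# Applying TVK's apparent substitution (stack -> szubu) to the existing
# secret name reproduces exactly what the deployment is looking for.
renamed="${existing/stack/szubu}"
echo "renamed: $renamed"

if [ "$renamed" = "$referenced" ] && [ "$renamed" != "$existing" ]; then
  echo "MISMATCH: deployment wants '$referenced' but only '$existing' exists"
fi
```

The manual workaround applied above (editing the deployment to point back at `kube-prom-stack-kube-prome-admission`) is the inverse of this substitution.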