clastix / kamaji

Kamaji is the Hosted Control Plane Manager for Kubernetes.
https://kamaji.clastix.io
Apache License 2.0

Kamaji broken after namespace removal #491

Closed gecube closed 1 month ago

gecube commented 1 month ago
2024-07-15T11:19:08Z    ERROR   controller-runtime.source.EventHandler  failed to get informer from cache   {"error": "failed to get API group resources: unable to retrieve the complete list of server APIs: admissionregistration.k8s.io/v1: Get \"https://kubernetes-test0.tenant-test0.svc:6443/apis/admissionregistration.k8s.io/v1?timeout=10s\": context deadline exceeded"}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/source/kind.go:68
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
    /go/pkg/mod/k8s.io/apimachinery@v0.30.1/pkg/util/wait/loop.go:53
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
    /go/pkg/mod/k8s.io/apimachinery@v0.30.1/pkg/util/wait/loop.go:54
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
    /go/pkg/mod/k8s.io/apimachinery@v0.30.1/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/source/kind.go:56
2024-07-15T11:19:09Z    INFO    starting CertificateLifecycle handling  {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test5-konnectivity-certificate","namespace":"tenant-leotolstoi"}, "namespace": "tenant-leotolstoi", "name": "kubernetes-test5-konnectivity-certificate", "reconcileID": "ade0284e-a277-4c70-848a-5138b70dddfb"}
2024-07-15T11:19:09Z    INFO    certificate is still valid, enqueuing back  {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test5-konnectivity-certificate","namespace":"tenant-leotolstoi"}, "namespace": "tenant-leotolstoi", "name": "kubernetes-test5-konnectivity-certificate", "reconcileID": "ade0284e-a277-4c70-848a-5138b70dddfb", "after": "87460h59m37.007380834s"}
{"level":"warn","ts":"2024-07-15T11:19:09.490703Z","logger":"etcd-client","caller":"v3@v3.5.10/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000d82380/etcd-0.etcd-headless.tenant-root.svc:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
2024-07-15T11:19:11Z    INFO    starting CertificateLifecycle handling  {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test5-front-proxy-client-certificate","namespace":"tenant-leotolstoi"}, "namespace": "tenant-leotolstoi", "name": "kubernetes-test5-front-proxy-client-certificate", "reconcileID": "b6da9941-a533-461d-bd5d-bb5245a328c5"}
2024-07-15T11:19:12Z    INFO    certificate is still valid, enqueuing back  {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test5-front-proxy-client-certificate","namespace":"tenant-leotolstoi"}, "namespace": "tenant-leotolstoi", "name": "kubernetes-test5-front-proxy-client-certificate", "reconcileID": "b6da9941-a533-461d-bd5d-bb5245a328c5", "after": "8572h56m3.319115679s"}
2024-07-15T11:19:13Z    INFO    starting CertificateLifecycle handling  {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test0-api-server-certificate","namespace":"tenant-test0"}, "namespace": "tenant-test0", "name": "kubernetes-test0-api-server-certificate", "reconcileID": "9455cd48-738a-4578-93a4-8aaff668e902"}
{"level":"warn","ts":"2024-07-15T11:19:11.887367Z","logger":"etcd-client","caller":"v3@v3.5.10/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000d82380/etcd-0.etcd-headless.tenant-root.svc:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
2024-07-15T11:19:14Z    INFO    certificate is still valid, enqueuing back  {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test0-api-server-certificate","namespace":"tenant-test0"}, "namespace": "tenant-test0", "name": "kubernetes-test0-api-server-certificate", "reconcileID": "9455cd48-738a-4578-93a4-8aaff668e902", "after": "7438h49m57.410941272s"}
2024-07-15T11:19:13Z    ERROR   unable to delete datastore data {"controller": "tenantcontrolplane", "controllerGroup": "kamaji.clastix.io", "controllerKind": "TenantControlPlane", "TenantControlPlane": {"name":"kubernetes-test0","namespace":"tenant-georg"}, "namespace": "tenant-georg", "name": "kubernetes-test0", "reconcileID": "e1226656-9803-4468-87f4-735077b3dd88", "resource": "datastore-setup", "error": "unable to delete the datastore: cannot delete database: context deadline exceeded", "errorVerbose": "context deadline exceeded\ncannot delete database\ngithub.com/clastix/kamaji/internal/datastore/errors.NewCannotDeleteDatabaseError\n\t/workspace/internal/datastore/errors/errors.go:29\ngithub.com/clastix/kamaji/internal/datastore.(*EtcdClient).DeleteDB\n\t/workspace/internal/datastore/etcd.go:133\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).deleteDB\n\t/workspace/internal/resources/datastore/datastore_setup.go:211\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).Delete\n\t/workspace/internal/resources/datastore/datastore_setup.go:146\ngithub.com/clastix/kamaji/internal/resources.HandleDeletion\n\t/workspace/internal/resources/resource.go:88\ngithub.com/clastix/kamaji/controllers.(*TenantControlPlaneReconciler).Reconcile\n\t/workspace/controllers/tenantcontrolplane_controller.go:151\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:123\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:270\nsigs.k8s.io/controlle
r-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:231\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695\nunable to delete the datastore\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).deleteDB\n\t/workspace/internal/resources/datastore/datastore_setup.go:212\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).Delete\n\t/workspace/internal/resources/datastore/datastore_setup.go:146\ngithub.com/clastix/kamaji/internal/resources.HandleDeletion\n\t/workspace/internal/resources/resource.go:88\ngithub.com/clastix/kamaji/controllers.(*TenantControlPlaneReconciler).Reconcile\n\t/workspace/controllers/tenantcontrolplane_controller.go:151\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:123\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:270\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:231\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695"}
github.com/clastix/kamaji/internal/resources/datastore.(*Setup).Delete
    /workspace/internal/resources/datastore/datastore_setup.go:147
github.com/clastix/kamaji/internal/resources.HandleDeletion
    /workspace/internal/resources/resource.go:88
github.com/clastix/kamaji/controllers.(*TenantControlPlaneReconciler).Reconcile
    /workspace/controllers/tenantcontrolplane_controller.go:151
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:123
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:320
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:270
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:231
2024-07-15T11:19:15Z    INFO    starting CertificateLifecycle handling  {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test0-api-server-kubelet-client-certificate","namespace":"tenant-test0"}, "namespace": "tenant-test0", "name": "kubernetes-test0-api-server-kubelet-client-certificate", "reconcileID": "1df91c30-3fbe-4d91-8796-d79c011b8c18"}
2024-07-15T11:19:15Z    ERROR   resource deletion failed    {"controller": "tenantcontrolplane", "controllerGroup": "kamaji.clastix.io", "controllerKind": "TenantControlPlane", "TenantControlPlane": {"name":"kubernetes-test0","namespace":"tenant-georg"}, "namespace": "tenant-georg", "name": "kubernetes-test0", "reconcileID": "e1226656-9803-4468-87f4-735077b3dd88", "resource": "datastore-setup", "error": "unable to delete the datastore: cannot delete database: context deadline exceeded", "errorVerbose": "context deadline exceeded\ncannot delete database\ngithub.com/clastix/kamaji/internal/datastore/errors.NewCannotDeleteDatabaseError\n\t/workspace/internal/datastore/errors/errors.go:29\ngithub.com/clastix/kamaji/internal/datastore.(*EtcdClient).DeleteDB\n\t/workspace/internal/datastore/etcd.go:133\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).deleteDB\n\t/workspace/internal/resources/datastore/datastore_setup.go:211\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).Delete\n\t/workspace/internal/resources/datastore/datastore_setup.go:146\ngithub.com/clastix/kamaji/internal/resources.HandleDeletion\n\t/workspace/internal/resources/resource.go:88\ngithub.com/clastix/kamaji/controllers.(*TenantControlPlaneReconciler).Reconcile\n\t/workspace/controllers/tenantcontrolplane_controller.go:151\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:123\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:270\nsigs.k8s.io/controller-ru
ntime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:231\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695\nunable to delete the datastore\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).deleteDB\n\t/workspace/internal/resources/datastore/datastore_setup.go:212\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).Delete\n\t/workspace/internal/resources/datastore/datastore_setup.go:146\ngithub.com/clastix/kamaji/internal/resources.HandleDeletion\n\t/workspace/internal/resources/resource.go:88\ngithub.com/clastix/kamaji/controllers.(*TenantControlPlaneReconciler).Reconcile\n\t/workspace/controllers/tenantcontrolplane_controller.go:151\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:123\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:270\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:231\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695"}
github.com/clastix/kamaji/controllers.(*TenantControlPlaneReconciler).Reconcile
    /workspace/controllers/tenantcontrolplane_controller.go:152
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:123
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:320
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:270
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:231
2024-07-15T11:19:15Z    INFO    certificate is still valid, enqueuing back  {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test0-api-server-kubelet-client-certificate","namespace":"tenant-test0"}, "namespace": "tenant-test0", "name": "kubernetes-test0-api-server-kubelet-client-certificate", "reconcileID": "1df91c30-3fbe-4d91-8796-d79c011b8c18", "after": "7438h49m56.211676939s"}
2024-07-15T11:19:16Z    INFO    starting CertificateLifecycle handling  {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test5-admin-kubeconfig","namespace":"tenant-leotolstoi"}, "namespace": "tenant-leotolstoi", "name": "kubernetes-test5-admin-kubeconfig", "reconcileID": "7cbc598a-d89e-4ce6-9082-35d2c11a0d5f"}
2024-07-15T11:19:19Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "validatingwebhookconfiguration", "controllerGroup": "admissionregistration.k8s.io", "controllerKind": "ValidatingWebhookConfiguration", "source": "kind source: *v1.ValidatingWebhookConfiguration"}
2024-07-15T11:19:18Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "daemonset", "controllerGroup": "apps", "controllerKind": "DaemonSet", "source": "kind source: *v1.DaemonSet"}
2024-07-15T11:19:19Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "daemonset", "controllerGroup": "apps", "controllerKind": "DaemonSet", "source": "kind source: *v1.ServiceAccount"}
2024-07-15T11:19:19Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.ClusterRoleBinding"}
2024-07-15T11:19:18Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap", "source": "kind source: *v1.ConfigMap"}
2024-07-15T11:19:19Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap", "source": "channel source: 0xc001b8ee00"}
2024-07-15T11:19:20Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "validatingwebhookconfiguration", "controllerGroup": "admissionregistration.k8s.io", "controllerKind": "ValidatingWebhookConfiguration", "source": "channel source: 0xc001afcb40"}
2024-07-15T11:19:20Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting Controller {"controller": "validatingwebhookconfiguration", "controllerGroup": "admissionregistration.k8s.io", "controllerKind": "ValidatingWebhookConfiguration"}
2024-07-15T11:19:20Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap", "source": "kind source: *v1.ConfigMap"}
2024-07-15T11:19:20Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap", "source": "channel source: 0xc001b8e840"}
2024-07-15T11:19:20Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting Controller {"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap"}
2024-07-15T11:19:20Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "source": "kind source: *v1.Secret"}
2024-07-15T11:19:20Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "source": "channel source: 0xc001b8f600"}
2024-07-15T11:19:20Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting Controller {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret"}
2024-07-15T11:19:19Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting Controller {"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap"}
2024-07-15T11:19:20Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.ClusterRoleBinding"}
2024-07-15T11:19:21Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.ClusterRoleBinding"}
2024-07-15T11:19:21Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "daemonset", "controllerGroup": "apps", "controllerKind": "DaemonSet", "source": "kind source: *v1.ClusterRoleBinding"}
2024-07-15T11:19:19Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.ClusterRole"}
2024-07-15T11:19:21Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "channel source: 0xc001b8ff00"}
2024-07-15T11:19:21Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting Controller {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding"}
2024-07-15T11:19:22Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.ServiceAccount"}
2024-07-15T11:19:22Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "daemonset", "controllerGroup": "apps", "controllerKind": "DaemonSet", "source": "channel source: 0xc001afd300"}
2024-07-15T11:19:22Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.ServiceAccount"}
2024-07-15T11:19:22Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting Controller {"controller": "daemonset", "controllerGroup": "apps", "controllerKind": "DaemonSet"}
2024-07-15T11:19:22Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.Service"}
2024-07-15T11:19:23Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.ConfigMap"}
2024-07-15T11:19:22Z    INFO    certificate is still valid, enqueuing back  {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test5-admin-kubeconfig","namespace":"tenant-leotolstoi"}, "namespace": "tenant-leotolstoi", "name": "kubernetes-test5-admin-kubeconfig", "reconcileID": "7cbc598a-d89e-4ce6-9082-35d2c11a0d5f", "after": "8572h56m10.304161855s"}
2024-07-15T11:19:22Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.Role"}
2024-07-15T11:19:23Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.RoleBinding"}
2024-07-15T11:19:23Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.Deployment"}
2024-07-15T11:19:24Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.ConfigMap"}
2024-07-15T11:19:24Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "channel source: 0xc001b8e380"}
2024-07-15T11:19:24Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting Controller {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding"}
2024-07-15T11:19:24Z    INFO    starting CertificateLifecycle handling  {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test0-datastore-certificate","namespace":"tenant-test0"}, "namespace": "tenant-test0", "name": "kubernetes-test0-datastore-certificate", "reconcileID": "b8c0393d-c525-4a20-94d3-888de01f3b99"}
2024-07-15T11:19:25Z    ERROR   Reconciler error    {"controller": "tenantcontrolplane", "controllerGroup": "kamaji.clastix.io", "controllerKind": "TenantControlPlane", "TenantControlPlane": {"name":"kubernetes-test0","namespace":"tenant-georg"}, "namespace": "tenant-georg", "name": "kubernetes-test0", "reconcileID": "e1226656-9803-4468-87f4-735077b3dd88", "error": "unable to delete the datastore: cannot delete database: context deadline exceeded", "errorVerbose": "context deadline exceeded\ncannot delete database\ngithub.com/clastix/kamaji/internal/datastore/errors.NewCannotDeleteDatabaseError\n\t/workspace/internal/datastore/errors/errors.go:29\ngithub.com/clastix/kamaji/internal/datastore.(*EtcdClient).DeleteDB\n\t/workspace/internal/datastore/etcd.go:133\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).deleteDB\n\t/workspace/internal/resources/datastore/datastore_setup.go:211\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).Delete\n\t/workspace/internal/resources/datastore/datastore_setup.go:146\ngithub.com/clastix/kamaji/internal/resources.HandleDeletion\n\t/workspace/internal/resources/resource.go:88\ngithub.com/clastix/kamaji/controllers.(*TenantControlPlaneReconciler).Reconcile\n\t/workspace/controllers/tenantcontrolplane_controller.go:151\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:123\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:270\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Control
ler).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:231\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695\nunable to delete the datastore\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).deleteDB\n\t/workspace/internal/resources/datastore/datastore_setup.go:212\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).Delete\n\t/workspace/internal/resources/datastore/datastore_setup.go:146\ngithub.com/clastix/kamaji/internal/resources.HandleDeletion\n\t/workspace/internal/resources/resource.go:88\ngithub.com/clastix/kamaji/controllers.(*TenantControlPlaneReconciler).Reconcile\n\t/workspace/controllers/tenantcontrolplane_controller.go:151\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:123\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:270\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:231\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:333
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:270
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/controller/controller.go:231
2024-07-15T11:19:26Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.DaemonSet"}
2024-07-15T11:19:26Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting EventSource    {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "channel source: 0xc001afdbc0"}
2024-07-15T11:19:26Z    INFO    soot_tenant-leotolstoi_kubernetes-test5 Starting Controller {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding"}
2024-07-15T11:19:27Z    INFO    certificate is still valid, enqueuing back  {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test0-datastore-certificate","namespace":"tenant-test0"}, "namespace": "tenant-test0", "name": "kubernetes-test0-datastore-certificate", "reconcileID": "b8c0393d-c525-4a20-94d3-888de01f3b99", "after": "1842h32m50.20592689s"}
2024-07-15T11:19:29Z    INFO    starting CertificateLifecycle handling  {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test5-scheduler-kubeconfig","namespace":"tenant-leotolstoi"}, "namespace": "tenant-leotolstoi", "name": "kubernetes-test5-scheduler-kubeconfig", "reconcileID": "6dbf9d38-869a-40f4-b1fc-c206bf6a8097"}
2024-07-15T11:19:32Z    INFO    certificate is still valid, enqueuing back  {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test5-scheduler-kubeconfig","namespace":"tenant-leotolstoi"}, "namespace": "tenant-leotolstoi", "name": "kubernetes-test5-scheduler-kubeconfig", "reconcileID": "6dbf9d38-869a-40f4-b1fc-c206bf6a8097", "after": "8572h57m15.009970538s"}
2024-07-15T11:19:32Z    ERROR   controller-runtime.source.EventHandler  failed to get informer from cache   {"error": "failed to get API group resources: unable to retrieve the complete list of server APIs: apps/v1: Get \"https://kubernetes-test5.tenant-leotolstoi.svc:6443/apis/apps/v1?timeout=10s\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/source/kind.go:68
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
    /go/pkg/mod/k8s.io/apimachinery@v0.30.1/pkg/util/wait/loop.go:53
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
    /go/pkg/mod/k8s.io/apimachinery@v0.30.1/pkg/util/wait/loop.go:54
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
    /go/pkg/mod/k8s.io/apimachinery@v0.30.1/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.1-0.20240416095710-67b27f27e514/pkg/internal/source/kind.go:56
2024-07-15T11:19:33Z    INFO    starting CertificateLifecycle handling  {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test0-controller-manager-kubeconfig","namespace":"tenant-test0"}, "namespace": "tenant-test0", "name": "kubernetes-test0-controller-manager-kubeconfig", "reconcileID": "f1f2aaa1-2320-4565-b8c3-056465a8a2c6"}
kubectl get pods -n cozy-kamaji                                             
NAME                     READY   STATUS             RESTARTS          AGE
kamaji-c7448f786-v6cmt   0/1     CrashLoopBackOff   388 (2m14s ago)   6d19h

Details will follow below, @kvaps.

prometherion commented 1 month ago

I'm a bit confused here, @gecube. May I ask for a precise way to replicate this, and which Namespace has been deleted?

kvaps commented 1 month ago

Hi @prometherion, I'm currently investigating this issue.

I found that the namespace is stuck in the Terminating state:

NAME             STATUS        AGE
tenant-georg03   Terminating   28h

In the describe output I can see that this is because of a Kamaji finalizer:

Status:       Terminating
Conditions:
  Type                                         Status  LastTransitionTime               Reason                Message
  ----                                         ------  ------------------               ------                -------
  NamespaceDeletionDiscoveryFailure            False   Mon, 15 Jul 2024 12:48:25 +0200  ResourcesDiscovered   All resources successfully discovered
  NamespaceDeletionGroupVersionParsingFailure  False   Mon, 15 Jul 2024 12:48:25 +0200  ParsedGroupVersions   All legacy kube types successfully parsed
  NamespaceDeletionContentFailure              False   Mon, 15 Jul 2024 12:48:55 +0200  ContentDeleted        All content successfully deleted, may be waiting on finalization
  NamespaceContentRemaining                    True    Mon, 15 Jul 2024 12:48:25 +0200  SomeResourcesRemain   Some resources are remaining: secrets. has 1 resource instances
  NamespaceFinalizersRemaining                 True    Mon, 15 Jul 2024 12:48:25 +0200  SomeFinalizersRemain  Some content in the namespace has finalizers remaining: finalizer.kamaji.clastix.io/datastore-secret in 1 resource instances

Inside this namespace I can see that the secret has not been deleted:

NAME                                      NAMESPACE       AGE
secret/kubernetes-test0-datastore-config  tenant-georg03  27h

apiVersion: v1
data:
  DB_CONNECTION_STRING: ""
  DB_PASSWORD: <redacted>
  DB_SCHEMA: <redacted>
  DB_USER: <redacted>
kind: Secret
metadata:
  annotations:
    kamaji.clastix.io/checksum: b476dd8320d286bd6ef6fdf0bde47c42
  creationTimestamp: "2024-07-14T08:09:43Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2024-07-14T08:10:39Z"
  finalizers:
  - finalizer.kamaji.clastix.io/datastore-secret
  labels:
    kamaji.clastix.io/component: datastore-config
    kamaji.clastix.io/name: kubernetes-test0
    kamaji.clastix.io/project: kamaji
  name: kubernetes-test0-datastore-config
  namespace: tenant-georg03
  ownerReferences:
  - apiVersion: kamaji.clastix.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: TenantControlPlane
    name: kubernetes-test0
    uid: 424d8606-235b-4ed6-9706-7497bf97f194
  resourceVersion: "108976218"
  uid: c9a61283-8283-472c-bdf7-f59fb6b3631d
type: Opaque
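
The manifest above shows the pattern behind the stuck namespace: `deletionTimestamp` is set, yet the object survives because `finalizers` is not empty. As a minimal sketch (the helper name `blocking_finalizers` is hypothetical, and the dict is a trimmed copy of the Secret above rather than live API output), the check could look like this:

```python
from typing import Any


def blocking_finalizers(obj: dict[str, Any]) -> list[str]:
    """Return the finalizers that keep an already-deleted object alive.

    An object whose metadata.deletionTimestamp is set is only removed by
    the API server once metadata.finalizers is empty.
    """
    meta = obj.get("metadata", {})
    if not meta.get("deletionTimestamp"):
        return []  # not being deleted, so nothing is blocked
    return list(meta.get("finalizers") or [])


# Trimmed copy of the Secret shown above.
secret = {
    "kind": "Secret",
    "metadata": {
        "name": "kubernetes-test0-datastore-config",
        "namespace": "tenant-georg03",
        "deletionTimestamp": "2024-07-14T08:10:39Z",
        "finalizers": ["finalizer.kamaji.clastix.io/datastore-secret"],
    },
}

print(blocking_finalizers(secret))
```

Any non-empty result means some controller still has to finish its clean-up and remove its finalizer before the namespace can go away.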

kvaps commented 1 month ago

The Kamaji logs show only this:

2024-07-15T12:02:27Z    INFO    resource have been deleted, skipping    {"controller": "tenantcontrolplane", "controllerGroup": "kamaji.clastix.io", "controllerKind": "TenantControlPlane", "TenantControlPlane": {"name":"kubernetes-test0","namespace":"tenant-georg03"}, "namespace": "tenant-georg03", "name": "kubernetes-test0", "reconcileID": "cafc76d2-0041-4e7e-ae70-085214c06675"}

gecube commented 1 month ago

Some additional details. The issue was observed on a https://github.com/aenix-io/cozystack installation. The actions that led to the issue:

kubectl get helmrelease -n tenant-georg25
NAME                               AGE   READY   STATUS
clickhouse-test4                   61m   True    Helm install succeeded for release tenant-georg25/clickhouse-test4.v1 with chart clickhouse@0.2.1
copy-kafka-secret                  61m   False   dependency 'tenant-georg25/kubernetes-test4' is not ready
fluxcd-test4                       61m           
kafka-test4                        61m   True    Helm install succeeded for release tenant-georg25/kafka-test4.v1 with chart kafka@0.2.0
kubernetes-test4                   61m   False   Helm install failed for release tenant-georg25/kubernetes-test4 with chart kubernetes@0.6.0: client rate limiter Wait returned an error: context deadline exceeded
kubernetes-test4-cert-manager      61m   False   dependency 'tenant-georg25/kubernetes-test4' is not ready
kubernetes-test4-cilium            61m   False   dependency 'tenant-georg25/kubernetes-test4' is not ready
kubernetes-test4-csi               61m   False   dependency 'tenant-georg25/kubernetes-test4' is not ready
kubernetes-test4-fluxcd            61m   False   dependency 'tenant-georg25/kubernetes-test4' is not ready
kubernetes-test4-fluxcd-operator   61m   False   dependency 'tenant-georg25/kubernetes-test4' is not ready
stand-25                           61m   True    Helm install succeeded for release tenant-georg25/stand-25.v1 with chart helmreleases@1.0.5
kubectl describe helmrelease kubernetes-test4 -n tenant-georg25
Name:         kubernetes-test4
Namespace:    tenant-georg25
Labels:       app.kubernetes.io/managed-by=Helm
              helm.toolkit.fluxcd.io/name=stand-25
              helm.toolkit.fluxcd.io/namespace=tenant-georg25
Annotations:  meta.helm.sh/release-name: stand-25
              meta.helm.sh/release-namespace: tenant-georg25
API Version:  helm.toolkit.fluxcd.io/v2
Kind:         HelmRelease
Metadata:
  Creation Timestamp:  2024-07-15T11:13:21Z
  Finalizers:
    finalizers.fluxcd.io
  Generation:        1
  Resource Version:  110739693
  UID:               9b671d1a-9def-40b1-852b-fb2cf7158ce0
Spec:
  Chart:
    Spec:
      Chart:               kubernetes
      Reconcile Strategy:  ChartVersion
      Source Ref:
        Kind:       HelmRepository
        Name:       cozystack-apps
        Namespace:  cozy-public
      Version:      0.6.0
  Interval:         1m0s
  Release Name:     kubernetes-test4
  Values:
    Addons:
      Cert Manager:
        Enabled:  true
      Fluxcd:
        Enabled:  true
    Control Plane:
      Replicas:  2
    Host:        
    Node Groups:
      md0:
        Max Replicas:  3
        Min Replicas:  0
        Resources:
          Cpu:     4
          Memory:  8192Mi
Status:
  Conditions:
    Last Transition Time:  2024-07-15T11:18:24Z
    Message:               Failed to install after 1 attempt(s)
    Observed Generation:   1
    Reason:                RetriesExceeded
    Status:                True
    Type:                  Stalled
    Last Transition Time:  2024-07-15T11:18:23Z
    Message:               Helm install failed for release tenant-georg25/kubernetes-test4 with chart kubernetes@0.6.0: client rate limiter Wait returned an error: context deadline exceeded
    Observed Generation:   1
    Reason:                InstallFailed
    Status:                False
    Type:                  Ready
    Last Transition Time:  2024-07-15T11:18:23Z
    Message:               Helm install failed for release tenant-georg25/kubernetes-test4 with chart kubernetes@0.6.0: client rate limiter Wait returned an error: context deadline exceeded
    Observed Generation:   1
    Reason:                InstallFailed
    Status:                False
    Type:                  Released
  Failures:                1
  Helm Chart:              cozy-public/tenant-georg25-kubernetes-test4
  History:
    App Version:                  1.30.1
    Chart Name:                   kubernetes
    Chart Version:                0.6.0
    Config Digest:                sha256:ffeab28c6a02570626b22a3fc541f433f86ffc8c95b4b7500b8fce016871bc89
    Digest:                       sha256:0f0026055b565f13bc594fc5c829e93fafe166a66f8aa5c36b94a45751c3fc10
    First Deployed:               2024-07-15T11:13:23Z
    Last Deployed:                2024-07-15T11:13:23Z
    Name:                         kubernetes-test4
    Namespace:                    tenant-georg25
    Status:                       failed
    Version:                      1
  Install Failures:               1
  Last Attempted Config Digest:   sha256:ffeab28c6a02570626b22a3fc541f433f86ffc8c95b4b7500b8fce016871bc89
  Last Attempted Generation:      1
  Last Attempted Release Action:  install
  Last Attempted Revision:        0.6.0
  Observed Generation:            1
  Storage Namespace:              tenant-georg25
Events:
  Type     Reason         Age   From             Message
  ----     ------         ----  ----             -------
  Warning  InstallFailed  56m   helm-controller  Helm install failed for release tenant-georg25/kubernetes-test4 with chart kubernetes@0.6.0: client rate limiter Wait returned an error: context deadline exceeded

Last Helm logs:

2024-07-15T11:13:23.739042656Z: Starting delete for "kubernetes-test4-flux-teardown" Role
2024-07-15T11:13:23.741742858Z: Ignoring delete failure for "kubernetes-test4-flux-teardown" rbac.authorization.k8s.io/v1, Kind=Role: roles.rbac.authorization.k8s.io "kubernetes-test4-flux-teardown" not found
2024-07-15T11:13:23.741754738Z: beginning wait for 1 resources to be deleted with timeout of 5m0s
2024-07-15T11:13:23.761647555Z: creating 1 resource(s)
2024-07-15T11:13:23.765293151Z: Starting delete for "kubernetes-test4-flux-teardown" Role
2024-07-15T11:13:23.769583372Z: beginning wait for 1 resources to be deleted with timeout of 5m0s
2024-07-15T11:13:23.77429976Z: creating 27 resource(s)
2024-07-15T11:13:23.888462685Z: beginning wait for 27 resources with timeout of 5m0s
2024-07-15T11:13:23.894387715Z: Deployment is not ready: tenant-georg25/kubernetes-test4-cluster-autoscaler. 0 out of 1 expected pods are ready
2024-07-15T11:18:21.896596776Z: Deployment is not ready: tenant-georg25/kubernetes-test4-cluster-autoscaler. 0 out of 1 expected pods are ready (148 duplicate lines omitted)

Checking the pods, I found that they could not start because the kubeconfig secret is missing:

kubectl get -n tenant-georg25 all
Warning: kubevirt.io/v1 VirtualMachineInstancePresets is now deprecated and will be removed in v2.
NAME                                                       READY   STATUS              RESTARTS   AGE
pod/chi-clickhouse-test4-clickhouse-0-0-0                  1/1     Running             0          9m
pod/kafka-test4-entity-operator-6c765b8f96-pzf9p           2/2     Running             0          7m22s
pod/kafka-test4-kafka-0                                    1/1     Running             0          8m24s
pod/kafka-test4-kafka-1                                    1/1     Running             0          8m24s
pod/kafka-test4-kafka-2                                    1/1     Running             0          8m24s
pod/kafka-test4-zookeeper-0                                1/1     Running             0          9m2s
pod/kafka-test4-zookeeper-1                                1/1     Running             0          9m2s
pod/kafka-test4-zookeeper-2                                1/1     Running             0          9m2s
pod/kubernetes-test4-cluster-autoscaler-598b659b6c-tfrll   0/1     ContainerCreating   0          9m3s
pod/kubernetes-test4-kccm-8445bbb6bb-bmwp2                 0/1     ContainerCreating   0          9m3s
pod/kubernetes-test4-kcsi-controller-8bd74cc96-l64xc       0/4     ContainerCreating   0          9m3s

...

Events:
  Type     Reason       Age                 From               Message
  ----     ------       ----                ----               -------
  Normal   Scheduled    10m                 default-scheduler  Successfully assigned tenant-georg25/kubernetes-test4-cluster-autoscaler-598b659b6c-tfrll to srv3
  Warning  FailedMount  17s (x13 over 10m)  kubelet            MountVolume.SetUp failed for volume "kubeconfig" : secret "kubernetes-test4-admin-kubeconfig" not found

prometherion commented 1 month ago

We added the finalizer to the Datastore secret since this is required to delete the Datastore data, such as key prefixes for etcd, and schemas for RDBMS.

I'm a bit lost with the Cozystack terminology, so thanks for your patience. Could you confirm these are the right steps to reproduce?

  1. Install Kamaji
  2. Create a Tenant Control Plane in its own Namespace
  3. Delete the Namespace
  4. Kamaji crashes

kvaps commented 1 month ago

Unfortunately I cannot reproduce this behavior :-(

I just see that this secret is created, and Kamaji tries to remove neither the secret nor its finalizer.

prometherion commented 1 month ago

@gecube reading again the reported logs, it seems to me Kamaji is not able to delete the given Tenant since the connection with the related etcd is broken (context deadline exceeded).

Where is the Datastore located? Furthermore, what's the error causing the CrashLoopBackOff for the Kamaji pod? I wonder about some health checks, or is it a nil pointer dereference?

kvaps commented 1 month ago

It seems the problem occurs only when the datastore is not available. I was able to reproduce it:

  1. create a namespace
  2. create a TenantControlPlane (TCP) in it
  3. make the datastore unavailable
  4. remove the namespace where the TCP is installed
  5. restart Kamaji

Then check that the namespace still holds the secret for accessing the database:

NAME                                    TYPE     DATA   AGE
kubernetes-qweqweqwe-datastore-config   Opaque   4      36m

It has a finalizer, which blocks namespace removal.

Kamaji removes tenantcontrolplanes.kamaji.clastix.io from the namespace even if the datastore is not available. So if you recover the datastore later, the orphaned secret won't be reconciled.
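
A rough way to spot such orphans from a dump of the cluster objects: look for secrets that still carry the datastore finalizer while no TenantControlPlane they reference exists any more. This is only a sketch over plain dicts; the function name and the input shapes are assumptions, not part of Kamaji.

```python
DATASTORE_FINALIZER = "finalizer.kamaji.clastix.io/datastore-secret"


def orphaned_datastore_secrets(secrets, existing_tcps):
    """Return (namespace, name) of secrets whose owning TCP is gone.

    secrets:       list of Secret manifests as dicts
    existing_tcps: set of (namespace, name) pairs of TenantControlPlanes
                   that still exist
    A secret that has the finalizer but no TenantControlPlane owner
    reference at all is also reported as orphaned.
    """
    orphans = []
    for secret in secrets:
        meta = secret.get("metadata", {})
        if DATASTORE_FINALIZER not in (meta.get("finalizers") or []):
            continue
        owners = [
            (meta.get("namespace"), ref.get("name"))
            for ref in meta.get("ownerReferences") or []
            if ref.get("kind") == "TenantControlPlane"
        ]
        if not any(owner in existing_tcps for owner in owners):
            orphans.append((meta.get("namespace"), meta.get("name")))
    return orphans


# Trimmed copy of the secret reported earlier in this thread.
stuck = {
    "metadata": {
        "name": "kubernetes-test0-datastore-config",
        "namespace": "tenant-georg03",
        "finalizers": [DATASTORE_FINALIZER],
        "ownerReferences": [
            {"kind": "TenantControlPlane", "name": "kubernetes-test0"}
        ],
    }
}

# The TenantControlPlane has already been removed, so the set is empty.
print(orphaned_datastore_secrets([stuck], existing_tcps=set()))
```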

prometherion commented 1 month ago

I was able to reproduce this, but Kamaji is not in CrashLoopBackOff (v1.0.0)

NAME                              READY   STATUS      RESTARTS       AGE
etcd-0                            1/1     Running     7 (6d8h ago)   36d
etcd-1                            1/1     Running     7 (6d8h ago)   36d
etcd-2                            1/1     Running     7 (6d8h ago)   36d
etcd-nvme-defrag-28684290-mmg24   0/1     Completed   0              2m39s
kamaji-56649dbd78-hx2gq           1/1     Running     0              2m28s

I see in the logs kamaji tries to connect to the given Datastore, and that's ok: the problem here is that Kamaji is not aware of your business logic.

Not sure if it's the case, but let's take for granted you're deleting the Datastore/etcd in the same Namespace where the Tenant Control Plane resides: we know the Tenant Control Plane has a dependency on the Datastore that must be finalized prior to the deletion of the Datastore itself. Kamaji is not aware you're deleting the entire Namespace and the etcd is gone, so it constantly tries to reconcile the finalizer by performing the clean-up.

I would suggest, if possible, ordering the actions, as we do with the creation of a Tenant Control Plane, where:

  1. a Datastore is created
  2. then the Tenant Control Plane for the given Datastore is created

With the same principle, the deletion requires:

  1. Deletion of the Tenant Control Plane
  2. Eventually, deletion of the Datastore
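
The two orderings above are mirror images of each other, which a platform could derive from a single dependency map. A minimal sketch using the standard-library `graphlib` (the map and function names are illustrative assumptions, not Kamaji or Cozystack code):

```python
from graphlib import TopologicalSorter

# Each resource kind maps to the kinds it depends on.
DEPENDS_ON = {"TenantControlPlane": {"Datastore"}}


def creation_order(depends_on):
    """Dependencies first: a Datastore before its Tenant Control Plane."""
    return list(TopologicalSorter(depends_on).static_order())


def deletion_order(depends_on):
    """Reverse of creation: dependents go away before their dependencies."""
    return list(reversed(creation_order(depends_on)))


print(creation_order(DEPENDS_ON))
print(deletion_order(DEPENDS_ON))
```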

It could be that your etcd is unreachable for a specific reason, as in my test (kubectl scale sts --replicas=0). Even then, once it was scaled back up, Kamaji was able to connect to etcd and perform the clean-up, and then the TCP was removed, along with the secrets and the Namespace.
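
The behaviour described here, where the finalizer stays while the datastore is unreachable and is removed once the clean-up succeeds, can be illustrated with a small simulation. This is not Kamaji's actual reconciler: `FakeDatastore`, `reconcile_deletion`, and the key prefix are hypothetical stand-ins.

```python
FINALIZER = "finalizer.kamaji.clastix.io/datastore-secret"


class DatastoreUnavailable(Exception):
    """Raised when the (fake) datastore cannot be reached."""


class FakeDatastore:
    def __init__(self, reachable):
        self.reachable = reachable
        self.keys = {"/kubernetes-test0/registry"}  # hypothetical tenant key

    def delete_prefix(self, prefix):
        if not self.reachable:
            raise DatastoreUnavailable("context deadline exceeded")
        self.keys = {k for k in self.keys if not k.startswith(prefix)}


def reconcile_deletion(secret, datastore, prefix):
    """Return True once the finalizer is gone, False if a retry is needed."""
    finalizers = secret["metadata"]["finalizers"]
    if FINALIZER not in finalizers:
        return True
    try:
        datastore.delete_prefix(prefix)  # clean up the tenant data first
    except DatastoreUnavailable:
        return False  # keep the finalizer and try again later
    finalizers.remove(FINALIZER)
    return True


secret = {"metadata": {"finalizers": [FINALIZER]}}
store = FakeDatastore(reachable=False)

print(reconcile_deletion(secret, store, "/kubernetes-test0/"))  # datastore down
store.reachable = True  # e.g. the StatefulSet is scaled back up
print(reconcile_deletion(secret, store, "/kubernetes-test0/"))  # clean-up succeeds
```

The point of the sketch is the ordering: the finalizer is only dropped after the data clean-up has succeeded, so an unreachable datastore leaves the secret in place.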

I don't think this is a bug report we have to address; it sounds more like an edge case where you have to better orchestrate your platform on top of Kamaji.

gecube commented 1 month ago

@prometherion thanks for the reproduction. I think we cannot rely on removal order anyway. We can enforce an order when applying objects (there are many mechanisms for it, particularly in Helm itself or FluxCD), but on removal we can expect anything. For example, a user comes and removes the namespace completely because they are not aware of the complex logic under the hood, and we can do nothing about it at the platform level. The only option, I believe, is to write all controllers in such a manner that:

  1. the controller never resides in the same namespace as the CR it manages. Otherwise it is easy to run into a situation where the controller pod has already been removed (as removal order is not strictly defined) and nobody is left to handle the finalizer. I have already faced the same issue with FluxCD itself and the Victoria Logs operator.
  2. the controller properly handles the removal of all objects it manages.

prometherion commented 1 month ago

I'd like to help here, but unfortunately, it's out of our control.

If the user makes the Datastore unavailable for any reason, and the user deletes the TenantControlPlane, Kamaji still relies on the clean-up of those resources. It makes sense, since we don't want to leave etcd with orphaned keys, and given the context here (such as etcd being unreachable and the user deleting the Namespace) it sounds like an edge case.

The addressable bug here is the CrashLoopBackOff which is non-reproducible, at least with v1.0.0.

As I said before, without being nasty, it's not a bug per se, but the typical Kubernetes scenario where there's a chain of dependencies that must be known by the user; if it's orchestrated by a third-party platform, the platform must know it and orchestrate accordingly.

I'm going to close this issue but:

  1. happy to open it back if you're able to provide me further details on how to replicate the CrashLoopBackOff
  2. happy to continue the discussion for an enhanced proposal
  3. note that once the Datastore connection was re-established correctly, the deletion completed successfully.