Soluto / kamus

An open source, git-ops, zero-trust secret encryption and decryption solution for Kubernetes applications
https://kamus.soluto.io
Apache License 2.0

How to mitigate kamus-controller restarts impacting dependent systems? #598

Closed - apex-omontgomery closed this issue 3 years ago

apex-omontgomery commented 3 years ago

Describe the bug

This is less of a bug report and more a question about what should be done to work around the operational trouble caused by the scheduled hourly kamus-controller restart.

I was asked to make a new issue regarding this.

I understand kamus-controller restarting every 60 minutes is normal.

I'm not sure how to make dependent systems behave properly when this scheduled "downtime" occurs.

I've seen kamus-controller cause problems when a new HelmRelease is pushed out.

An example of a dependent system having trouble during one of these restarts:

{
  "caller": "loop.go:108",
  "component": "sync-loop",
  "err": "collating resources in cluster for sync: conversion webhook for soluto.com/v1alpha2, Kind=KamusSecret failed: Post https://kamus-controller.kamus.svc:443/api/v1/conversion-webhook?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)",
  "ts": "2020-11-04T00:04:18.249326875Z"
}
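
For anyone debugging the same symptom: the URL in that error is the conversion webhook service for the KamusSecret CRD, so during one of these restart windows it can help to check whether that service still has ready endpoints and what the CRD's conversion configuration points at. The service name and namespace come from the URL above; the CRD name kamussecrets.soluto.com is an assumption based on the soluto.com group, so adjust it if yours differs.

# Check whether the webhook service has any ready endpoints during a restart window
kubectl -n kamus get endpoints kamus-controller

# Inspect the conversion webhook the API server calls for KamusSecret (CRD name assumed)
kubectl get crd kamussecrets.soluto.com -o jsonpath='{.spec.conversion}'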

I've seen two types of failure modes.

kamus-init-container failure:

  1. A new/modified HelmRelease (HR) is pushed; Flux notices and updates the HR.
  2. The Helm operator notices and tries to update using the kamus-init-container.
  3. The endpoint fails, which gives the log above; this causes the ConfigMap (or whatever you are creating with the kamus-init-container) to fail.
  4. Since Kamus couldn't produce the secret, the HelmRelease fails.

KamusSecret failure:

  1. A new/modified HR is pushed; Flux notices and updates the HR and the KamusSecret.
  2. The kamus-controller restarts, which causes a 1-3 minute delay in performing the conversion.
  3. Something in the HR depends on the corresponding output Secret object; this delay causes an ordered-dependency update failure.
  4. The dependent resource isn't smart/aware enough to retry, and the Helm hooks aren't configured properly to handle this.

This is a mixture of three problems (Flux, Kamus, Helm), so I don't fault any one of them. My only thought is to add more replicas of the kamus-controller, but all that does is reduce the failure rate (if it's even recommended), and I'm not sure a PodDisruptionBudget with 2 replicas would matter if the pod itself is causing the restart.
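
For reference, a minimal sketch of what the replica/PodDisruptionBudget idea would look like. The kamus namespace comes from the webhook URL above, but the deployment name and the app: kamus-controller label are assumptions about the chart's defaults; also note a PDB only protects against voluntary evictions, so it would not stop the controller's own scheduled restart.

# Scale the controller (deployment name assumed from the chart defaults)
kubectl -n kamus scale deployment kamus-controller --replicas=2

# Keep at least one controller pod available during voluntary disruptions
# (use policy/v1beta1 on clusters older than 1.21)
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kamus-controller
  namespace: kamus
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: kamus-controller   # assumed pod label
EOF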

Versions used

I can include my versions if desired, as this is a question about how to get around the design of the kamus-controller restarts.

shaikatz commented 3 years ago

Sorry for the late response @wimo7083, I was busy with some other projects. I want to try to completely remove the need for a restart - it's there for legacy reasons. I'll take a look at that and will update with findings ASAP.

shaikatz commented 3 years ago

Hi @wimo7083, I've created a fix in #599. You are welcome to test it and report back before we release it. The image name is: soluto/kamus:controller-infite-watch

apex-omontgomery commented 3 years ago

Thank you for going above and beyond; I was honestly just hoping for some best practices. I'll test this out Monday when I return.

shaikatz commented 3 years ago

Hi @wimo7083 any news?

apex-omontgomery commented 3 years ago

I had to hold off on trying this out due to other reasons. I'm going to apply this to our non-prod environments shortly and let it bake for a while. Thank you for your patience.

apex-omontgomery commented 3 years ago

I just applied this to our dev environment. We had to make some changes and make sure that the current encryptor and decryptor we were using were properly set up.

The baseline for us is 90 restarts in ~4 days:

NAME                                READY   STATUS              RESTARTS   AGE
kamus-controller-$ID   0/1     Terminating         90         3d17h

apex-omontgomery commented 3 years ago

I mirrored the supplied image along with the following versions of the encryptor and decryptor:

crane copy docker.io/soluto/kamus:controller-infite-watch $INTERNAL_MIRROR/mirror/docker.io/soluto/kamus:controller-infinite-watch
crane copy docker.io/soluto/kamus:encryptor-0.8.0.0 $INTERNAL_MIRROR/mirror/docker.io/soluto/kamus:encryptor-infinite-watch
crane copy docker.io/soluto/kamus:decryptor-0.8.0.0 $INTERNAL_MIRROR/mirror/docker.io/soluto/kamus:decryptor-infinite-watch

Running pod hash

  containerStatuses:
     image: $INTERNAL_MIRROR/mirror/docker.io/soluto/kamus:controller-infinite-watch
    imageID: docker-pullable://$INTERNAL_MIRROR/mirror/docker.io/soluto/kamus@sha256:2c2b8712cf1ac85ba0f1d7177d28eefeee023543c9445883722024e9c405d712

Remote image hash

$ docker pull docker.io/soluto/kamus:controller-infite-watch | grep Digest
Digest: sha256:2c2b8712cf1ac85ba0f1d7177d28eefeee023543c9445883722024e9c405d712
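
(Side note: the same digest can also be read without pulling the image, using the crane tool already used above for the mirroring; this assumes crane is available locally.)

crane digest docker.io/soluto/kamus:controller-infite-watch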

90 restarts in about 2 days:

kamus-controller-$ID   1/1     Running   90         2d20h

Is there anything from the logs or events that would be helpful?

shaikatz commented 3 years ago

Hi @wimo7083, thanks for taking the time to test that - that's super surprising. I'd be happy to get some logs from the controller from a bit before it gets restarted. I'll have to run some tests on my own to validate this, stay tuned.
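
In case it helps with collecting those, something along these lines should capture the previous container instance's logs and the recorded restart reason; the kamus namespace comes from the webhook URL earlier, and the pod name is a placeholder:

# Logs from the container instance that ran before the last restart
kubectl -n kamus logs kamus-controller-$ID --previous

# Events plus the last termination reason and exit code
kubectl -n kamus describe pod kamus-controller-$ID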

shaikatz commented 3 years ago

Hi @wimo7083. Version 0.9.0.5 was just released (chart version 0.9.5). I tested it on my side and found it restart-free :-)

Please note that KamusSecret v1alpha1 was dropped in version 0.9 - so in case you use it, please convert to v1alpha2 per the changelog documentation.
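
One way to check whether anything in a cluster still relies on the old version is to look at the versions the KamusSecret CRD serves and stores; the CRD name kamussecrets.soluto.com is an assumption based on the soluto.com/v1alpha2 group shown in the error earlier:

# Versions currently served by the API server (CRD name assumed)
kubectl get crd kamussecrets.soluto.com -o jsonpath='{.spec.versions[*].name}'

# Versions in which existing objects have been persisted
kubectl get crd kamussecrets.soluto.com -o jsonpath='{.status.storedVersions}'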

Please reopen if you still see that issue.