Closed: apex-omontgomery closed this issue 3 years ago.
Sorry for the late response @wimo7083, I was busy with some other projects. I want to try to completely remove the need for a restart, which is there for legacy reasons. I'll take a look at that and update with findings ASAP.
Hi @wimo7083, I've created a fix in #599
You are welcome to test it and report back before we release it.
The image name is: soluto/kamus:controller-infite-watch
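For anyone else who wants to try it, one way to point an existing install at the test image is `kubectl set image`; this is just a sketch, and the deployment name, container name, and label selector below are assumptions that may not match your release:

```sh
# Assumption: the controller Deployment is named "kamus-controller" and its
# container is named "controller" - adjust to match your install.
kubectl set image deployment/kamus-controller \
  controller=soluto/kamus:controller-infite-watch

# Watch the rollout, then keep an eye on the restart counter.
kubectl rollout status deployment/kamus-controller
kubectl get pods -l app=kamus -w   # label selector is an assumption
```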
Thank you for going above and beyond, I was honestly just hoping for some best practices. I'll test this out Monday when I return.
Hi @wimo7083 any news?
I had to hold off on trying this out due to other reasons. I'm going to apply this to our non-prod environments shortly and let it bake for a while. Thank you for your patience.
I just applied this to our dev environment. We had to make some changes to ensure the encryptor and decryptor images we were using were set up properly.
The baseline for us is 90 restarts in ~4 days:

```
NAME                   READY   STATUS        RESTARTS   AGE
kamus-controller-$ID   0/1     Terminating   90         3d17h
```
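For context, the restart counts above come from the standard pod listing, e.g. (the label selector is an assumption):

```sh
# Restart count per pod; adjust the label selector to your chart's labels.
kubectl get pods -l app=kamus -o wide

# Or pull just the counter for the controller pod shown above.
kubectl get pod kamus-controller-$ID \
  -o jsonpath='{.status.containerStatuses[0].restartCount}'
```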
I mirrored the supplied image, along with the following versions of the encryptor and decryptor:

```sh
crane copy docker.io/soluto/kamus:controller-infite-watch $INTERNAL_MIRROR/mirror/docker.io/soluto/kamus:controller-infinite-watch
crane copy docker.io/soluto/kamus:encryptor-0.8.0.0 $INTERNAL_MIRROR/mirror/docker.io/soluto/kamus:encryptor-infinite-watch
crane copy docker.io/soluto/kamus:decryptor-0.8.0.0 $INTERNAL_MIRROR/mirror/docker.io/soluto/kamus:decryptor-infinite-watch
```
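Since the tags don't match exactly ("infite" upstream vs "infinite" in the mirror), it may be worth confirming that both references resolve to the same digest, for example with crane:

```sh
# Compare the upstream digest with the mirrored copy; they should be identical.
crane digest docker.io/soluto/kamus:controller-infite-watch
crane digest $INTERNAL_MIRROR/mirror/docker.io/soluto/kamus:controller-infinite-watch
```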
Running pod hash:

```yaml
containerStatuses:
  - image: $INTERNAL_MIRROR/mirror/docker.io/soluto/kamus:controller-infinite-watch
    imageID: docker-pullable://$INTERNAL_MIRROR/mirror/docker.io/soluto/kamus@sha256:2c2b8712cf1ac85ba0f1d7177d28eefeee023543c9445883722024e9c405d712
```
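The imageID above can be pulled straight from the pod status, for example (pod name taken from the output above, adjust to your environment):

```sh
# Print the image digest the running controller pod was actually started from.
kubectl get pod kamus-controller-$ID \
  -o jsonpath='{.status.containerStatuses[0].imageID}'
```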
Remote image hash:

```sh
$ docker pull docker.io/soluto/kamus:controller-infite-watch | grep Digest
Digest: sha256:2c2b8712cf1ac85ba0f1d7177d28eefeee023543c9445883722024e9c405d712
```
90 restarts in about 2 days:

```
kamus-controller-$ID   1/1   Running   90   2d20h
```
Is there anything from the logs or events that would be helpful?
Hi @wimo7083, thanks for taking the time to test that; that's super surprising. I'd be happy to get some logs from the controller from a bit before it restarts. I'll have to run some tests on my own to validate this, stay tuned.
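If it helps, a simple way to capture the controller's output from just before a restart is the `--previous` flag, plus the pod events (pod name is a placeholder from the output above):

```sh
# Logs from the last terminated instance of the controller container.
kubectl logs kamus-controller-$ID --previous

# Recent events for the pod, which usually show why it was restarted
# (OOMKilled, liveness probe failure, etc.).
kubectl describe pod kamus-controller-$ID | sed -n '/Events:/,$p'
```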
Hi @wimo7083. Version 0.9.0.5 was just released (chart version 0.9.5). I tested it on my side and found it restart-free :-)
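For reference, upgrading should be the usual chart upgrade; this sketch assumes the chart was installed from the soluto repo under the release name `kamus`:

```sh
# Assumes the release is named "kamus" and the repo alias is "soluto";
# adjust names and namespace to your install.
helm repo update
helm upgrade kamus soluto/kamus --version 0.9.5
```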
Please note that KamusSecret v1alpha1 was dropped in version 0.9, so if you are using it, please convert to v1alpha2 per the changelog documentation.
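A quick way to find manifests that still use the dropped version is to grep wherever your KamusSecret definitions live (the path below is a placeholder):

```sh
# Find KamusSecret manifests still pinned to the dropped API version.
# "./manifests" is a placeholder path for wherever your YAML is stored.
grep -rn "v1alpha1" ./manifests
```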
Please reopen if you still see that issue.
Describe the bug
This is less a bug report and more a question about what should be done to work around the operational trouble caused by the kamus-controller's scheduled hourly restart.
I was asked to make a new issue regarding this.
I understand kamus-controller restarting every 60 minutes is normal.
I'm not sure how to make dependent systems behave properly when this scheduled "downtime" occurs.
I've seen kamus-controller cause problems when a new HelmRelease is pushed out.
An example of a dependent system having trouble during these restarts. I've seen two types of failure modes:
- kamus-init-container failures
- KamusSecret failures
This is a mixture of three problems (Flux, Kamus, Helm), so I don't fault any one of them. My only thought is to add more replicas of the kamus-controller, but that only reduces the failure rate (if it's even recommended), and I'm not sure a PodDisruptionBudget with 2 replicas would matter when the pod itself is the thing restarting. See the sketch below.
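For what it's worth, a PodDisruptionBudget only guards against voluntary disruptions (node drains, evictions), so it wouldn't help when the container exits on its own. A minimal sketch of what I was considering, with an assumed label selector, would be:

```sh
# Sketch only: this PDB keeps at least one replica up during voluntary
# disruptions, but it does NOT prevent the hourly self-restart.
# The "app: kamus" label is an assumption - match it to your chart's labels.
# Older clusters (pre-1.21) would need apiVersion: policy/v1beta1.
kubectl apply -f - <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kamus-controller
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: kamus
EOF
```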
Versions used
I can include my versions if desired, as this is a question about how to work around the design of the kamus-controller restarts.
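If the versions end up being useful, they could be gathered with something like the following (release and deployment names are assumptions):

```sh
# Chart/release version and the controller image actually running;
# release name, namespace, and deployment name are assumptions.
helm list -A | grep kamus
kubectl get deployment kamus-controller \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```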