DopplerHQ / kubernetes-operator

Apache License 2.0
44 stars 18 forks source link

Reconcile algorithm overuses the Doppler API #58

Open miskr-instructure opened 6 months ago

miskr-instructure commented 6 months ago

We have been swapping out an in-house k8s reloading/secret handling solution to this official operator solution. Previously we used Doppler Webhooks for reloading, and an init script at container startup to load the current secrets.

What we have noticed since migrating is that we gets lots of these errors in the pod logs:

Doppler Error: Exceeded rate limit of 240 secret read requests within 60 seconds. Retry in 1 seconds. Upgrade to the Enterprise plan to increase your limit

We have many DopplerSecret custom resources (but many of them reference the same Doppler Config actually). Despite there being many, they rarely ever change (on the frequency level of 1-2 changes per week), so we should not exceed the API rate limits.

It doesn't make any sense to me that the operator needs to HTTP-GET all secrets every N seconds from the Doppler API. It should be able to use a functionality similar to the Webhooks to do push-based reconciliation instead of polling Doppler. Doing so would drastically reduce the API load on Doppler.

I would propose one of two solutions:

  1. Stick to polling, rely on ETag and If-None-Match headers to decrease load on Doppler API (looks like Etag is already implemented) but increase the API rate limit specifically for HTTP-304 responses since they should be less expensive for Doppler than HTTP-200 reponses.
  2. Use a functionality similar to the Webhooks to do push-based reconciliation instead of polling. For example, I would be okay with exposing the operator via an ingress so it could receive notifications from Doppler. Then the polling-based solution could be kept as a fallback with its frequency significantly decreased.
watsonian commented 6 months ago

@miskr-instructure Sorry that you're running into this. It's definitely a potential risk if you start having a huge number of DopplerSecret resources created. You can actually adjust this resync interval for each DopplerSecret, which may help you here. This wasn't documented until today because we typically don't recommend that people go tweaking this unless they're running into the situation you are now. The field you want to adjust is called resyncSeconds and you use it like this:

apiVersion: secrets.doppler.com/v1alpha1
kind: DopplerSecret
metadata:
  name: dopplersecret-test
  namespace: doppler-operator-system
spec:
  tokenSecret:
    name: doppler-token-secret
  managedSecret:
    name: doppler-test-secret
    namespace: default
  resyncSeconds: 120

Regarding the issue more broadly, we do have a SSE endpoint that the --watch flag for our CLI uses and we've discussed potentially moving the operator over to this, but there are some architectural implications which have prevented it thus far. Longer term, that's definitely the direction we'd like to go though. Short term, adjusting resyncSeconds should help!

miskr-instructure commented 6 months ago

Hi, 2 questions:

Short term, I think it'd be cool if the operator had an optional Service endpoint that was capable of receiving doppler's own POST webhooks (and marking appropriate doppler secrets for resync) - the ingress and network infra setup to make it reachable from doppler doesn't need to be included in the charts/manifests, that is easy to handle internally (and the webhooks are already authenticated with HMAC I believe).