Reconcile algorithm overuses the Doppler API

miskr-instructure commented 6 months ago

We have been swapping out an in-house k8s reloading/secret handling solution to this official operator solution. Previously we used Doppler Webhooks for reloading, and an init script at container startup to load the current secrets.

What we have noticed since migrating is that we gets lots of these errors in the pod logs:

Doppler Error: Exceeded rate limit of 240 secret read requests within 60 seconds. Retry in 1 seconds. Upgrade to the Enterprise plan to increase your limit

We have many DopplerSecret custom resources (but many of them reference the same Doppler Config actually). Despite there being many, they rarely ever change (on the frequency level of 1-2 changes per week), so we should not exceed the API rate limits.

It doesn't make any sense to me that the operator needs to HTTP-GET all secrets every N seconds from the Doppler API. It should be able to use a functionality similar to the Webhooks to do push-based reconciliation instead of polling Doppler. Doing so would drastically reduce the API load on Doppler.

I would propose one of two solutions:

Stick to polling, rely on ETag and If-None-Match headers to decrease load on Doppler API (looks like Etag is already implemented) but increase the API rate limit specifically for HTTP-304 responses since they should be less expensive for Doppler than HTTP-200 reponses.
Use a functionality similar to the Webhooks to do push-based reconciliation instead of polling. For example, I would be okay with exposing the operator via an ingress so it could receive notifications from Doppler. Then the polling-based solution could be kept as a fallback with its frequency significantly decreased.

watsonian commented 6 months ago

@miskr-instructure Sorry that you're running into this. It's definitely a potential risk if you start having a huge number of DopplerSecret resources created. You can actually adjust this resync interval for each DopplerSecret, which may help you here. This wasn't documented until today because we typically don't recommend that people go tweaking this unless they're running into the situation you are now. The field you want to adjust is called resyncSeconds and you use it like this:

apiVersion: secrets.doppler.com/v1alpha1
kind: DopplerSecret
metadata:
  name: dopplersecret-test
  namespace: doppler-operator-system
spec:
  tokenSecret:
    name: doppler-token-secret
  managedSecret:
    name: doppler-test-secret
    namespace: default
  resyncSeconds: 120

Regarding the issue more broadly, we do have a SSE endpoint that the --watch flag for our CLI uses and we've discussed potentially moving the operator over to this, but there are some architectural implications which have prevented it thus far. Longer term, that's definitely the direction we'd like to go though. Short term, adjusting resyncSeconds should help!

miskr-instructure commented 6 months ago

Hi, 2 questions:

I did not check the code, but is the operator bursting the API (eg.: 50 secret crds with 120 retrySeconds -> 50 requests in quick burst, then silence for the rest of 120s) or spreading the requests out somehow over the resync period? Because if it's the former, increasing resync may not help too much (unless API rate limit is credit-based and accumulates) - because then every 2 minutes the same burst would happen, no?
increasing resyncSeconds will increase the time it takes for (the very occasional) changes to be propagated, right?

Short term, I think it'd be cool if the operator had an optional Service endpoint that was capable of receiving doppler's own POST webhooks (and marking appropriate doppler secrets for resync) - the ingress and network infra setup to make it reachable from doppler doesn't need to be included in the charts/manifests, that is easy to handle internally (and the webhooks are already authenticated with HMAC I believe).

DopplerHQ / kubernetes-operator

Reconcile algorithm overuses the Doppler API #58