Azure / secrets-store-csi-driver-provider-azure

Azure Key Vault provider for Secret Store CSI driver allows you to get secret contents stored in an Azure Key Vault instance and use the Secret Store CSI driver interface to mount them into Kubernetes pods.
https://azure.github.io/secrets-store-csi-driver-provider-azure/
MIT License

[Question] Key Vault Throttling Limits with a Large Number of Pods #722

Closed: briantre closed this issue 1 month ago

briantre commented 2 years ago

We're considering using this secrets provider in our service but we're concerned about getting throttled by Key Vault when there are a large number of pods running in our cluster. I want to make sure that my understanding of the system is correct before we move forward.

Is it correct to say that for every polling interval the number of requests sent to Key Vault equals (number of pods) × (number of secrets)? Also, do all pods poll at the same time? If both of those assumptions are correct, then supposing my service needs to retrieve 4 secrets from Key Vault, we'd be limited to a maximum of 500 pods in order to stay under the Key Vault throttling limit of 2000 requests per 10 seconds. Correct?
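
To make that arithmetic concrete, here is a rough back-of-the-envelope sketch in Go. It assumes the worst case described above, where every pod fetches every secret within the same 10-second throttling window; the numbers are the ones from this question, not measurements of the driver's actual behavior.

```go
package main

import "fmt"

func main() {
	const (
		secretsPerPod     = 4    // secrets each pod mounts (our case)
		throttleLimit     = 2000 // Key Vault limit assumed above: 2000 GETs per 10-second window
		pollIntervalSec   = 120  // default rotation poll interval (2m)
		throttleWindowSec = 10
	)

	// Worst case: every pod polls within the same 10-second window.
	maxPodsWorstCase := throttleLimit / secretsPerPod
	fmt.Println("worst-case pod ceiling:", maxPodsWorstCase) // 500

	// If the requests were instead spread evenly across the whole poll
	// interval, the budget would be proportionally larger.
	windows := pollIntervalSec / throttleWindowSec
	fmt.Println("ceiling if spread evenly over the interval:", maxPodsWorstCase*windows) // 6000
}
```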

nilekhc commented 2 years ago

Hello @briantre, thanks for the question. We do not process rotation concurrently. Every rotation call translates to a Key Vault GET call for the secrets/keys defined per pod + SecretProviderClass (SPC). You can reference the same SPC from multiple pods. We create a SecretProviderClassPodStatus (SPCPS) resource per pod, which has information about all secrets being used in that pod. During rotation, we list all SPCPS and linearly check for a newer version of each secret.
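
In rough pseudocode, that flow looks like the sketch below. This is only a simplified illustration of the behavior described above, not the driver's actual code; the type and function names are made up.

```go
// Package rotationsketch is a simplified illustration of the rotation flow
// described above; the names here are illustrative, not the driver's API.
package rotationsketch

// SecretRef is a secret/key entry recorded for a pod.
type SecretRef struct {
	VaultObjectName string
	CurrentVersion  string
}

// SPCPS stands in for a SecretProviderClassPodStatus resource.
type SPCPS struct {
	PodName string
	Secrets []SecretRef
}

// rotate walks every SPCPS linearly and issues one Key Vault GET per secret
// per pod; no caching or de-duplication across pods is assumed in this sketch.
func rotate(allSPCPS []SPCPS, getLatestVersion func(objectName string) (string, error)) {
	for _, status := range allSPCPS {
		for _, ref := range status.Secrets {
			latest, err := getLatestVersion(ref.VaultObjectName) // one GET per pod per secret
			if err != nil {
				continue // skip; picked up again on the next rotation pass
			}
			if latest != ref.CurrentVersion {
				_ = status.PodName // remount/update the secret for this pod (omitted)
			}
		}
	}
}
```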

Having said that, how often are you updating the secrets/keys in Key Vault? Setting an aggressive rotation poll interval will definitely lead to throttling. The AKV team recommends caching for up to 8 hours.

Also, did you get a chance to look at our load test results? We performed a test on a large cluster with 10,000 pods using the CSI volume. This might be helpful for your planning as well.

Hope this helps.

briantre commented 2 years ago

Hello @nilekhc, thank you so much for your reply. I had not seen the results of the load test before; thank you for pointing them out to me! The results look very encouraging, but I'm still trying to wrap my head around why the load test didn't run into throttling issues. If my understanding is correct, in the load test there were 10,000 pods that all referenced the same SecretProviderClass, which was syncing two secrets. That meant there were 10,000 SecretProviderClassPodStatus instances as well, since there is one for each pod. You say that,

During rotation, we list all SPCPS and linearly check for a newer version of each secret.

So it would seem that for each of the 10,000 SPCPS instances the driver would be doing a GET for both secrets. That would work out to 10,000 SPCPS instances × 2 secrets = 20,000 requests, all going to Key Vault one right after the other. Correct? Since that would surely end up being throttled, there must be another piece of the puzzle that I'm not yet fully understanding.

For us, we have integrated the CSI driver to handle just a couple of secrets, including our TLS certificate. I pulled the logs from csi-csi-secrets-store-provider-azure to try to understand the behavior. I found a log line that contains "fetching object from key vault"; it looks like that line is logged shortly before a request is sent to Key Vault. I massaged the logs a bit, and I can see that during the update process the same object is being pulled from Key Vault for each pod. Here is a snippet of the massaged logs. We can see the name of the object being retrieved from Key Vault, along with an array of objects containing the timestamp and the name of the pod it's being retrieved for.

{
  "objectName": "my-tls-certificate",
  "requests": [
    {
      "ts": "2021-11-21T09:40:24Z",
      "pod": "traefik/traefik-d98d7dcbf-hjxfm"
    },
    {
      "ts": "2021-11-21T09:40:25Z",
      "pod": "traefik/traefik-d98d7dcbf-npvfv"
    },
    {
      "ts": "2021-11-21T09:42:25Z",
      "pod": "traefik/traefik-d98d7dcbf-hjxfm"
    },
    {
      "ts": "2021-11-21T09:42:25Z",
      "pod": "traefik/traefik-d98d7dcbf-npvfv"
    },
    {
      "ts": "2021-11-21T09:44:24Z",
      "pod": "traefik/traefik-d98d7dcbf-npvfv"
    },
    {
      "ts": "2021-11-21T09:44:25Z",
      "pod": "traefik/traefik-d98d7dcbf-hjxfm"
    },
    {
      "ts": "2021-11-21T09:46:24Z",
      "pod": "traefik/traefik-d98d7dcbf-hjxfm"
    },
    {
      "ts": "2021-11-21T09:46:25Z",
      "pod": "traefik/traefik-d98d7dcbf-npvfv"
    },
    {
      "ts": "2021-11-21T09:48:24Z",
      "pod": "traefik/traefik-d98d7dcbf-hjxfm"
    },
    {
      "ts": "2021-11-21T09:48:25Z",
      "pod": "traefik/traefik-d98d7dcbf-npvfv"
    },
    {
      "ts": "2021-11-21T09:50:25Z",
      "pod": "traefik/traefik-d98d7dcbf-hjxfm"
    },
    {
      "ts": "2021-11-21T09:50:25Z",
      "pod": "traefik/traefik-d98d7dcbf-npvfv"
    },
    {
      "ts": "2021-11-21T09:52:24Z",
      "pod": "traefik/traefik-d98d7dcbf-hjxfm"
    },
    {
      "ts": "2021-11-21T09:52:25Z",
      "pod": "traefik/traefik-d98d7dcbf-npvfv"
    },
    {
      "ts": "2021-11-21T09:54:24Z",
      "pod": "traefik/traefik-d98d7dcbf-npvfv"
    },
    {
      "ts": "2021-11-21T09:54:25Z",
      "pod": "traefik/traefik-d98d7dcbf-hjxfm"
    },
    {
      "ts": "2021-11-21T09:56:25Z",
      "pod": "traefik/traefik-d98d7dcbf-hjxfm"
    },
    {
      "ts": "2021-11-21T09:56:25Z",
      "pod": "traefik/traefik-d98d7dcbf-npvfv"
    },
    {
      "ts": "2021-11-21T09:58:25Z",
      "pod": "traefik/traefik-d98d7dcbf-hjxfm"
    },
    {
      "ts": "2021-11-21T09:58:25Z",
      "pod": "traefik/traefik-d98d7dcbf-npvfv"
    },
    {
      "ts": "2021-11-21T10:00:25Z",
      "pod": "traefik/traefik-d98d7dcbf-hjxfm"
    },
    {
      "ts": "2021-11-21T10:00:25Z",
      "pod": "traefik/traefik-d98d7dcbf-npvfv"
    },
    {
      "ts": "2021-11-21T10:02:25Z",
      "pod": "traefik/traefik-d98d7dcbf-hjxfm"
    },
    {
      "ts": "2021-11-21T10:02:25Z",
      "pod": "traefik/traefik-d98d7dcbf-npvfv"
    },
    {
      "ts": "2021-11-21T10:04:24Z",
      "pod": "traefik/traefik-d98d7dcbf-npvfv"
    },
    {
      "ts": "2021-11-21T10:04:25Z",
      "pod": "traefik/traefik-d98d7dcbf-hjxfm"
    }
  ]
}

Looking at timestamp 2021-11-21T09:42:25Z, we can see that the object named my-tls-certificate was pulled multiple times because it was referenced by multiple pods. This strikes me as odd, since it feels unnecessary to pull the same secret over and over again just to supply it to different pods. Additionally, it would seem that as the number of pods increases, so too does the number of requests, which leads to an increased likelihood of being throttled.

Lastly, we don't intend to update our secrets very often. I think that an 8-hour polling interval would definitely be sufficient for our needs. These logs came from our dev environment, where we were using the default polling interval of 2 minutes.
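
For reference, the "massaging" above was just grouping the provider's log lines by object name. A minimal sketch of that step is below; the field layout in the regex is assumed for illustration, since the exact line format isn't shown here, so it would need adjusting to the real logs.

```go
// Rough sketch of the log "massaging": group provider log lines that mention
// "fetching object from key vault" by object name and emit JSON like the
// snippet above. The regex describes an assumed line shape, not a documented one.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"regexp"
)

type fetchEvent struct {
	TS  string `json:"ts"`
	Pod string `json:"pod"`
}

func main() {
	// Hypothetical line shape: ts=<time> ... fetching object from key vault ... objectName=<name> ... pod=<ns/pod>
	re := regexp.MustCompile(`ts=(\S+).*fetching object from key vault.*objectName=(\S+).*pod=(\S+)`)
	byObject := map[string][]fetchEvent{}

	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		if m := re.FindStringSubmatch(sc.Text()); m != nil {
			byObject[m[2]] = append(byObject[m[2]], fetchEvent{TS: m[1], Pod: m[3]})
		}
	}

	out, _ := json.MarshalIndent(byObject, "", "  ")
	fmt.Println(string(out))
}
```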

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 14 days with no activity. Please comment or this will be closed in 7 days.

github-actions[bot] commented 2 years ago

This issue was closed because it has been stalled for 21 days with no activity. Feel free to re-open if you are experiencing the issue again.

kchilka-msft commented 1 year ago

hi @nilekhc - I wanted to re-open the question, as I am trying to understand the workings of the library as well. I have a similar understanding to @briantre.

@nilekhc - can you please answer @briantre's question? This will help clear up the confusion:

So it would seem that for each of the 10,000 SPCPS instances the driver would be doing a GET for both secrets. That would work out to 10,000 SPCPS instances × 2 secrets = 20,000 requests, all going to Key Vault one right after the other. Correct? Since that would surely end up being throttled, there must be another piece of the puzzle that I'm not yet fully understanding.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 14 days with no activity. Please comment or this will be closed in 7 days.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 14 days with no activity. Please comment or this will be closed in 7 days.

github-actions[bot] commented 1 month ago

This issue was closed because it has been stalled for 21 days with no activity. Feel free to re-open if you are experiencing the issue again.