Azure / secrets-store-csi-driver-provider-azure

Azure Key Vault provider for Secret Store CSI driver allows you to get secret contents stored in Azure Key Vault instance and use the Secret Store CSI driver interface to mount them into Kubernetes pods.
https://azure.github.io/secrets-store-csi-driver-provider-azure/
MIT License

Support multiple Azure Key Vault instances as fallback #1433

Open JorTurFer opened 7 months ago

JorTurFer commented 7 months ago

Describe the solution you'd like: Yesterday there was an issue with the Azure Key Vault service in West Europe (probably maintenance or similar, because ALL our vaults were affected, regardless of subscription). The health monitors showed something like: [image]

Although the service issue isn't the responsibility of this driver, having a plan B to mitigate it would have been nice. In theory, Azure Key Vault is transparently replicated to the paired region with automatic failover in read-only mode, but that didn't happen.

We use multiple regions to be resilient to region failures but currently the secrets-store-csi is a single point of failure as it doesn't support any type of fallback at any level.

Given that, I'd like to propose extending current behavior to support other Azure Key Vaults as failover if the primary instance fails.

Current configuration looks like:

parameters:
    keyvaultName: ......
    tenantId: ......
    useVMManagedIdentity: 'true'
    userAssignedIdentityID: .....
    objects: |
      array:
        - |
          objectName: ...
          objectType: secret

and it could be easily extended with an array of fallback Key Vaults (or just one 🤷)

parameters:
    keyvaultName: ......
    tenantId: ......
    userAssignedIdentityID: .....
    fallback:
    -  keyvaultName: ......
       tenantId: ......
       userAssignedIdentityID: .....
    objects: |
      array:
        - |
          objectName: ...
          objectType: secret

This approach would improve the resiliency of the component by falling back to other Azure Key Vault instances whenever the primary instance returns an error, without disrupting the service.
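To make the proposed behavior concrete, here is a minimal, language-agnostic sketch written in Python (the provider itself is not written in Python, and this is not its code). It assumes the primary vault comes first in an ordered list, followed by the `fallback` entries, and that the first vault to answer wins. `VaultConfig`, `AllVaultsFailedError`, `fetch_with_fallback` and the injected `fetch` callable are all hypothetical names used only for illustration; the real Key Vault call is deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class VaultConfig:
    # Hypothetical type; fields mirror the parameters in the proposal above.
    keyvault_name: str
    tenant_id: str
    user_assigned_identity_id: str


class AllVaultsFailedError(Exception):
    """Raised when neither the primary nor any fallback vault returned the object."""


def fetch_with_fallback(
    object_name: str,
    vaults: Sequence[VaultConfig],
    fetch: Callable[[VaultConfig, str], str],
) -> str:
    """Try each vault in order (primary first) and return the first success.

    `fetch` stands in for whatever actually reads an object from one vault;
    it is injected so this sketch makes no claim about the real client API.
    """
    errors: List[str] = []
    for vault in vaults:
        try:
            return fetch(vault, object_name)
        except Exception as exc:  # any failure on this vault triggers the fallback
            errors.append(f"{vault.keyvault_name}: {exc}")
    # Every vault failed: surface all collected errors to the caller.
    raise AllVaultsFailedError("; ".join(errors))
```

A design question this sketch leaves open is which errors should trigger the fallback: treating every exception as retryable on the next vault is the simplest choice, but a real implementation might want to fall back only on availability errors and not on, say, a missing object or a permission problem.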

Anything else you would like to add:

As CSI volumes don't support being optional, problems with the upstream will block pod scheduling (with a chance of huge impact in production environments if this happens during high load peaks). I've reviewed the csi-secret-store documentation and haven't found anything to handle these scenarios, but maybe I've missed something.

enj commented 7 months ago

Linking the slack thread here for future reference.

enj commented 7 months ago

Writing down recommendations from the slack thread:

JorTurFer commented 7 months ago

This is the slack thread in sig-storage.

I'd like to respectfully say that all the options seem to amount to: "do it from your side or go to another place". It is a fact that the component isn't resilient to any kind of disruption, which can be a no-go for production scenarios.

Not storing the secrets in the k8s API is the main reason for using the CSI driver. Storing the secrets in the k8s API, instead of a fallback/failover/just-another-call, is already handled by other third-party tools, and it's considerably less secure than using CSI.

JorTurFer commented 6 months ago

Hi again ✋! I presented the topic at the SIG-Storage meeting (Jan 25th), where a SIG lead was present. The conclusion from the SIG is that the CSI driver (this component) is the one that has to handle failures and high-availability features, as making the volume optional only moves the problem from the k8s layer to the application layer.

Is this now open to discussion, or is doing it myself the only option left?

Currently, the component isn't resilient to Azure Key Vault failures and is indeed a single point of failure, which is a problem at least for us (and that's why we are willing to contribute this).

JorTurFer commented 6 months ago

Hello @enj! Is there any update related to this?

JorTurFer commented 4 months ago

Hello @enj! Have you had a chance to look at this?