rliskunov opened 3 months ago
In general, it seems the problem is not a timeout but the ServiceAccount.
Let's say we have two applications: `api` and `worker`. A Secret is generated for each of them that allows access to Vault. Example for `api`:
```yaml
- apiVersion: v1
  kind: Secret
  type: Opaque
  metadata:
    name: argo-vault-api
    namespace: argocd
  stringData:
    VAULT_ADDR: http://vault.vault.svc.cluster.local:8200
    AVP_TYPE: vault
    AVP_AUTH_TYPE: k8s
    AVP_K8S_ROLE: argocd-api
```
The `argocd-api` role is created in Vault with the following parameters:
- Bound service account names: `argocd-repo-server`
- Bound service account namespaces: `argocd`
- Generated token's policies: `api`
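For reference, a minimal sketch of how such a role might be created with the Vault CLI, assuming the Kubernetes auth method is mounted at the default `kubernetes` path (the mount path and TTL here are assumptions, not taken from the report):

```shell
# Bind the role to the repo-server ServiceAccount in the argocd namespace,
# attaching only the per-application "api" policy.
vault write auth/kubernetes/role/argocd-api \
    bound_service_account_names=argocd-repo-server \
    bound_service_account_namespaces=argocd \
    token_policies=api \
    token_ttl=20m
```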
The `argocd-repo-server` pod uses the ServiceAccount `argocd-repo-server`. When we do a Hard Refresh in ArgoCD for `api`, it is as if the ServiceAccount `argocd-repo-server` latches onto the `argo-vault-api` secret, losing the Vault connection for `argo-vault-worker`. If we restart the `argocd-repo-server` pod and do a Hard Refresh for `worker`, we lose `api` instead.
When we used a universal role that has access to all secrets, we did not encounter this problem.
We are seeing a similar issue, as we have a similar setup.
We have done the troubleshooting inside the avp-helm sidecar container (in our case) that we run as part of the repo-server. It seems to us that when using different AppRoles within the same sidecar, there is an issue with token caching. The concept is briefly discussed here: https://argocd-vault-plugin.readthedocs.io/en/stable/usage/#caching-the-hashicorp-vault-token
We believe there is a race condition: whoever comes first to refresh the token (default lifetime is 20 minutes) gets to execute. This gets some additional randomness from having two repo-server instances, and therefore two sidecars, running at the same time.
This is further supported by our discovery that this never happens for our second avp sidecar, which always uses the same secret, and that we can always reproduce it by running a hard refresh for all of our applications (we use 10+ different AppRoles).
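One way to check this in practice, under the assumption that AVP caches a single Vault token on the sidecar filesystem (the container name and cache path below are assumptions based on the docs linked above, not confirmed for this setup):

```shell
# Inspect the token cache inside the avp sidecar; if only one token is
# stored, every application shares it regardless of which role requested it.
kubectl -n argocd exec deploy/argocd-repo-server -c avp-helm -- \
    cat /home/argocd/.avp/config.json
```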
Describe the bug
Periodically the plugin loses its connection to Vault: after configuration the plugin works correctly, but after 15-20 minutes the connection is lost. A Hard Refresh of the app does not help. However, if you restart `argocd-repo-server` and `argocd-redis`, everything works again; restarting only one of them does not solve the problem. I use Multitenancy with Kubernetes Authentication.
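For context, the restart that recovers the connection is simply recycling both deployments (resource names assume the default ArgoCD Helm chart):

```shell
kubectl -n argocd rollout restart deployment argocd-repo-server
kubectl -n argocd rollout restart deployment argocd-redis
```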
To Reproduce
If you want to reproduce this, you will need the following:
1. Install Vault in a Kubernetes cluster
2. Enable Kubernetes authentication in Vault
3. Add a policy to Vault: `argocd-policy`
4. Add a role to Vault: `argocd-role`, specifying the parameters (a CLI sketch of steps 2-4 follows this list)
5. Add a secret to Kubernetes in values.yaml for the ArgoCD Helm chart
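A minimal sketch of steps 2-4 with the Vault CLI; the policy body and secret paths are assumptions for illustration:

```shell
# Step 2: enable the Kubernetes auth method (default mount path "kubernetes").
vault auth enable kubernetes

# Step 3: write an example policy; actual paths depend on your secret layout.
vault policy write argocd-policy - <<'EOF'
path "secret/data/argocd/*" {
  capabilities = ["read"]
}
EOF

# Step 4: bind the role to the repo-server ServiceAccount, as described above.
vault write auth/kubernetes/role/argocd-role \
    bound_service_account_names=argocd-repo-server \
    bound_service_account_namespaces=argocd \
    token_policies=argocd-policy
```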
Expected behavior
Once a connection to Vault is configured for an application, it should keep working stably.
Screenshots/Verbose output
Example of output
Additional context
If you don't use Multitenancy but instead grant the most permissive (insecure) policy possible, the connection is stable.
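For illustration, a sketch of what such a catch-all policy might look like (assumed HCL, not taken from the report); because every role's token would carry the same broad policy, whichever token ends up cached still works for all applications:

```shell
vault policy write argocd-policy - <<'EOF'
path "secret/*" {
  capabilities = ["read", "list"]
}
EOF
```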