emissary-ingress / emissary

open source Kubernetes-native API gateway for microservices built on the Envoy Proxy
https://www.getambassador.io
Apache License 2.0

Headless authservice endpoints not getting updated #5417

Open · shrutilamba opened this issue 10 months ago

shrutilamba commented 10 months ago

I am facing issues while scaling my auth service. My auth service pods frequently scale up and down, and intermittently the membership total isn't updated in Ambassador. For example, my auth service scaled down from 21 to 20 pods, but Ambassador is still using the old count of 21, which leads to 5xx errors at my service because a pod that has already terminated is still being considered by Ambassador. It looks like some lag in service discovery. Does anyone have an idea what the issue could be here? I am using the endpoint resolver with a headless service for the auth service, on version v3.7.

However, if I move to the service resolver, this works correctly and the endpoints are updated as expected. Is there any workaround to make this work with the endpoint resolver?
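For context, this is roughly what the two resolver setups being compared look like; a minimal sketch, assuming the resolver is selected globally in the ambassador Module (the resolver names and load-balancing policy shown are illustrative defaults, not taken from the report):

```yaml
# Sketch only: resolver names and the load_balancer policy are illustrative.
---
apiVersion: getambassador.io/v3alpha1
kind: KubernetesEndpointResolver
metadata:
  name: endpoint                  # routes upstream traffic to individual pod IPs from Endpoints
---
apiVersion: getambassador.io/v3alpha1
kind: KubernetesServiceResolver
metadata:
  name: kubernetes-service        # routes to the Service DNS name / cluster IP instead
---
apiVersion: getambassador.io/v3alpha1
kind: Module
metadata:
  name: ambassador
spec:
  config:
    resolver: endpoint            # switch to "kubernetes-service" to use the service resolver
    load_balancer:
      policy: round_robin         # endpoint routing lets Envoy balance across pods directly
```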

alex-richman-onesignal commented 1 month ago

We're seeing the same issue, AuthService configured like this:

Spec:
  ambassador_id:
    --apiVersion-v3alpha1-only--emissary-staging
  auth_service:        auth-grpc-external.auth-grpc-staging
  failure_mode_allow:  true
  include_body:
    allow_partial:   true
    max_bytes:       1048576
  Proto:             grpc
  protocol_version:  v3
  status_on_error:
    Code:      403
  timeout_ms:  1000

This points at a headless service, with Emissary configured to use the endpoint resolver.
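For reference, here is the same configuration written out as an applyable manifest; a sketch reconstructed from the describe output above, where the resource names, namespaces, and the headless Service's selector and ports are assumptions for illustration:

```yaml
# Reconstructed sketch; names, namespaces, selector, and ports are assumptions.
apiVersion: getambassador.io/v3alpha1
kind: AuthService
metadata:
  name: auth-grpc-external
  namespace: auth-grpc-staging
spec:
  ambassador_id: ["emissary-staging"]
  auth_service: auth-grpc-external.auth-grpc-staging
  proto: grpc
  protocol_version: v3
  failure_mode_allow: true
  include_body:
    allow_partial: true
    max_bytes: 1048576
  status_on_error:
    code: 403
  timeout_ms: 1000
---
# The headless Service the AuthService points at (assumed shape).
apiVersion: v1
kind: Service
metadata:
  name: auth-grpc-external
  namespace: auth-grpc-staging
spec:
  clusterIP: None                 # headless: DNS returns the individual pod IPs
  selector:
    app: auth-grpc                # assumed selector
  ports:
    - name: grpc
      port: 80
      targetPort: 9090            # assumed ports
```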

When pods in the auth-grpc service are rotated, e.g. due to a deployment, Emissary does not update the IP addresses for the pods in the extauth service cluster. If I exec into the Emissary pod while this is happening and check the ambex snapshots, new snapshots are still being written with the old pod IPs.
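In case anyone wants to reproduce that check, this is roughly what it looks like; a sketch where the namespace, label selector, and the /ambassador/snapshots path are assumptions about a default-ish install:

```sh
# Assumed namespace, label selector, and snapshot path; adjust for your install.
POD=$(kubectl -n emissary get pods -l app.kubernetes.io/name=emissary-ingress \
      -o jsonpath='{.items[0].metadata.name}')

# List the snapshot files and when they were last written.
kubectl -n emissary exec "$POD" -- ls -lt /ambassador/snapshots/

# Check whether a terminated auth pod's IP still appears in any snapshot
# (replace 10.8.1.23 with the IP of a pod that has already been removed).
kubectl -n emissary exec "$POD" -- grep -rl "10.8.1.23" /ambassador/snapshots/
```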

This only seems to be a problem with the auth service. We have a handful of other services using the same Kubernetes Service setup, routed through the endpoint resolver, which have no issue and do not retain old IPs when pods are rolled.

Changing Emissary to use the k8s service resolver solves this problem, but we don't want to do that because of the poor load balancing.

Restarting Emissary naturally picks up the new pod IPs and fixes it temporarily, until the next auth service deploy. Waiting 1+ hours does not fix the issue. Deleting the AuthService CRD and re-adding it 5 minutes later does not fix the issue, but deleting it and re-adding it 1+ hour later does fix it, so Emissary seems to be expiring some internal cache at some point.
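That delete/re-add experiment was essentially the following (the resource name and manifest file are illustrative):

```sh
# Remove the AuthService resource, then re-apply it later.
kubectl -n auth-grpc-staging delete authservices.getambassador.io auth-grpc-external
# Re-applying after ~5 minutes does not recover the endpoints;
# re-applying after 1+ hour does.
kubectl -n auth-grpc-staging apply -f authservice.yaml
```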

All the generated Envoy config looks correct, so I would think that the issue lies within the ambex endpoint resolver handling specific to auth services.

Emissary version 3.9.1, deployed within GKE.