aws / aws-application-networking-k8s

A Kubernetes controller for Amazon VPC Lattice
https://www.gateway-api-controller.eks.aws.dev/
Apache License 2.0

Scaling down gRPC target fails as still forwards requests while draining #582

Closed costap closed 10 months ago

costap commented 10 months ago

I configured a gRPC route for a target deployment. When I scale down that target, the terminated pods transition to a draining state, but VPC Lattice still seems to forward new requests to these targets, which causes those requests to fail. In my tests it takes up to 5 minutes for the pods to be removed from the target group.

Before scaling down:

Screenshot 2024-01-05 at 11 04 13

After scaling down:

Screenshot 2024-01-05 at 11 05 21

During this period some requests to the VPC Lattice service fail with the error below, presumably the ones sent to one of the targets in the draining state.

error getting server verion : rpc error: code = Unavailable desc = Service Unavailable

Once the draining pods are removed from the target group, after around 5 minutes, there are no more failed requests.
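To illustrate why only some requests fail, here is a toy model of the behaviour described above. It is a sketch under stated assumptions, not how VPC Lattice is implemented: it assumes requests are spread round-robin over every registered target, including draining ones, and that a request landing on a draining target fails because its pod is already gone. The pod names and the `simulate` function are hypothetical.

```python
from itertools import cycle


def simulate(targets, num_requests):
    """Route requests round-robin over targets and return how many fail.

    targets maps a hypothetical pod name to its state, "healthy" or
    "draining". Assumption: a request that lands on a draining target
    fails, because the pod behind it has already been terminated.
    """
    failures = 0
    rotation = cycle(sorted(targets))  # fixed order, repeated forever
    for _ in range(num_requests):
        pod = next(rotation)
        if targets[pod] == "draining":
            failures += 1
    return failures


# Hypothetical deployment scaled down from 4 pods to 2.
targets = {
    "pod-a": "healthy",
    "pod-b": "healthy",
    "pod-c": "draining",
    "pod-d": "draining",
}
failed = simulate(targets, 100)
print(f"{failed} of 100 requests failed")
```

Under these assumptions, the failure rate tracks the share of draining targets still in the target group, and drops to zero once they are removed, which matches the pattern observed here.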

zijun726911 commented 10 months ago

Hi costap, thanks for reporting this issue. I am able to reproduce this bug: if the health check is disabled, VPC Lattice still sends traffic to the draining targets. This is actually a VPC Lattice bug, not a gateway controller issue. The VPC Lattice team is actively working on it. We will let you know once it has been fixed.

zijun726911 commented 10 months ago

Hi @costap, VPC Lattice has rolled out the fix for this draining-target bug in all regions, so you can try again. Feel free to reopen this issue if you still hit this bug.