Scaling down gRPC target fails as still forwards requests while draining

costap commented 10 months ago

I have a gRPC route configured for a target deployment. When I scale down that target, the killed pods transition to a draining state, but Lattice still seems to forward new requests to these targets, which causes those requests to fail. In my tests it takes up to 5 minutes for the pods to be removed from the target group.

Before scaling down:

After scaling down:

During this period some requests to the VPC Lattice service fail with the error below, presumably the ones sent to one of the targets in the draining state.

error getting server verion : rpc error: code = Unavailable desc = Service Unavailable

Once the draining pods are removed from the target group, after around 5 minutes, there are no more failed requests.
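As a stopgap while draining targets still receive traffic, a client could retry calls that fail with Unavailable. The following is only a minimal sketch of that idea and is not something proposed in this thread: `UnavailableError`, `call_with_retry` and `make_flaky_call` are hypothetical names, and the exception stands in for a gRPC error whose status code is UNAVAILABLE so that the example runs without a gRPC dependency.

```python
import time


class UnavailableError(Exception):
    """Hypothetical stand-in for a gRPC error with status code UNAVAILABLE."""


def call_with_retry(call, attempts=4, base_delay=0.2, sleep=time.sleep):
    """Run call(); retry with exponential backoff only on UnavailableError."""
    for attempt in range(attempts):
        try:
            return call()
        except UnavailableError:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error to the caller
            sleep(base_delay * 2 ** attempt)  # wait 0.2s, 0.4s, 0.8s, ...


def make_flaky_call(failures):
    """Hypothetical RPC that fails `failures` times (draining target), then succeeds."""
    state = {"left": failures}

    def call():
        if state["left"] > 0:
            state["left"] -= 1
            raise UnavailableError("Service Unavailable")
        return "ok"

    return call


if __name__ == "__main__":
    # Two simulated failures, then success on the third attempt.
    print(call_with_retry(make_flaky_call(2)))
```

Retrying like this is only safe for idempotent calls, and it hides the symptom rather than fixing the routing to draining targets.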

zijun726911 commented 10 months ago

Hi costap, thanks for reporting this issue. I am able to reproduce this bug: if the health check is disabled, VPC Lattice still sends traffic to the draining targets. This is actually a VPC Lattice bug, not a gateway controller issue. The VPC Lattice team is actively working on it. We will let you know once this issue has been fixed.

zijun726911 commented 10 months ago

Hi @costap, VPC Lattice has rolled out the fix for this draining target bug to all regions, so you can try again. Feel free to reopen this issue if you still run into this bug.

aws / aws-application-networking-k8s

Scaling down gRPC target fails as still forwards requests while draining #582