aws / aws-node-termination-handler

Gracefully handle EC2 instance shutdown within Kubernetes
https://aws.amazon.com/ec2
Apache License 2.0

No SQS retry on "read: connection reset by peer" #717

Open eugenea opened 2 years ago

eugenea commented 2 years ago

Describe the bug

NTH does not retry the AWS SDK request that retrieves SQS queue messages when the connection is reset.

Steps to reproduce

Block access to the AWS SQS endpoint at the firewall, then let NTH monitor for SQS events.

Expected outcome

The network layer cannot be guaranteed to be reliable, so NTH should retry the request.

Application Logs

WRN There was a problem monitoring for events error="RequestError: send request failed\ncaused by: Post \"https://sqs.us-west-2.amazonaws.com/\": read tcp 100.100.xx.xx:xxxx->10.xx.xx.xx:443: read: connection reset by peer" event_type=SQS_TERMINATE

Environment

The check that denies the retry is here. For v1 of the AWS SDK, the fix would be a custom retryer that re-implements the should-retry check, injected here. However, upgrading to v2 of the AWS SDK should fix this issue automatically, because it does not distinguish between different kinds of connection reset and retries them all, which is the desired behavior here.

snay2 commented 1 year ago

Thank you for the suggestion! Upon first reading, we would favor doing the upgrade to v2 of the AWS SDK if it can handle this logic automatically.

eugenea commented 1 year ago

Do you have any timeline/plan for v2 upgrade?

snay2 commented 1 year ago

No firm timeline yet, but it's one of our ongoing projects at the moment.

jillmon commented 1 year ago

@eugenea, the beta version of the NTH v2 upgrade has recently been released. Have you had a chance to investigate whether the new SDK can handle this use case?