bbc / sqs-consumer

Build Amazon Simple Queue Service (SQS) based applications without the boilerplate
https://bbc.github.io/sqs-consumer/

[Bug]: When consumer fails to connect to SQS due to connectivity instant retry occurs #490

Open fraserbenjamin opened 2 months ago

fraserbenjamin commented 2 months ago

Describe the bug

Apologies for raising another issue, but the key error case hasn't been resolved despite PR #485, which added additional support for SQS error types. Currently, if the consumer cannot access the queue due to an internet connectivity problem or an invalid URL being passed in, it throws an error like the one below, which isn't caught by isConnectionError and causes an infinite loop.

{"name":"SQSError","code":"Error","time":"2024-04-29T08:08:58.721Z"}

This is why, in the previous PR, I added the code SQSError as a potential way to resolve it; alternatively, the code Error could be used in the current version, since it's such a generic message output from the AWS SDK.

If the consumer cannot connect to SQS because there is no internet connection or, in the unlikely event, SQS is down, and pollingWaitTimeMs is 0 (the default), it instantly retries the connection, causing a continual flurry of requests. There is currently no way to mitigate this other than setting pollingWaitTimeMs to a higher value, which isn't ideal either, since it also adds a longer wait between normal polls. Authentication errors are caught and can have a back-off configured with authenticationErrorTimeout, but no equivalent exists for general connection errors.
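
For reference, these are the only two knobs that exist today, as I understand them; neither really covers this case (the queue URL and values below are just placeholders):

```js
const { Consumer } = require('sqs-consumer');

const consumer = Consumer.create({
  queueUrl: process.env.QUEUE_URL, // placeholder
  handleMessage: async (message) => {
    console.log(message.Body);
  },
  // Delay between polls, applied whether or not the previous receive failed.
  // Defaults to 0, so a failing ReceiveMessage call is retried immediately.
  pollingWaitTimeMs: 0,
  // Back-off applied only when the error is recognised as an authentication error.
  authenticationErrorTimeout: 10000,
});
```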

Your minimal, reproducible example

https://stackblitz.com/edit/github-pwpmkg?file=index.js
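
For anyone who can't open the link, index.js is roughly along these lines (a sketch, not the exact StackBlitz code; the queue URL is a placeholder):

```js
// Minimal reproduction sketch: with connectivity disabled (or a bad queue URL),
// every ReceiveMessage call fails and, with the default pollingWaitTimeMs of 0,
// is retried immediately, flooding the logs.
const { Consumer } = require('sqs-consumer');

const consumer = Consumer.create({
  queueUrl: 'https://sqs.eu-west-1.amazonaws.com/123456789012/example-queue',
  handleMessage: async (message) => {
    console.log(message.Body);
  },
});

consumer.on('error', (err) => {
  // With no connectivity this fires in a tight loop.
  console.error(JSON.stringify({ name: err.name, code: err.code, time: new Date() }));
});

consumer.start();
```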

Steps to reproduce

Method 1 (Error)

  1. Clone the code from StackBlitz (essential as you can't disable connectivity in StackBlitz)
  2. Disable device internet connectivity
  3. Run the script using node index.js
  4. The error will occur repeatedly, spamming the logs and creating a high number of retries

Method 2 (TimeoutError)

  1. Open the codesandbox project
  2. Run the script using node index.js
  3. A TimeoutError will occur repeatedly, spamming the logs and creating a high number of retries

Expected behavior

As a user, I expected these retries to be throttled after a failure, but instead requests continue to be made and error messages are written to the console at an excessive rate.

How often does this bug happen?

Every time

Screenshots or Videos

(Screenshot: 2024-04-29 at 09:33:11)

Platform

Occurs on all platforms; the issue has been reproduced on Node versions 16-20.

Package version

10.1.0

AWS SDK version

3.564.0

nicholasgriffintn commented 2 months ago

Hey, yeah, we can't react to the code Error or the error type SQSError overall, as that would make every error from SQS a connection error, which isn't what we're looking to do.

One possible way of detecting this might be to look for ENOTFOUND, but that's also a little hacky and not well supported.

I don't think we can ultimately support every possible use case with this library, and we're not really looking to; its main aim is to be simple boilerplate that we used internally and shared externally.
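
Just to illustrate, a user-land check might look something like this (assuming the ENOTFOUND code surfaces on the error or in its message, which isn't guaranteed):

```js
// Illustrative only: guess at DNS-style connection failures in user code.
// Assumes the underlying ENOTFOUND code appears on the error or in its message.
const looksLikeDnsFailure = (err) =>
  err.code === 'ENOTFOUND' || /ENOTFOUND/i.test(err.message || '');

// Example usage on a created consumer:
// consumer.on('error', (err) => {
//   if (looksLikeDnsFailure(err)) {
//     // treat as a connection problem and back off before polling again
//   }
// });
```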

fraserbenjamin commented 2 months ago

> Hey, yeah, we can't react to the code Error or the error type SQSError overall, as that would make every error from SQS a connection error, which isn't what we're looking to do.
>
> One possible way of detecting this might be to look for ENOTFOUND, but that's also a little hacky and not well supported.
>
> I don't think we can ultimately support every possible use case with this library, and we're not really looking to; its main aim is to be simple boilerplate that we used internally and shared externally.

Yeah, 100% agree, and I appreciate your time on this so far, thanks; with high traffic, treating every error as a connection error could cause a big delay if the odd message fails for a genuine reason. I totally understand not being able to support every use case, although a malformed URL or a lost connection are surely two very common cases, especially when deploying to the cloud, where a misconfigured security group or a bad environment variable could then cause high CPU/memory usage and a lot of potentially expensive logging before anyone notices. That matters all the more because the loop can run every 1-2 ms, which is a lot of events.

In the first draft PR I raised, I added another configurable option so that a timeout is applied on any error, which, whilst not perfect, I'd say is a better alternative than leaving the potential for the issues mentioned above. I'd be keen to know your opinion on adding something like that back in; I'm happy to do the work, but I want to check it's the direction you'd like to go first.
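
To make the proposal concrete, usage could look something like this (errorTimeout is a hypothetical name, not an existing sqs-consumer option):

```js
// Hypothetical sketch: "errorTimeout" is NOT an existing sqs-consumer option.
// It illustrates the proposed behaviour: apply a back-off after any receive or
// connection error, mirroring the existing authenticationErrorTimeout.
const { Consumer } = require('sqs-consumer');

const consumer = Consumer.create({
  queueUrl: process.env.QUEUE_URL,
  handleMessage: async (message) => {
    /* process the message */
  },
  authenticationErrorTimeout: 10000, // existing option: back-off after auth errors
  errorTimeout: 10000,               // proposed: back-off after any polling error
});

consumer.start();
```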