aws / aws-sdk-js

AWS SDK for JavaScript in the browser and Node.js (In Maintenance Mode, End-of-Life on 09/08/2025). The AWS SDK for JavaScript v3 in the browser and Node.js is available here: https://github.com/aws/aws-sdk-js-v3
https://aws.amazon.com/developer/language/javascript/
Apache License 2.0

Amazon SQS Stall #1005

Closed BarryCarlyon closed 6 years ago

BarryCarlyon commented 8 years ago

Occasionally, after periods of running fine, the Node.js Amazon SQS client will just "stop" and not issue the "no messages" response after the timeout.

The issue is outlined in this Stack Overflow question: http://stackoverflow.com/questions/37111431/amazon-sqs-with-aws-sdk-receivemessage-stall

Basically in summary:

Occasionally, my script is stalling because I don't get a response from Amazon at all. Say, for example, there are no messages in the queue to consume: instead of hitting the WaitTimeSeconds timeout and returning a "no messages" object, the callback isn't called at all.

chrisradek commented 8 years ago

@BarryCarlyon Are you seeing these errors after a long period of time has passed? Can you share what version of the SDK/node.js you are using?

In the SO example, it looks like sqs.receiveMessage gets called on an interval of 500 ms. My initial thought is that enough of these requests get created that over time a tipping point is reached where there are just too many sockets in use.

If the above is the case, there are a couple ways to solve this problem. One would be to configure the SDK to use a maximum number of sockets.

var AWS = require('aws-sdk');
var https = require('https');

var sqs = new AWS.SQS({
  httpOptions: {
    agent: new https.Agent({
      maxSockets: 50 // number chosen arbitrarily
    })
  }
});

The maxSockets approach won't solve any issues you might be having with high memory usage though, as requests could queue up while waiting for a free socket.

The other thing you could do is wait for sqs.receiveMessage to return before calling it again. You would essentially just make sqs.receiveMessage call itself inside of the callback. You can call it within a setTimeout so that it's called on the next tick, and so that the stack size doesn't grow too large. This method would have the biggest positive impact on your memory usage, and you can still have multiple sqs.receiveMessage requests going in parallel.
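Here's a rough sketch of that pattern (pollQueue, handleMessage, the region, and the queue URL are placeholders for illustration, not part of the SDK):

var AWS = require('aws-sdk');

var sqs = new AWS.SQS({ region: 'us-east-1' }); // region chosen for illustration
var queueUrl = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue'; // placeholder

function handleMessage(message) {
  // your own processing goes here
  console.log(message.MessageId);
}

function pollQueue() {
  sqs.receiveMessage({
    QueueUrl: queueUrl,
    MaxNumberOfMessages: 10,
    WaitTimeSeconds: 20
  }, function(err, data) {
    if (err) {
      console.error(err);
    } else if (data.Messages) {
      data.Messages.forEach(handleMessage);
    }
    // only kick off the next receiveMessage once this one has returned;
    // setTimeout defers it to a later tick so the call stack doesn't grow
    setTimeout(pollQueue, 0);
  });
}

pollQueue(); // start one loop; start several if you want parallel polling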

BarryCarlyon commented 8 years ago

I'm looping every half a second, but it only fetches from SQS IF there isn't already an SQS fetch running. So there shouldn't be any hanging sockets, or sockets running concurrently (well, aside from calls to message delete, but the running flag should handle that).

I'm now on 2.3.16 (previously unsure) and seeing the same issue; my watchdog timer last caught and force reset at "Wed Jun 01 2016 11:27:07 GMT+0100".

Yes, it's after the job has been running for $some_time, say a few days or so.

I can't say I've seen high CPU/Memory usage. New Relic hasn't caught anything abnormal (as it's monitoring the server).

So in summary, I should be waiting for the current sqs.receiveMessage to finish before I call it again.

Code block follows:

    var running = false;
    var watchdogTimeout;

    var runMonitorJob = setInterval(function() {
        if (running) {
            // a fetch is already in flight, do nothing
        } else {
            running = true;

            // watchdog: force-clear the running flag if no response arrives
            clearTimeout(watchdogTimeout);
            watchdogTimeout = setTimeout(function() {
                console.log('WatchDog');
                running = false;
            }, 120000);

            sqs.receiveMessage({
                QueueUrl: queueUrl,
                MaxNumberOfMessages: 10,
                WaitTimeSeconds: 20
            }, function(err, data) {
                if (err) {
                    logger.fatal('Error on Message Receive');
                    logger.fatal(err);
                } else {
                    // all good
                    if (undefined === data.Messages) {
                        logger.info('No Messages Object');
                        timeCheck();
                    } else if (data.Messages.length > 0) {
                        logger.info('Messages Count: ' + data.Messages.length);

                        var delete_batch = [];
                        for (var x = 0; x < data.Messages.length; x++) {
                            // process
                            receiveMessage(data.Messages[x]);

                            // flag to delete
                            delete_batch.push({
                                Id: data.Messages[x].MessageId,
                                ReceiptHandle: data.Messages[x].ReceiptHandle
                            });
                        }

                        if (delete_batch.length > 0) {
                            logger.info('Calling Delete');
                            sqs.deleteMessageBatch({
                                Entries: delete_batch,
                                QueueUrl: queueUrl
                            }, function(err, data) {
                                if (err) {
                                    logger.fatal('Failed to delete messages');
                                    logger.fatal(err);
                                } else {
                                    logger.debug('Deleted received ok');
                                }
                            });
                        }

                    } else {
                        logger.info('No Messages Count');
                    }
                }

                running = false;
            });
        }
    }, 500);

logger is log4js, and receiveMessage basically dumps the message off to Redis, which I munch on later.

If I were to add maxSockets, what would I log to detect whether I've hit the limit? (Would it chuck an error somewhere?)
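One way to watch for that, assuming requests simply queue on the agent once maxSockets is reached (standard Node.js Agent behaviour, not anything SQS-specific), would be to periodically log the agent's own sockets and requests maps; the agent and interval below are purely illustrative:

var AWS = require('aws-sdk');
var https = require('https');

var agent = new https.Agent({ maxSockets: 50 });
var sqs = new AWS.SQS({ httpOptions: { agent: agent } });

// log how many sockets are in use and how many requests are queued waiting
// for one; a steadily growing "waiting" count means the pool is saturated
setInterval(function() {
  var inUse = Object.keys(agent.sockets).reduce(function(sum, host) {
    return sum + agent.sockets[host].length;
  }, 0);
  var waiting = Object.keys(agent.requests).reduce(function(sum, host) {
    return sum + agent.requests[host].length;
  }, 0);
  console.log('sockets in use: ' + inUse + ', requests waiting: ' + waiting);
}, 30000);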

neoadventist commented 8 years ago

Any update on this?

chrisradek commented 6 years ago

Closing old issues. If you're still encountering this issue, please open a new issue and reference this one.

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs and link to relevant comments in this thread.