aws / aws-activejob-sqs-ruby

Apache License 2.0

nginx timeout for jobs that take more than 60 sec (default nginx timeout) #2

Open januszm opened 2 weeks ago

januszm commented 2 weeks ago

I'm currently migrating from the Shoryuken task queue (a Sidekiq clone that uses SQS as its queue store) to AWS SDK Rails, and I've encountered a strange error. I mention Sidekiq because it's the de facto state-of-the-art task queue in the Ruby world. For longer jobs that take, e.g., 5-10 minutes to complete, I get a timeout and the message returns to the queue. It looks like the POST request to the application server waits for the task to complete, which is unacceptable for longer jobs, because some can take up to several hours. Am I doing something wrong, or is this software designed so that the SQS daemon, when submitting a task for execution via POST, waits for it to complete? I expected it to wait only for the task to be "accepted" for async execution, not for it to complete.

alextwoods commented 2 weeks ago

This is a tricky issue: when the worker returns a 200 OK, the SQS daemon deletes the message. If the job fails, normal Active Job retries are applied and a new message will be queued... but if the worker process is terminated, the worker is restarted, etc., then the message will be lost entirely.

Additionally, once a 200 OK is returned, the SQS daemon will post another request, potentially overloading the worker (this would depend on how we ran the jobs async).

Given all that though, the 60 second timeout is a big issue. I believe there are ways to configure this value to be higher (in the nginx config / load balancer config)? I'll think on this a bit more. Possibly a configuration option that lets you process jobs async using a thread pool of some fixed/configured size?

See: https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html#worker-daemon
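
To make the thread pool idea concrete, here is a minimal sketch of a fixed-size pool with a bounded backlog, using only the Ruby standard library. It is not the gem's API: `BoundedJobPool`, `try_submit`, and the `size:`/`backlog:` parameters are hypothetical names chosen for illustration. The bounded backlog is one possible answer to the overload concern, since a full pool can refuse new work instead of accepting it.

```ruby
# Hypothetical sketch, not part of aws-activejob-sqs or aws-sdk-rails.
# A fixed number of worker threads pull jobs from a bounded queue.
class BoundedJobPool
  def initialize(size:, backlog:)
    @queue = SizedQueue.new(backlog) # at most `backlog` jobs waiting
    @workers = Array.new(size) do
      Thread.new do
        # Queue#pop returns nil once the queue is closed and drained,
        # which ends the loop and lets the worker thread exit.
        while (job = @queue.pop)
          begin
            job.call
          rescue StandardError => e
            # A failed job must not kill the worker thread.
            warn "job failed: #{e.class}: #{e.message}"
          end
        end
      end
    end
  end

  # Returns true if the job was accepted, false if the backlog is full.
  def try_submit(&job)
    @queue.push(job, true) # non-blocking push raises ThreadError when full
    true
  rescue ThreadError
    false
  end

  # Stops accepting new jobs and waits for queued jobs to finish.
  def shutdown
    @queue.close
    @workers.each(&:join)
  end
end
```

With a pool like this, the HTTP handler could answer 200 only when `try_submit` returns true and a non-200 status otherwise, so that the message stays in the queue for a later attempt. That mapping is an assumption about how it would be wired up, and any job accepted this way is still lost if the process dies before running it.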

januszm commented 2 weeks ago

See: https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html#worker-daemon

"... response to acknowledge that it has received and successfully processed the request"

I think I only represent a certain "user group", but I'd expect this gem to respond 200 quickly once the job is accepted (i.e. not wait for it to be processed), use a thread pool to process as many jobs as possible on a single server, and not worry about a worker failing and losing a message, since this should be a rare case. Job code should be responsible for handling retries in such a case and rescheduling another job with a new message. As things stand, I don't think my team and I can simply replace Shoryuken with this gem: it's a slightly different architecture that requires a different approach, so it's not a drop-in replacement. Our jobs can take hours, or at least dozens of minutes, to complete.
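
The behaviour requested here (acknowledge on accept, let the job handle its own retry) could look roughly like the Rack-style sketch below. Everything in it is an assumption for illustration: `AcceptThenRunEndpoint`, `handler`, and `reenqueue` are hypothetical names, the payload is assumed to be JSON, and this is not how the gem's middleware currently works. For brevity it spawns one thread per request, which is unbounded; a real version would want a fixed-size pool.

```ruby
require "json"
require "stringio"

# Hypothetical Rack-style endpoint, not the gem's actual middleware.
# It returns 200 as soon as the payload is parsed and runs the job later.
class AcceptThenRunEndpoint
  def initialize(handler:, reenqueue:)
    @handler = handler     # callable that performs the job
    @reenqueue = reenqueue # callable that sends a new message for a retry
    @threads = []          # kept only so the demo can wait for completion
  end

  def call(env)
    payload = JSON.parse(env["rack.input"].read)
    @threads << Thread.new { run(payload) }
    # Respond before the job has run: "accepted", not "processed".
    [200, { "content-type" => "text/plain" }, ["accepted"]]
  rescue JSON::ParserError
    [400, { "content-type" => "text/plain" }, ["invalid job payload"]]
  end

  # Blocks until all background jobs started so far have finished.
  def wait
    @threads.each(&:join)
  end

  private

  def run(payload)
    @handler.call(payload)
  rescue StandardError
    # The original message is already gone once 200 was returned,
    # so a retry has to be scheduled as a brand new message.
    @reenqueue.call(payload)
  end
end
```

The trade-off raised earlier in the thread still applies: because the daemon deletes the message after the 200, a job that is running when the process is killed never reaches the `rescue` and is not rescheduled.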