aws-samples / lambda-ecs-worker-pattern

This example code illustrates how to extend AWS Lambda functionality using Amazon SQS and the Amazon EC2 Container Service (ECS).
Apache License 2.0
290 stars, 45 forks

How to do this without SQS? #1

Closed astewart-twist closed 8 years ago

astewart-twist commented 8 years ago

I really dislike the SQS component here. Queueing up pending tasks that can't yet be assigned an ECS instance should be a natural function of the ECS scheduler.

Any thoughts on how that might be done?

glez-aws commented 8 years ago

Hi astewart-twist,

thanks for your comment!

Yes, you can eliminate the SQS component by running the ECS task and supplying the parameters for it through setting environment variables. Check out the "overrides" parameter of the runTask method here: http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/ECS.html#runTask-property

The reason I put the SQS queue here was to have a generic way of forwarding any type of event to the ECS task in an easy way.

So if your particular application uses event data that can be easily passed to an ECS task through setting environment variables, you should be good to go.
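A minimal sketch of what passing event data through the "overrides" parameter could look like, written here in Python with boto3-style calls rather than the JavaScript SDK linked above. The variable name `EVENT_JSON`, the function names, and the cluster, task definition, and container names are all hypothetical. ECS limits the total size of the overrides, so this only suits small events.

```python
import json


def build_run_task_params(event, cluster, task_definition, container_name):
    """Build RunTask keyword arguments that hand `event` to the container.

    EVENT_JSON is a made-up variable name; the container has to read and
    parse it itself.
    """
    environment = [
        # Environment values must be strings, so the event is serialized.
        {"name": "EVENT_JSON", "value": json.dumps(event, sort_keys=True)},
    ]
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "count": 1,
        "overrides": {
            "containerOverrides": [
                # `name` must match a container in the task definition.
                {"name": container_name, "environment": environment}
            ]
        },
    }


def run_job(ecs_client, event, cluster, task_definition, container_name):
    """Call RunTask and return (tasks, failures) from the response."""
    params = build_run_task_params(event, cluster, task_definition, container_name)
    response = ecs_client.run_task(**params)
    # RunTask can return an empty `tasks` list together with `failures`,
    # for example when no instance has enough free resources.
    return response.get("tasks", []), response.get("failures", [])
```

In a Lambda function, `ecs_client` would typically be `boto3.client("ecs")`. Passing the client in as an argument keeps the helper easy to test without AWS access.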

Hope this helps, Constantin

astewart-twist commented 8 years ago

Thanks for the reply @glez-aws. Actually the more I think about it I think I'm coming around to the idea of maintaining a dedicated SQS queue for ECS tasks. Still though, I think ideally it should be a function of the ECS cluster scheduler. If all resources across my ECS cluster are occupied, any additional task requests will simply be rejected.

In my current setup I'm already passing inputs to my ECS tasks via overrides. Works pretty well. I have a lambda function receiving events and attempting to run ECS tasks. Again, though, if the cluster is at capacity the task request is declined. I've added a retry loop but obviously the Lambda function itself will time out after 5 minutes. Using SQS as a persistent task queue would certainly prevent tasks from dropping completely, but a pure ECS solution I think would be really nice. I imagine either the ECS cluster itself maintains a (cluster-level) queue, or maybe ECS instance agents can be enabled to receive task requests even if over capacity but the tasks remain in pending state until resources on that ECS instance are free.
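The retry-then-persist idea described above could be sketched roughly as follows. This is an illustration, not the code from my setup: `run_or_enqueue` and its parameters are hypothetical names, and `remaining_ms` is assumed to be a callable such as the Lambda context's `get_remaining_time_in_millis`. It retries RunTask with exponential backoff and, before the Lambda timeout is reached, falls back to writing the event to SQS so the job is not dropped.

```python
import json
import time


def run_or_enqueue(ecs_client, sqs_client, run_task_params, queue_url, event,
                   remaining_ms, safety_margin_ms=10_000,
                   base_delay_s=1.0, max_delay_s=30.0, sleep=time.sleep):
    """Try to start an ECS task; if the cluster stays full, queue the event.

    Returns "started" if RunTask placed a task, otherwise "queued".
    """
    delay = base_delay_s
    while True:
        response = ecs_client.run_task(**run_task_params)
        if response.get("tasks"):
            return "started"
        # Stop retrying if another sleep would eat into the safety margin
        # we keep before the Lambda function times out.
        if remaining_ms() - delay * 1000 <= safety_margin_ms:
            break
        sleep(delay)
        delay = min(delay * 2, max_delay_s)
    # Fallback: persist the job so a worker can pick it up later.
    sqs_client.send_message(QueueUrl=queue_url, MessageBody=json.dumps(event))
    return "queued"
```

Inside a handler this would be called with `remaining_ms=context.get_remaining_time_in_millis`. The safety margin and backoff values are arbitrary defaults chosen for the example.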

From what I understand something like this currently isn't possible, correct?

glez-aws commented 8 years ago

Hi,

I think there are two components to keep in mind here:

  1. Job management (to distinguish from the ECS "task" concept): Handle a bunch of jobs to be done and make sure each one is completed.
  2. Worker management: Make sure that adequate compute capacity is allocated to completing jobs.

In the scenario above, Lambda essentially does both: it writes event data into an SQS queue (to help with 1) and kicks off an ECS task to drain the queue (to address 2).
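For illustration, the worker side (the ECS task that drains the queue) might look roughly like the sketch below. It is not the code from this repository: `drain_queue` and `handle_job` are hypothetical names, and it assumes each message body is JSON.

```python
import json


def drain_queue(sqs_client, queue_url, handle_job, wait_seconds=20):
    """Process messages until the queue looks empty; return how many ran."""
    processed = 0
    while True:
        # Long polling: wait up to `wait_seconds` for messages to arrive.
        response = sqs_client.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=wait_seconds,
        )
        messages = response.get("Messages", [])
        if not messages:
            return processed
        for message in messages:
            handle_job(json.loads(message["Body"]))
            # Delete only after the job succeeded. If handle_job raises,
            # the message stays in the queue and becomes visible again
            # once its visibility timeout expires.
            sqs_client.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=message["ReceiptHandle"],
            )
            processed += 1
```

Deleting a message only after it has been handled is what gives the "make sure each one is completed" property from point 1: a crashed worker leaves its job in the queue for another worker.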

Since this is just an example, both ways to address job management and worker management can be improved. Traditionally, whole software companies could be dedicated to either one of these aspects.

Today, we can simply plug cloud services together to accomplish what we need. Since SQS is easy to use and easy to integrate, we would recommend just using it together with ECS instead of adding yet another task scheduling mechanism as part of ECS: that's what the building block philosophy is about.

The missing piece in your setup is a way to automatically scale the number of running ECS tasks and a means to make sure there’s always at least one ECS task running that can drain your SQS queue.

ECS supports the notion of an ECS "service" that makes sure that a user-specified number of ECS tasks are running at all times. Have you considered implementing your ECS task as an ECS service?

You could then add automatic scaling by configuring a second Lambda function to be triggered by a CloudWatch alarm that watches the number of visible (and thus unprocessed) SQS messages. It would increase the number of tasks processing messages during periods of high load, then decrease it back down to 1 when the queue is seeing fewer messages.
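A rough sketch of what such a scaling function might do, under a few assumptions: instead of parsing the alarm notification, it reads the queue depth directly from SQS, and the sizing rule (`messages_per_task`, `max_tasks`) is invented for the example. The function names are hypothetical as well.

```python
import math


def desired_task_count(visible_messages, messages_per_task=100,
                       min_tasks=1, max_tasks=10):
    """Map queue depth to a task count, clamped to [min_tasks, max_tasks]."""
    wanted = math.ceil(visible_messages / messages_per_task)
    return max(min_tasks, min(max_tasks, wanted))


def scale_service(sqs_client, ecs_client, queue_url, cluster, service, **sizing):
    """Set the ECS service's desired count from the current queue depth."""
    attrs = sqs_client.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessagesVisible"],
    )
    # SQS returns attribute values as strings.
    visible = int(attrs["Attributes"]["ApproximateNumberOfMessagesVisible"])
    count = desired_task_count(visible, **sizing)
    ecs_client.update_service(cluster=cluster, service=service, desiredCount=count)
    return count
```

Keeping `min_tasks` at 1 matches the goal of always having at least one task available to drain the queue. The cluster still needs enough instance capacity for the service to actually place `max_tasks` tasks.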

Hope this helps, Constantin

glez-aws commented 8 years ago

Hi,

here’s a link to the ECS documentation on Services: http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_services.html

Cheers, Constantin

glez-aws commented 8 years ago

Closing this issue since it's more of a discussion topic and I believe some answers were offered.

astewart-twist commented 8 years ago

@glez-aws Thanks for the great info! Especially w.r.t. SQS+lambda+autoscaling. I've swapped out my earlier SNS-ping-pong approach for job request persistence and moved to a proper SQS queue approach.