aws / copilot-cli

The AWS Copilot CLI is a tool for developers to build, release, and operate production-ready containerized applications on AWS App Runner or Amazon ECS on AWS Fargate.
https://aws.github.io/copilot-cli/
Apache License 2.0

proposal: add new service pattern "Worker Service" #2550

xar-tol closed this issue 3 years ago

xar-tol commented 3 years ago

Worker Service Description

Copilot currently supports two main types of workload patterns: long-running services and scheduled jobs. Some services can receive incoming connections, such as Load Balanced Web Services or Backend Services which expose a port. A new pattern suggested by customers is one which publishes to, subscribes to, and processes events asynchronously #1246, #2541.

We propose adding a new pattern "Worker Service" that will let the main service container consume events from a queue, which is populated by an SNS topic owned by another service.

Use Cases

Worker services will support the following use cases:

  1. Allow services to publish events. This lets existing services publish events to a destination where other services can consume them asynchronously.
  2. Set up a service that consumes events from a queue. This lets the new service receive event messages asynchronously from other services in your application.

1. Worker Service Manifest

We will add a new service type, "Worker Service", selectable when initializing a service.

Full worker service example

For a full example, a worker service accepts the typical Copilot service fields plus the new subscribe field. The subscribe field establishes an SQS queue from which the service consumes events.

name: merchants-worker
type: Worker Service

image:
  build: Dockerfile
cpu: 256    
memory: 512 
count: 1 
exec: true  

# New field that allows you to subscribe to events from other services in your application.
subscribe:
  # SNS topics from other services to subscribe to.
  topics:
    - name: inventory                # Name of the topic.
      service: products             # Name of the service that is publishing "inventory" events.

  # Optional fields for the subscribed SQS queue to the SNS topics.
  queue:
    retention: 345600s            # Optional. Amount of time to retain a message. Default 345600s.
    delay: 0s                     # Optional. Amount of time to delay before delivering messages in the queue. Default 0.
    timeout: 30s                  # Optional. Amount of time that a message will be unavailable after it is delivered from the queue. Default 30s.
    # Optional. Specify a dead-letter queue and number of processing attempts before message diversion. Default none.
    dead_letter:            
      queue_id: DeadLetterName    # Name of a pre-existing dead letter queue where messages are sent.
      tries: 10                   # Number of tries before sending a message to the dead letter queue.
    kms: true                     # Optional. Enable KMS encryption. Default false.
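For reference, the queue settings above would presumably map onto standard Amazon SQS queue attributes (the attribute names below come from the SQS CreateQueue API; the exact mapping Copilot uses is an assumption, and the dead-letter ARN is elided):

```json
{
  "MessageRetentionPeriod": "345600",
  "VisibilityTimeout": "30",
  "DelaySeconds": "0",
  "RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:...\",\"maxReceiveCount\":10}",
  "KmsMasterKeyId": "alias/aws/sqs"
}
```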

Minimum subscribe field

The minimal configuration requires only 'topics' to be specified. All other values default to opinionated best-practice values which should work for most customers.

subscribe:
  topics:
    - name: inventory
      service: products

2. Publishing Events

We will add a new optional publish field to all existing services and jobs within Copilot, including the worker service itself. This will allow creating an SNS topic (or, in the future, an EventBridge event bus) to publish service events.

Publish Field with Workers

The publish field specifies the resource used to publish events (SNS) and the workers allowed to subscribe to those events. When you specify a publish entry, Copilot will create an SNS topic with the given name and generate the policies necessary for the allowed_workers to subscribe to the events.

# In products/manifest.yml

publish:
  topics:                  
    - name: inventory        # Name of the topic.
      allowed_workers:       # Optional. List of names of workers that can subscribe to inventory events from the products service.
        -  merchants-worker

Minimum Publish Field

The minimum for the publish field is the name of the topic. When no workers have been created yet or you are not sure of the worker name, you can set up topics with no subscriptions:

publish:
  topics: 
    - name: inventory

3. Deploy Experience

The deploy experience for a worker service involves (1) setting up the publishers and (2) subscribing the worker service to published events.

1. The publishing service must be deployed.

The publishing service's manifest must include the desired SNS topics and the names (or soon-to-be names) of the workers it will publish to, so that the worker service has access to them.

2. The worker service must be deployed.

The worker service must include the names of the publishers it subscribes to in order to actually receive messages from them. Upon deployment completion, the worker service outputs the SQS queue URIs.

Initialization of Worker Service

copilot svc init
Which service type best represents your service's architecture? [Use arrows to move, type to filter, ? for more help]
> Worker Service

What do you want to name this Worker Service? [? for help] hello

Which dockerfile would you like to use for hello? [Use arrows to move, type to filter, ? for more help]
./Dockerfile

Deployment of Worker Service

copilot svc deploy --name hello --env test
...

✔ Proposing infrastructure changes for stack testapp-test-hello
- Creating the infrastructure for stack testapp-test-hello  [create complete]
- Service discovery for your services to communicate within the VPC  [create complete]
- Update your environments shared resources [create complete]
- An IAM Role for the Fargate agent to make AWS API calls on your behalf [create complete]

✔ Deployed hello, its service discovery endpoint is hello.testapp.local:80.

Update hello's code to leverage the injected environment variable COPILOT_QUEUE_URIS.
For example, in JavaScript, you can write 'const {SQSQueue1, DeadLetterName, SQSQueue2, SQSQueue2DeadLetter} = JSON.parse(process.env.COPILOT_QUEUE_URIS)'.
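A Python equivalent of that JavaScript snippet might look like the sketch below. The queue names, URLs, and the sample SQS body are made up for illustration; the SNS notification envelope (Type, TopicArn, Message) is the standard shape an SQS queue receives from an SNS subscription when raw message delivery is disabled.

```python
import json
import os

# Hypothetical value Copilot might inject; real key names depend on your topics.
os.environ.setdefault("COPILOT_QUEUE_URIS", json.dumps({
    "SQSQueue1": "https://sqs.us-west-2.amazonaws.com/111111111111/queue1",
    "DeadLetterName": "https://sqs.us-west-2.amazonaws.com/111111111111/dlq",
}))

# Mirror of the JavaScript destructuring example from the proposal.
queue_uris = json.loads(os.environ["COPILOT_QUEUE_URIS"])
print(queue_uris["SQSQueue1"])

# When a message arrives via SNS -> SQS (without raw message delivery),
# the SQS body is an SNS envelope; the app's payload lives under "Message".
sample_sqs_body = json.dumps({
    "Type": "Notification",
    "TopicArn": "arn:aws:sns:us-west-2:111111111111:inventory",
    "Message": json.dumps({"sku": "abc-123", "stock": 4}),
})
envelope = json.loads(sample_sqs_body)
payload = json.loads(envelope["Message"])
print(payload["sku"])
```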
bvtujo commented 3 years ago

We'd love your feedback as we continue to refine our design and start to work on this pattern! If you use SQS for asynchronous event processing, can you answer some of the following questions about your experience? Or, if your architecture looks nothing like this, we'd love to know about that, too.

metaskills commented 3 years ago

👋🏼 Not a Copilot user but interested in this topic in general. I do have experience running Rails based container workloads on both K8s EKS and Lambda. When using Lambda, we have it easy since we use managed SQS polling and/or native EventBridge events. More details here on Rails, Lambda, ActiveJob, SQS, etc.

Do you use SNS topics and SQS queues to decouple microservices today? What does your async architecture look like?

Yup, this might provide an idea on a future state with equal footing in Lambda. For context, our decoupled services today rely heavily on an aging AMQP with managed sidecars to poll/listen for new messages. This includes using a Rails job system like ActiveJob with a Redis store. In a future state we really want to go to managed polling, services, delivery and reduce Redis, AMQP, etc.

Do you use any sidecars to forward/transform SQS messages?

No. When using ActiveJob with SQS with Lambdakiq or other ActiveJob software that uses a SQS backend, the app needs to get the message as delivered for the software to work.

So some questions or interests I'd have in seeing how y'all work this out would be something like... would the solutions you implement be coupled to Copilot or would they leverage some shared AWS platform capabilities that others could tap into?

ASIDE: One thing I learned about when using native SQS polling with Lambda is that the concurrency is limited depending on your error rates. Even a small percentage of errors (https://github.com/customink/lambdakiq#concurrency--limits) can drastically reduce your concurrency. Not sure how you will implement things, but wanted to share that a common pattern of background jobs is dealing with failures in a positive way via a retry backoff.
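The retry backoff pattern mentioned above is commonly implemented as capped exponential backoff with jitter. A minimal sketch (the base and cap values are arbitrary, not anything Copilot prescribes):

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Return the sleep time in seconds before retry `attempt` (0-based).

    Exponential growth capped at `cap`, with "full jitter": a uniformly
    random delay in [0, capped_max] so many workers don't retry in lockstep.
    """
    capped = min(cap, base * (2 ** attempt))
    return random.uniform(0, capped)

# Deterministic upper bounds for the first few attempts: 1s, 2s, 4s, ... capped at 60s.
bounds = [min(60.0, 1.0 * 2 ** n) for n in range(8)]
print(bounds)
```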

toricls commented 3 years ago

This may be slightly off topic but I have a quick question - how can each service team have clear ownership boundaries for shared-between-teams-ish AWS resources such as SQS queues or SNS topics? In my understanding, shared resources are created and managed in the "Environment" level in the current Copilot concepts but the resources I mentioned above will be/should be in the "Service" level I think. So my question is - is there any mechanism in the subscribe and the publish field concepts to define clear ownership boundaries on the shared-ish resources for services and service teams?

bvtujo commented 3 years ago

@toricls our architectural principles are the following:

  1. Publishers own their SNS topics and can authorize other microservices to subscribe to them.
  2. Subscribers own their queues (by default, one queue per service, but up to one queue per subscribed SNS topic).
  3. There should be no hard dependencies between microservices.

This means that one service (publisher) can set up its manifest as follows and run copilot svc deploy.

publish:
  - name: database-events
    allowed_workers: [worker1]

This can be done even before worker1 exists, as we will authorize IAM actions between SQS/SNS with resource tags based on the service name. This will create an SNS topic named app-env-publisher-database-events-topic which will allow SQS queues with the copilot-service: worker1 resource tag to create subscriptions.

Then, worker1 can do this to create a queue specific to the database-events topic.

subscribe:
  topics:
    - name: database-events
      service: publisher
      queue: 
        fifo: true

Or, to customize the service-level shared queue:

subscribe:
  topics:
    - name: database-events
      service: publisher
  queue:
    fifo: true

Services can be deleted in any order.

Does that answer your questions?

toricls commented 3 years ago

Thanks @bvtujo! Absolutely YES your comment answers my questions and explains a lot more than I originally thought! 😂 I totally agree with the principles and believe it gives users better ownership models for their resources.

mtoftum commented 3 years ago

Looking at this proposal and finding the options around subscribe and publish a little confusing, to be honest. As I understand it, you can read from SNS and SQS and publish to SNS and EventBridge? I'm a little confused why subscribing to SNS is required and why you can't publish to SQS.

Just as a thought experiment: what if it was a basic worker by default, with optional support for subscribing and publishing to different event services that can be expanded with new types over time? Also make it possible to either subscribe/publish to an existing service or create a new one.

subscribe:
  SNS:
    - name: sns-test
      create: false
  SQS:
    - name: incoming
      create: true
      ...
publish:
  SNS:
  ...

(sorry if my yaml looks wrong, but I've not reached peak caffeine levels yet) I would start out with a minimum of SNS and SQS but make it easy to add others, and I would allow creating workers that don't subscribe or publish to anything.

Side note: for something like dead_letter: it would be nice to have an option to auto create a deadletter queue based on the queue being created.

bvtujo commented 3 years ago

Hi @mtoftum!

Reading through your questions, it looks like this proposal actually addresses most of them! Our solution is constrained by the requirement that you should be able to delete any worker or publisher in any order (that is, no resources used by either a publisher or a subscriber should create a hard dependency between those services).

To answer some of your questions:

  1. Why can't I publish to SQS directly? We do this to a) ensure that all messages are delivered to all subscribers and b) allow messages to be delivered to multiple workers for different purposes. Using SNS for publishers addresses b) out of the box: SNS delivers a message to all subscribers, and the stack which creates the topic doesn't have to manage those subscriptions. For a), if we published directly to SQS, that would limit our ability to subscribe multiple workers to the queue and guarantee that they all receive all messages. For example, if we published directly to one SQS queue and authorized worker1, which processes those events and uses them to make a database request, and worker2, which aggregates the events into metrics, then messages consumed and successfully processed by worker1 would NEVER be seen by worker2. Publishing to SNS means that both workers have queues which get filled by the SNS topic, and both get the chance to process all messages.
  2. What if it was a basic worker by default and then optional support for subscribing and publishing to different event services over time? This is essentially what we plan to do! The subscribe field is optional, but when it's specified we will create either one SQS queue for the service, or an SQS queue for each topic the service is subscribed to.
  3. It would be nice to have an option to auto-create a dead-letter queue. This design also includes that! If you wanted to create a dead-letter queue for a subscriber to an SNS topic, you could do:
    subscribe:
      topics:
        - name: database-events
          service: api
      queue:
        retention: 4d
        delay: 0s
        dead_letter:
          tries: 10             # Number of redrives before sending a message to the dead letter queue.

    This will create a queue for the worker, subscribe the queue to the database-events SNS topic from the api service, and create a dead-letter queue which the queue will drop messages into after 10 failed processing attempts.
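In SQS terms, a `tries` value of 10 presumably translates into a redrive policy on the worker's queue (RedrivePolicy and maxReceiveCount are real SQS queue-attribute names; the ARN below is a hypothetical placeholder):

```json
{
  "RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:<region>:<account-id>:worker-dlq\",\"maxReceiveCount\":10}"
}
```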

Does that help clarify what we're doing?

shaktek commented 3 years ago

@bvtujo

Architecturally, we manage the provision and access of SNS, SQS independently of the services that consume their messages (using Terraform). This allows us to manage IAM access centrally and publish discovery configurations consistently.

This loose coupling between sources and consumers also gives us the flexibility to migrate consumers as needed, e.g. moving to whichever compute platform suits us best as AWS releases new ones :) (EC2 -> Containers, Containers -> Lambda) without any downtime for producers and with easy rollback.

Adding auto-scaling to this feature would add value for customers (like us) that have large/spiky loads and currently have to do it manually in Terraform. Ticket 1933 is already open for it.

Do you use SNS topics and SQS queues to decouple microservices today? What does your async architecture look like?

Yes. We have SNS -> Consumers, SQS -> Consumers, SNS -> SQS->Consumers and Kinesis -> Consumers.

Do you use any sidecars to forward/transform SQS messages?

No.

Do you use EventBridge eventbuses?

Planned.

Do multiple services publish into the same eventbus or is it a separate bus per service?

Same eventbus.

Do you have a way of structuring events so that they can be picked up by the right target using event patterns?

Yes, centralised libraries used by producing services emit messages in the CloudEvents standard.
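For readers unfamiliar with it, a CloudEvents 1.0 message carries a small set of required context attributes (id, source, specversion, type) plus an optional data payload. A minimal sketch of producing and checking one (the type, source, and payload values are made up):

```python
import json
import uuid

# Required context attributes per the CloudEvents 1.0 spec.
REQUIRED_ATTRS = {"id", "source", "specversion", "type"}

def make_event(event_type, source, data):
    """Build a CloudEvents-1.0-style event as a plain dict."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),      # unique per event
        "source": source,             # identifies the producer
        "type": event_type,           # used by consumers/event patterns for routing
        "data": data,
    }

def is_valid(event):
    """Check that every required CloudEvents attribute is present."""
    return REQUIRED_ATTRS.issubset(event)

event = make_event("com.example.inventory.updated", "/services/products", {"sku": "abc-123"})
print(is_valid(event))   # True
print(json.dumps(event)[:40])
```

Structuring the type field consistently (e.g. reverse-DNS names) is what lets EventBridge event patterns route each message to the right target.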

efekarakus commented 3 years ago

The Worker Service pattern is now released! https://github.com/aws/copilot-cli/releases/tag/v1.10.0 🥳