Closed xar-tol closed 3 years ago
We'd love your feedback as we continue to refine our design and start to work on this pattern! If you use SQS for asynchronous event processing, can you answer some of the following questions about your experience? Or, if your architecture looks nothing like this, we'd love to know about that, too.
👋🏼 Not a Copilot user but interested in this topic in general. I do have experience running Rails-based container workloads on both K8s (EKS) and Lambda. When using Lambda, we have it easy since we use managed SQS polling and/or native EventBridge events. More details here on Rails, Lambda, ActiveJob, SQS, etc.
Do you use SNS topics and SQS queues to decouple microservices today? What does your async architecture look like?
Yup, this might provide an idea of a future state with equal footing in Lambda. For context, our decoupled services today rely heavily on an aging AMQP broker with managed sidecars to poll/listen for new messages. This includes using a Rails job system like ActiveJob with a Redis store. In a future state we really want to move to managed polling, services, and delivery, and reduce our reliance on Redis, AMQP, etc.
Do you use any sidecars to forward/transform SQS messages?
No. When using ActiveJob with SQS via Lambdakiq (or other ActiveJob adapters backed by SQS), the app needs to receive the message exactly as delivered for the software to work.
So some questions or interests I'd have in seeing how y'all work this out would be something like... would the solutions you implement be coupled to Copilot or would they leverage some shared AWS platform capabilities that others could tap into?
ASIDE: One thing I learned when using native SQS polling with Lambda is that your concurrency is limited depending on your error rates. Even a small percentage of errors (https://github.com/customink/lambdakiq#concurrency--limits) can drastically reduce your concurrency. Not sure how you will implement things, but I wanted to share that a common background-job pattern is handling failures gracefully via retry backoff.
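To illustrate the retry-backoff idea mentioned above, here is a minimal sketch (not tied to Lambdakiq or any particular library; the function name and constants are my own) of computing an exponential delay with full jitter, capped at SQS's 15-minute maximum per-message delay:

```python
import random

# SQS caps per-message DelaySeconds at 900 seconds (15 minutes).
MAX_DELAY_SECONDS = 900

def backoff_delay(attempt: int, base: float = 2.0) -> int:
    """Exponential backoff with full jitter.

    attempt: 1-based count of failed processing attempts so far.
    Returns a delay in whole seconds, capped at the SQS maximum.
    """
    cap = min(MAX_DELAY_SECONDS, base ** attempt)
    return int(random.uniform(0, cap))
```

A worker could feed this value into the message's visibility timeout or a re-enqueue `DelaySeconds`, so that repeatedly failing messages back off instead of hot-looping.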
This may be slightly off topic but I have a quick question - how can each service team have clear ownership boundaries for shared-between-teams-ish AWS resources such as SQS queues or SNS topics?
In my understanding, shared resources are created and managed at the "Environment" level in the current Copilot concepts, but the resources I mentioned above will be (or should be) at the "Service" level, I think.
So my question is: is there any mechanism in the `subscribe` and `publish` field concepts to define clear ownership boundaries on the shared-ish resources for services and service teams?
@toricls our architectural principles are the following:
This means that one service (`publisher`) can set up its manifest as follows and run `copilot svc deploy`.
```yaml
publish:
  - name: database-events
    allowed_workers: [worker1]
```
This can be done even before `worker1` exists, as we will authorize IAM actions between SQS/SNS with resource tags based on the service name. This will create an SNS topic named `app-env-publisher-database-events-topic`, which will allow SQS queues with the `copilot-service: worker1` resource tag to create subscriptions.
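To make the topic-side authorization idea concrete, here is an illustrative sketch of a topic access policy. This is not the policy Copilot actually generates; the `sns:Endpoint` condition key (a real SNS condition key that restricts the subscription endpoint) is used here as a stand-in for the tag-based check described above, and the account ID, region, and ARN pattern are made up:

```python
import json

def topic_policy(topic_arn: str, allowed_queue_arn_pattern: str) -> dict:
    """Sketch of an SNS topic policy restricting who may subscribe."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowWorkerQueueSubscription",
                "Effect": "Allow",
                "Principal": {"AWS": "*"},
                "Action": "sns:Subscribe",
                "Resource": topic_arn,
                # Only queue ARNs matching the pattern may be subscribed;
                # this stands in for the copilot-service tag check.
                "Condition": {"StringLike": {"sns:Endpoint": allowed_queue_arn_pattern}},
            }
        ],
    }

policy = topic_policy(
    "arn:aws:sns:us-west-2:123456789012:app-env-publisher-database-events-topic",
    "arn:aws:sqs:us-west-2:123456789012:app-env-worker1-*",
)
print(json.dumps(policy, indent=2))
```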
Then, `worker1` can do this to create a queue specific to the `database-events` topic.
```yaml
subscribe:
  topics:
    - name: database-events
      service: publisher
      queue:
        fifo: true
```
Or, to customize the service-level shared queue:

```yaml
subscribe:
  topics:
    - name: database-events
      service: publisher
  queue:
    fifo: true
```
Services can be deleted in any order.
Does that answer your questions?
Thanks @bvtujo! Absolutely YES, your comment answers my questions and explains a lot more than I originally expected! 😂 I totally agree with the principles and believe they give users a better ownership model for the resources.
Looking at this proposal and finding the options around subscribe and publish a little confusing, tbh. As I understand it, you can read from SNS and SQS and publish to SNS and EventBridge? I'm a little confused why subscribing to SNS is required and why you can't publish to SQS.
Just as a thought experiment: what if it was a basic worker by default and then optional support for subscribing and publishing to different event services that can be expanded with new types over time. Also make it possible to either subscribe/publish to an existing service or create a new.
```yaml
subscribe:
  SNS:
    - name: sns-test
      create: false
  SQS:
    - name: incoming
      create: true
  ...
publish:
  SNS:
    ...
```
(sorry if my yaml looks wrong, but I've not reached peak caffeine levels yet) I would start out with a minimum of SNS and SQS but make it easy to add others, and I would allow creating workers that don't subscribe or publish to anything.
Side note: for something like `dead_letter:`, it would be nice to have an option to auto-create a dead-letter queue based on the queue being created.
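Auto-creating a dead-letter queue, as suggested, essentially comes down to attaching a `RedrivePolicy` attribute to the source queue. A minimal sketch of building those SQS attributes (the function name and DLQ ARN are illustrative; this only constructs the attribute dict and makes no AWS calls):

```python
import json

def redrive_attributes(dlq_arn: str, max_receives: int = 10) -> dict:
    """SQS queue attributes wiring a source queue to a dead-letter queue.

    After max_receives failed receives, SQS moves the message to the DLQ.
    RedrivePolicy, deadLetterTargetArn, and maxReceiveCount are the real
    SQS attribute/field names.
    """
    return {
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receives}
        )
    }
```

These attributes would be passed when creating or updating the source queue (e.g. via CloudFormation or an SDK's create-queue call).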
Hi @mtoftum!
Reading through your questions, it looks like this proposal actually addresses most of them! Our solution is constrained by the requirement that you should be able to delete any worker or publisher in any order (that is, no resources used by either a publisher or a subscriber should create a hard dependency between those services).
To answer some of your questions:
- Why not publish directly to SQS? If a publisher wrote to a single queue shared by two workers, say `worker1`, which processes those events and uses them to make a database request, and `worker2`, which aggregates the events into metrics, then messages consumed and successfully processed by `worker1` would NEVER be seen by `worker2`. Publishing to SNS means that both workers have queues which get filled by the SNS topic, and both get the chance to process all messages.
- The `subscribe` field is optional, but when it's specified we will create either one SQS queue for the service, or an SQS queue for each topic the service is subscribed to.

```yaml
subscribe:
  topics:
    - name: database-events
      service: api
  queue:
    retention: 4d
    delay: 0s
    dead_letter:
      tries: 10 # Number of redrives before sending a message to the dead-letter queue.
```
This will create a queue for the worker, subscribe the queue to the `database-events` SNS topic from the `api` service, and create a dead-letter queue which the queue will drop messages into after 10 failed processing attempts.
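To make the `tries: 10` semantics concrete, here is a toy simulation (pure Python, no AWS calls; the function is hypothetical) of how a message's receive count determines whether its next delivery comes from the source queue or lands in the dead-letter queue:

```python
def next_delivery(receive_count: int, max_receive_count: int = 10) -> str:
    """Where the next delivery attempt goes.

    receive_count: how many times the message has already been received
    from the source queue without being deleted (i.e. failed attempts).
    SQS moves the message to the DLQ once the count reaches the limit;
    until then it stays on the source queue.
    """
    return "dead-letter" if receive_count >= max_receive_count else "source"
```

So with `tries: 10`, the 10th failed attempt is the last one served from the source queue; after that the message is redriven to the DLQ.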
Does that help clarify what we're doing?
@bvtujo
Architecturally, we manage the provision and access of SNS, SQS independently of the services that consume their messages (using Terraform). This allows us to manage IAM access centrally and publish discovery configurations consistently.
This loose coupling between sources and consumers also gives us the flexibility to migrate consumers as needed. For example, we can move to whichever compute platform suits us best as AWS releases them :), e.g. EC2 -> Containers or Containers -> Lambda, without any downtime for producers and with easy rollback.
Adding auto-scaling to this feature would add value for customers (like us) that have large/spiky loads and currently have to do it manually in Terraform. Ticket 1933 is already open for it.
Do you use SNS topics and SQS queues to decouple microservices today? What does your async architecture look like?
Yes. We have SNS -> Consumers, SQS -> Consumers, SNS -> SQS->Consumers and Kinesis -> Consumers.
Do you use any sidecars to forward/transform SQS messages?
No.
Do you use EventBridge eventbuses?
Planned.
Do multiple services publish into the same eventbus or is it a separate bus per service?
Same eventbus.
Do you have a way of structuring events so that they can be picked up by the right target using event patterns?
Yes, centralised libraries used by services that produce messages in CloudEvents standard.
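For readers unfamiliar with it, CloudEvents is a CNCF specification that defines a common envelope for event data. A minimal sketch of a structured-mode JSON event is below; the attribute names (`specversion`, `id`, `source`, `type`, etc.) come from the CloudEvents 1.0 spec, while the `source`/`type`/`data` values are made up for illustration:

```python
import json
import uuid
from datetime import datetime, timezone

# specversion, id, source, and type are the required CloudEvents 1.0
# context attributes; time and datacontenttype are optional.
event = {
    "specversion": "1.0",
    "id": str(uuid.uuid4()),
    "source": "/services/publisher",          # hypothetical producer URI
    "type": "com.example.database.updated",   # hypothetical event type
    "time": datetime.now(timezone.utc).isoformat(),
    "datacontenttype": "application/json",
    "data": {"table": "orders", "op": "insert"},
}
print(json.dumps(event, indent=2))
```

A shared library producing this shape lets any consumer route or filter on `type` and `source` without knowing producer internals.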
The Worker Service pattern is now released! https://github.com/aws/copilot-cli/releases/tag/v1.10.0 🥳
Worker Service Description
Copilot currently supports two main types of workload patterns: long-running services and scheduled jobs. Some services can receive incoming connections, such as Load Balanced Web Services or Backend Services which expose a port. A new pattern suggested by customers is one which publishes to, subscribes to, and processes events asynchronously (#1246, #2541).
We propose adding a new pattern "Worker Service" that will let the main service container consume events from a queue, which is populated by an SNS topic owned by another service.
Use Cases
Worker services will support the following use cases:
1. Worker Service Manifest
We will add a new type of service when initializing a service: "Worker Service".

Full worker service example

For a full example, the worker service allows the typical Copilot service fields plus the `subscribe` field. The `subscribe` field allows establishing an SQS queue for the service to consume events from.

Minimum subscribe field

The minimal configuration needed is to have `topics` specified. All other values will default to opinionated best-practice values which should work for most customers.
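Under those defaults, a minimal `subscribe` block might look like the following (topic and service names are placeholders):

```yaml
subscribe:
  topics:
    - name: database-events
      service: api
```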
2. Publishing Events
We will add a new optional field `publish` to all existing services and jobs. This field will be available for all jobs and services within Copilot, including the worker service itself. It will allow the creation of an SNS topic (or, in the future, an EventBridge event bus) to publish service events.

Publish Field with Workers

The `publish` field will specify the resource used to publish events (SNS) and the allowed workers that can subscribe to these events. When you specify a `publish` entry, Copilot will create an SNS topic with the given name and generate the necessary policies for the `allowed_workers` to subscribe to the events.

Minimum Publish Field

The minimum for the `publish` field is the name of the topic. When no workers have been created yet, or you are not sure of the worker name, you can set up topics with no subscriptions:
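For example (topic name is a placeholder), a topic with no subscribers yet:

```yaml
publish:
  - name: database-events
```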
3. Deploy Experience
The deploy experience for a worker service involves (1) setting up the publishers and (2) subscribing the worker service to published events.
1. The publishing service must be deployed.
The publishing service's manifest must include the desired SNS topics and the names (or soon-to-be names) of the workers it will publish to, in order for the worker service to have access to them.
2. The worker service must be deployed.
The worker service must include the names of the publishers it subscribes to in order to actually receive messages from them. Upon completion of the deployment, the worker service will output the SQS queue URIs.
Initialization of Worker Service
Deployment of Worker Service