Particular / NServiceBus.AmazonSQS

An AWS SQS transport for NServiceBus.

Use CloudWatch Events to manage delayed messages #244

Open mauroservienti opened 6 years ago

mauroservienti commented 6 years ago

Spun off from #191

Quoted from https://github.com/Particular/NServiceBus.AmazonSQS/issues/191#issuecomment-424955637

The idea is to use CloudWatch Events to trigger the timeout. This lets an AWS-native service take ownership of the timing of the trigger, without a complex algorithm that repeatedly resends messages and without requiring FIFO queues.

A simple algorithm could be:

This would permit indefinite timeouts without FIFO queues, without satellite queues and without special algorithms to check and resend the message every 15 minutes.

danielmarbach commented 6 years ago

Here is my original internal comment from when we considered multiple options to implement native deferral:

Another angle that might be worth considering is AWS CloudWatch.

https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/ScheduledEvents.html

With CloudWatch it is possible to schedule events at a given rate or time. CloudWatch natively supports SQS and/or SNS as a target.

Caveats though:

You can create rules that self-trigger on an automated schedule in CloudWatch Events using cron or rate expressions. All scheduled events use UTC time zone and the minimum precision for schedules is 1 minute.

CloudWatch Events does not provide second-level precision in schedule expressions. The finest resolution using a cron expression is a minute. Due to the distributed nature of the CloudWatch Events and the target services, the delay between the time the scheduled rule is triggered and the time the target service honors the execution of the target resource might be several seconds. Your scheduled rule is triggered within that minute, but not on the precise 0th second.

An API is available.
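Given the one-minute precision described above, a one-off timeout would have to be expressed as a cron expression that matches a single UTC minute. The following is a minimal sketch of that conversion, not anything the transport does today. The function name `one_shot_cron` is hypothetical, and rounding up to the next whole minute is an assumption made so the rule never fires before the requested time.

```python
from datetime import datetime, timedelta, timezone


def one_shot_cron(due: datetime) -> str:
    """Build a CloudWatch Events cron expression matching one specific UTC minute.

    Hypothetical helper for illustration only.
    """
    if due.tzinfo is None:
        raise ValueError("due must be timezone-aware")
    due = due.astimezone(timezone.utc)
    # Assumption: round up to the next whole minute so the rule never fires early.
    if due.second or due.microsecond:
        due = due.replace(second=0, microsecond=0) + timedelta(minutes=1)
    # Field order: minutes hours day-of-month month day-of-week year.
    # Day-of-week must be "?" when day-of-month is specified.
    return f"cron({due.minute} {due.hour} {due.day} {due.month} ? {due.year})"


if __name__ == "__main__":
    print(one_shot_cron(datetime(2018, 10, 5, 14, 29, 30, tzinfo=timezone.utc)))
```

Because the year field is fixed, such an expression matches once and never again, so the rule would still need to be deleted after it fires to stay under the rule limit.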

It would require creating trigger rules with an SQS target. It looks like the event details are fixed, but it is possible to pass arbitrary JSON to the target. I haven't investigated further. Just an idea to consider or leave out; I leave that up to the TF.
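As a rough illustration of a rule with an SQS target and an arbitrary JSON payload, the sketch below builds the keyword arguments for boto3's `put_rule` and `put_targets` calls on the `events` client. It only constructs the requests so it can run without AWS credentials. The function name, the target id `timeouts-1`, and the `messageIds` payload shape are all assumptions made for this example.

```python
import json


def build_rule_requests(rule_name, schedule_expression, queue_arn, message_ids):
    """Return kwargs for boto3 events.put_rule and events.put_targets.

    Hypothetical helper; the payload layout is an assumption, not a defined format.
    """
    put_rule = {
        "Name": rule_name,
        "ScheduleExpression": schedule_expression,
        "State": "ENABLED",
    }
    put_targets = {
        "Rule": rule_name,
        "Targets": [
            {
                "Id": "timeouts-1",  # hypothetical target id
                "Arn": queue_arn,
                # "Input" replaces the default event JSON delivered to the target.
                "Input": json.dumps({"messageIds": list(message_ids)}),
            }
        ],
    }
    return put_rule, put_targets


# Usage against AWS (not executed here; needs boto3 and credentials):
#   import boto3
#   events = boto3.client("events")
#   rule, targets = build_rule_requests(name, expression, queue_arn, ids)
#   events.put_rule(**rule)
#   events.put_targets(**targets)
```

The target queue would also need a resource policy that allows `events.amazonaws.com` to send messages to it, which is not shown here.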

The rule management might not be trivial. We realized it would require many rules, and you are only allowed 100 rules per account per region, which makes CloudWatch a no-go.

chrissimon-au commented 6 years ago

Hi @danielmarbach - thanks for posting, that's a great analysis!

I can add some info that may or may not adjust your thinking:

100-rule limit

You can request an increase on the rule limit - I have checked and in our region (ap-southeast-2) the hard limit is 2000 rules per account. This still may be prohibitive for some use cases.

1-minute resolution

In our case, this would be acceptable, as long as the SQS deferral (with more precise resolution, I think?) was used for timeouts < 15 minutes. Once the timeout is > 15 minutes we tend not to have second-resolution requirements. So if the implementation could use native deferral if the timeout is < 15 minutes, and a CloudWatch Events rule if > 15 minutes, that would be fine for us.
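The split described here could be sketched as a simple dispatch on the requested delay. The names below are hypothetical. The 15-minute threshold is the SQS `DelaySeconds` maximum of 900 seconds, and treating a delay of exactly 15 minutes as native deferral is an assumption, since the comment only covers the cases below and above that value.

```python
from datetime import timedelta

# SQS DelaySeconds accepts at most 900 seconds (15 minutes).
SQS_MAX_DELAY = timedelta(minutes=15)


def choose_deferral(delay: timedelta) -> str:
    """Pick a deferral mechanism for a timeout (hypothetical labels)."""
    if delay <= timedelta(0):
        return "immediate"
    if delay <= SQS_MAX_DELAY:
        # Assumption: exactly 15 minutes still fits native SQS deferral.
        return "sqs-delay"
    return "cloudwatch-rule"


if __name__ == "__main__":
    for d in (timedelta(seconds=30), timedelta(minutes=15), timedelta(hours=2)):
        print(d, choose_deferral(d))
```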

Rule management

I agree it would not be trivial, but I don't think it would be too complex either. I think the key would be to consider each rule to be associated with a specific datetime (minute) rather than a specific timeout event. Each rule can have up to 5 targets associated with it, and each target can carry a payload with potentially multiple message IDs.
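To make the bucketing idea in this paragraph concrete, here is an in-memory sketch of one rule per minute, with up to five payload-carrying entries per rule and several message IDs per payload. It models only the data layout and makes no AWS calls. The class name, the rule naming scheme, and the `MAX_IDS_PER_PAYLOAD` cap are all assumptions for illustration, and seconds in the due time are simply ignored on the assumption that it was already rounded to a minute.

```python
from collections import defaultdict
from datetime import datetime, timezone

MAX_PAYLOADS_PER_RULE = 5   # per-rule limit mentioned in the comment above
MAX_IDS_PER_PAYLOAD = 100   # hypothetical cap; the real bound depends on payload size


class MinuteRules:
    """In-memory model: rule name -> list of payloads (lists of message ids)."""

    def __init__(self):
        self.rules = defaultdict(list)

    @staticmethod
    def rule_name(due: datetime) -> str:
        # One rule per UTC minute; due must be timezone-aware.
        return due.astimezone(timezone.utc).strftime("timeout-%Y%m%dT%H%M")

    def add(self, due: datetime, message_id: str) -> str:
        name = self.rule_name(due)
        payloads = self.rules[name]
        # Reuse the first payload that still has room.
        for payload in payloads:
            if len(payload) < MAX_IDS_PER_PAYLOAD:
                payload.append(message_id)
                return name
        if len(payloads) >= MAX_PAYLOADS_PER_RULE:
            raise OverflowError(f"rule {name} is full")
        payloads.append([message_id])
        return name
```

A real implementation would have to replace the dictionary with calls to the rule API and guard updates against concurrent writers, which this sketch does not attempt.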

So, as timeouts are raised, the algorithm could be:

Outbound

When adding a message id to a rule:

When adding a message id to a payload:

(There may also need to be some concurrency handling to avoid losing message ids in a payload)

Inbound

Variable resolution

To accommodate the rule limit, it may be acceptable as a compromise to scale the resolution - e.g. within 24 hours support 1 minute resolution, after 24 hours support 5 minute resolution, and after 48 hours, support 1 hour resolution - so only 24 rules would be required to support any given hour on a given day.
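The scaled resolution could look roughly like the sketch below, which rounds a due time up to the bucket size that applies to its distance from now. The tier boundaries come from the example in this paragraph. The function names are hypothetical, and rounding up (rather than to the nearest bucket) is an assumption so that a timeout never fires early.

```python
import math
from datetime import datetime, timedelta, timezone


def resolution_for(delay: timedelta) -> timedelta:
    """Bucket size for a timeout that is `delay` away (tiers from the example)."""
    if delay <= timedelta(hours=24):
        return timedelta(minutes=1)
    if delay <= timedelta(hours=48):
        return timedelta(minutes=5)
    return timedelta(hours=1)


def bucket_due_time(now: datetime, due: datetime) -> datetime:
    """Round `due` up to the next bucket boundary in UTC.

    Both arguments must be timezone-aware. Rounding up is an assumption.
    """
    step = int(resolution_for(due - now).total_seconds())
    epoch = math.ceil(due.timestamp())
    rounded = -(-epoch // step) * step  # integer ceiling to a multiple of step
    return datetime.fromtimestamp(rounded, tz=timezone.utc)


if __name__ == "__main__":
    now = datetime(2018, 10, 5, 12, 0, tzinfo=timezone.utc)
    print(bucket_due_time(now, now + timedelta(hours=30, minutes=2)))
```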

Why bother?

Our main reason for being interested in this is that we can't use FIFO queues, which are a requirement for the alternative algorithm that has already been implemented. We are looking at Hangfire for recurring activity, but we still have some use cases where NServiceBus timeouts are a better fit, for non-recurring timeouts that exceed 15 minutes.

I appreciate that if our use case is not too common, there may not be enough appetite for this :)

danielmarbach commented 6 years ago

Hi @chrissimon-au

Thanks for the input. We will definitely discuss it, but I have a hunch we'd rather wait for Amazon to support FIFO queues in your region. I pinged Amazon Support and I'm meeting with an Amazon representative tomorrow to see where they are. I'll keep you posted.

Daniel

danielmarbach commented 4 years ago

Hi @chrissimon-au

I should have followed up on this one earlier. It seems that Asia Pacific has supported FIFO queues since November 2018:

https://aws.amazon.com/about-aws/whats-new/2018/11/amazon-sqs-fifo-asia-pacific-tokyo-sydney/

Have you been able to switch to FIFO queues?

Daniel