DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

Throttles on retry lambda consume actual attempts #3708

Open hannes-ucsc opened 2 years ago

hannes-ucsc commented 2 years ago

Follow-up from https://github.com/DataBiosphere/azul/issues/3703#issuecomment-999196861

Refeeding several hundred failed notifications caused a high degree of throttling. A throttle occurs when AWS's internal SQS-Lambda integration machinery fetches more messages than can be handled under a given Lambda concurrency limit (64 in this case). The docs state that the machinery has five threads pulling from the queue in parallel. There is evidence for that in the "Empty receives" metric for SQS: when the system is idle, we observe 15 = 5 × 60 / 20 empty receives per queue per minute, because each thread uses 20s long polling and therefore produces three empty receives per minute. The docs also state that the number of polling threads is ramped up and down on demand at a certain rate (up faster than down).

If there are more than 64 notifications in the retry queue, the machinery will receive all of them but will only be able to allocate Lambda executions for 64. The rest are returned to the queue with an incremented receive count that doesn't reflect the number of actual attempts to process the message. This is an unfortunate design choice by AWS; they could easily return the message without incrementing the receive count. Since the receive count is capped at 9 for the retry queues, we don't get an honest 9 attempts. We observed messages ending up in the fail queue after only four honest attempts. The hypothesis is that this occurs when the number of polling threads is still high from a previous reindex and hasn't yet come down.

The solution may be to refeed no more than 64 messages at a time. This would require additional batching capability in the manage_queue.py script.
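The batching idea could be sketched roughly as follows. This is only an illustration, not the actual manage_queue.py interface: the function names, the `sqs` boto3 client parameter, and the drain-polling strategy are all assumptions. The core idea is to send at most 64 messages (the Lambda's reserved concurrency) to the retry queue at a time, and wait for the queue backlog to drain before sending the next batch, so the SQS-Lambda pollers never fetch more messages than there are available executions.

```python
import time

# Reserved concurrency of the retry Lambda; messages beyond this would
# be throttled and returned to the queue with a bumped receive count.
CONCURRENCY_LIMIT = 64


def batches(messages, size=CONCURRENCY_LIMIT):
    """Split a list of messages into chunks of at most `size` items."""
    return [messages[i:i + size] for i in range(0, len(messages), size)]


def refeed(sqs, fail_queue_url, retry_queue_url, messages, poll_interval=10):
    """Hypothetical batched refeed: move `messages` from the fail queue
    to the retry queue, one batch of at most CONCURRENCY_LIMIT at a time.
    `sqs` is assumed to be a boto3 SQS client."""
    for batch in batches(messages):
        for msg in batch:
            sqs.send_message(QueueUrl=retry_queue_url,
                             MessageBody=msg['Body'])
            sqs.delete_message(QueueUrl=fail_queue_url,
                               ReceiptHandle=msg['ReceiptHandle'])
        # Wait until the retry queue is approximately empty (no visible
        # or in-flight messages) before releasing the next batch.
        while True:
            attrs = sqs.get_queue_attributes(
                QueueUrl=retry_queue_url,
                AttributeNames=['ApproximateNumberOfMessages',
                                'ApproximateNumberOfMessagesNotVisible']
            )['Attributes']
            if sum(int(v) for v in attrs.values()) == 0:
                break
            time.sleep(poll_interval)
```

With 400 messages, `batches` would yield six full batches of 64 plus a final batch of 16, so no batch exceeds the concurrency limit. Note that the `Approximate*` queue attributes are eventually consistent, so a real implementation might need a more conservative drain check.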

[Screenshot showing an example that occurred while refeeding 400 messages to the retry notifications queue. The system made only three honest attempts; the rest were throttles.]

melainalegaspi commented 2 years ago

Assignee to spike on a one-paragraph design for how this (re-feeding in batches of 64) should work.

melainalegaspi commented 2 years ago

Cancelling spike due to assignee being overloaded.