MetPX / sarracenia

https://MetPX.github.io/sarracenia
GNU General Public License v2.0

how to apply backpressure #1007

Open petersilva opened 6 months ago

petersilva commented 6 months ago

was discussing with @reidsunderland a situation where, if a node in a cluster fails, we want the sender to that node to stop consuming, letting other consumers of the shared queue take over the entire load while this sender's downstream is broken.

https://en.wikipedia.org/wiki/Backpressure_routing

In v2, backpressure was applied naturally, since we processed one message at a time and simply retried delivering or downloading the same item forever. If a delivery kept failing, we would never loop back to consume more from the queue.
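The v2 behaviour described above can be sketched roughly as follows (names here are illustrative, not v2's actual internals): the consumer blocks on the failing item, so it never pulls new work while delivery is broken.

```python
import time

def v2_style_loop(messages, deliver, retry_delay=0.0):
    """Process messages one at a time; block on a failing delivery
    instead of moving on. While stuck retrying, nothing more is
    consumed from the queue, which is backpressure for free."""
    for msg in messages:
        while not deliver(msg):      # retry the same item forever
            time.sleep(retry_delay)  # upstream sees no new consumption
```

Because the loop never advances past a failing item, other consumers of the shared queue naturally absorb the load.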

With sr3, we have both download_retry and post_retry queues, so that if individual transfers fail, we can keep going. The problem is that those retry queues can accumulate millions of files, and a failing node may even consume from the shared queue faster than healthy ones, because failures can be quicker to process than successful deliveries.

So... there needs to be some criterion for deciding when processing is going badly enough to stop consuming... that is, to apply backpressure on the upstream side.
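One possible criterion (a sketch, not sr3's actual implementation): stop consuming whenever the local retry backlog climbs above a high-water mark, and resume once it drains below a low-water mark. The hysteresis between the two thresholds avoids rapid pause/resume flapping.

```python
class BackpressureGate:
    """Decide whether to keep consuming from the shared queue,
    based on the size of the local retry backlog. Hypothetical
    helper, names are not sr3 API."""

    def __init__(self, high_water=1000, low_water=100):
        self.high_water = high_water  # pause consuming above this backlog
        self.low_water = low_water    # resume consuming below this one
        self.paused = False

    def should_consume(self, retry_backlog):
        # hysteresis: only flip state at the two watermarks
        if self.paused and retry_backlog <= self.low_water:
            self.paused = False
        elif not self.paused and retry_backlog >= self.high_water:
            self.paused = True
        return not self.paused
```

While paused, the instance stops pulling from the shared queue, so healthy nodes sharing that queue pick up the messages instead.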

petersilva commented 6 months ago

ideas:

petersilva commented 6 months ago
reidsunderland commented 6 months ago

Those ideas sound good.

Something we might also want to think about is acking the messages. If we knew there was another node that could process the message/send the file, maybe it's better not to use the local retry queues at all, and just not ack (or explicitly nack) messages when the transfer fails?
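That idea could look something like this (a minimal sketch, assuming AMQP-style ack/nack callbacks such as pika's basic_ack / basic_nack with requeue; the names here are illustrative): on failure, the message is handed back to the broker instead of a local retry queue, so another consumer of the shared queue can take it.

```python
def handle(msg, transfer, ack, nack):
    """Deliver one message; rely on the broker, not a local retry
    queue, to redistribute failures. ack/nack are broker callbacks
    (e.g. basic_ack / basic_nack with requeue=True)."""
    if transfer(msg):
        ack(msg)    # success: remove the message from the shared queue
    else:
        nack(msg)   # failure: requeue it for some other consumer
```

The trade-off is that a requeued message may bounce back to the same broken node; combined with pausing consumption (backpressure), it would land on a healthy one.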

In that case, we might want sr3 to automatically reduce messageRateMax, and automatically reset it once the instance detects that the destination problem has been resolved.
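One way that automatic adjustment could behave (hypothetical, AIMD-style, not an existing sr3 feature): back off sharply on each failure, and creep back toward the configured messageRateMax as deliveries start succeeding again.

```python
class AdaptiveRate:
    """Hypothetical throttle: multiplicative decrease on failure,
    gradual recovery on success, capped at the configured limit."""

    def __init__(self, rate_max=100.0, floor=1.0):
        self.configured_max = rate_max  # the configured messageRateMax
        self.floor = floor              # never throttle below this
        self.rate = rate_max            # current effective limit

    def on_failure(self):
        self.rate = max(self.floor, self.rate / 2)  # back off quickly

    def on_success(self):
        # recover gradually, never above the configured ceiling
        self.rate = min(self.configured_max, self.rate * 1.1)
```

The multiplicative decrease reacts quickly when the destination goes down, while the slow additive-style recovery avoids re-flooding a destination that has only just come back.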