elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash

New queue type to configurably drop events to avoid upstream blocking from downstream back pressure #11601

Open geekpete opened 4 years ago

geekpete commented 4 years ago

We already have the Persistent Queue (PQ), but that queue type still blocks when it's full.

Another queue type for the purpose of avoiding back pressure in exchange for discarding events would be useful for a number of use cases.

It'd probably be better to implement a new queue type to keep the focus simple, rather than trying to extend the PQ for this purpose, though do see the related PQ issue included at the end of this issue.

Use cases that could benefit from this queue type might include the patterns described in pipeline-to-pipeline docs:

Real world examples of some patterns might be:

Configuration that might be nice to have:

There's potentially a lot of functionality that could be built here that dedicated message brokers already provide and more, but starting simpler, with the three functions for discarding events when the queue hits full, from either:

would cover a lot of cases to begin with, without going over the top on functionality in the first iteration of this feature.
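To make the idea concrete, a drop-on-full queue could look roughly like this minimal Ruby sketch. The class name and the drop_newest/drop_oldest policies are illustrative only, not existing Logstash internals:

```ruby
require 'thread'

# Hypothetical sketch of a bounded queue that sheds load instead of
# blocking the producer (i.e. no back pressure propagates upstream).
class DroppingQueue
  attr_reader :dropped

  def initialize(capacity, policy: :drop_newest)
    @capacity = capacity
    @policy   = policy
    @items    = []
    @mutex    = Mutex.new
    @dropped  = 0
  end

  # Returns false when the incoming event was discarded.
  def push(event)
    @mutex.synchronize do
      if @items.size >= @capacity
        @dropped += 1
        return false if @policy == :drop_newest # discard incoming event
        @items.shift                            # :drop_oldest - evict head
      end
      @items << event
      true
    end
  end

  def pop
    @mutex.synchronize { @items.shift }
  end

  def size
    @mutex.synchronize { @items.size }
  end
end
```

A `dropped` counter like the one above would also give operators a metric to alert on when events start being shed.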

A related issue:

aalleexxx5 commented 3 years ago

I needed to remove backpressure but didn't find an elegant solution, so I wanted to leave my current ~hack~ solution here in case someone else has the same issue. It uses the stats API in a pipeline to check the queue size for each event, and then drops events if the queue is above a threshold.

The Logstash stats API only updates once every 5 seconds, which means the pipeline can completely empty its queue before the reported size updates, so I need a very healthy margin on the queue size threshold.
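The margin that the 5-second refresh forces can be estimated with some rough arithmetic. The ingest rate below is an assumption for illustration, not a measured number:

```ruby
# If the stats API refreshes every 5 s and inputs can push up to
# 10_000 events/s (assumed), the queue can grow by 50_000 events
# between refreshes, so the drop threshold needs at least that much
# headroom below the hard capacity.
refresh_interval_s = 5
peak_ingest_rate   = 10_000   # events/s (assumption)
capacity           = 644_245  # events, matching the config below

margin    = refresh_interval_s * peak_ingest_rate
threshold = capacity - margin
puts "drop above #{threshold} events (headroom: #{margin})"
```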

How to drop events based on the queue's event count:

input {
    pipeline {
        address => "cloudQueue"
    }
}

filter {
    http {
        url => "http://localhost:9600/_node/stats/pipelines/cloud_queue"
        target_body => "[@metadata][stats]"
        target_headers => "[@metadata][statsHeaders]"
        add_field => {"queueSize" => "%{[@metadata][stats][pipelines][cloud_queue][queue][events]}" }
    }
    mutate {
        convert => {
            "queueSize" => "integer"
        }
    }
}
output {

# Queue size limit
# Assuming 2000 bytes per event, 644245 events take up about 1.2 GiB.
# We need a healthy margin, since 2000 bytes is an approximation and the stats only update every 5 seconds.

    if [queueSize] and [queueSize] < 644245 {
        pipeline {
            ensure_delivery => true
            send_to => ["cloudConnect"]
        }
    } else {
        # Events reaching this branch are dropped.
        # Uncomment to view the dropped events:
        # stdout {
        #     codec => "rubydebug"
        # }
    }
}
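As a sanity check on the threshold in the comments above, the 644245 figure is just 1.2 GiB divided by the assumed 2000 bytes per event:

```ruby
# How many ~2000-byte events fit in 1.2 GiB?
gib = 1024 ** 3
events = (1.2 * gib / 2000).floor
puts events  # 644245
```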
SelAnt commented 2 years ago

Hi, here's a slightly improved version of aalleexxx5's solution: it uses Ruby to query the queue size every 10 seconds (rather than for every event). It will stop shipping events if the 'sandbox' queue has more than 1000 events.
pipelines.yml

filter{
  ruby {
    # Background thread polls the stats API every 10 seconds and caches
    # the sandbox queue size in a global; -1 signals "unknown".
    init => "
      require 'uri'; require 'net/http'; require 'json'
      $sandbox_queue_size = -1
      Thread.new do
        while true do
          uri = URI('http://localhost:9600/_node/stats/pipelines')
          res = Net::HTTP.get_response(uri)
          if res.is_a?(Net::HTTPSuccess) then
            json = JSON.parse(res.body)
            $sandbox_queue_size = json['pipelines']['sandbox'] ? json['pipelines']['sandbox']['queue']['events_count'] : -1
          else
            $sandbox_queue_size = -1
          end
          if $sandbox_queue_size && $sandbox_queue_size < 0 then
            puts 'Cannot get Sandbox queue size'
          end
          sleep 10
        end
      end
    "
    code => 'event.set("sandbox_queue_size", $sandbox_queue_size)'
  }
}
output {
  if [sandbox_queue_size] and [sandbox_queue_size] >= 0 and [sandbox_queue_size] < 1000 {
    pipeline {
      send_to => ["sandbox"]
    }
  }
}
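The lookup buried in the `init` string above can be unpacked into plain Ruby for easier auditing. The function names here are illustrative; the JSON path (`pipelines.<name>.queue.events_count`) and the -1 "unknown" sentinel match the filter above:

```ruby
require 'json'
require 'net/http'
require 'uri'

# Extract a pipeline's queue depth from a stats API response body;
# returns -1 when the pipeline or field is missing or the body is not JSON.
def parse_queue_size(body, pipeline_name)
  stats = JSON.parse(body)
  count = stats.dig('pipelines', pipeline_name, 'queue', 'events_count')
  count.nil? ? -1 : count
rescue JSON::ParserError
  -1
end

# Fetch and parse in one step (performs a network call to the local
# Logstash stats API).
def fetch_queue_size(pipeline_name, host: 'localhost', port: 9600)
  res = Net::HTTP.get_response(URI("http://#{host}:#{port}/_node/stats/pipelines"))
  res.is_a?(Net::HTTPSuccess) ? parse_queue_size(res.body, pipeline_name) : -1
rescue StandardError
  -1
end
```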