elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash
Other
14.28k stars 3.51k forks source link

[Monitoring] Split filters lead to misleading pipeline event out count #12746

Open lucabelluccini opened 3 years ago

lucabelluccini commented 3 years ago

Bug

Global pipeline in/out event count stats are wrong when the split filter is involved.

To recap:

I decided to file it on Logstash as the split plugin reports the correct in/out event count stats.

Affected versions

Both 6.x and 7.x, with different behaviors.

Reproducer

$ echo 'input{ file{ path=>"/tmp/test.json" start_position=>beginning codec=>json } } filter{ split{ field=>"split_id" } } output { stdout {} }' > /etc/split_test.conf
$ echo '{"message":"test","split_id":[1,2]}' > /tmp/test.json
$ ./logstash -f /tmp/split_test.conf --log.level=debug

Result

Behavior on 6.8.8:

  "version" : "6.8.8",
...
  "pipelines" : {
    "main" : {
      "events" : {
        "queue_push_duration_in_millis" : 0,
        "out" : 3,
        "filtered" : 3,
        "in" : 1,
        "duration_in_millis" : 89
      },
      "plugins" : {
        "inputs" : [ {
          "id" : "10450aa3176842c1a2e7724e798379cd6edd83e8d0122798d7abeff870d66075",
          "name" : "file",
          "events" : {
            "queue_push_duration_in_millis" : 0,
            "out" : 1
          }
        } ],
        "filters" : [ {
          "id" : "6a8419f80370e2264f44de0d2fc603fe5173d581069079d991228568a490aa82",
          "name" : "split",
          "events" : {
            "out" : 2,
            "in" : 1,
            "duration_in_millis" : 14
          }
        } ],
        "outputs" : [ {
          "id" : "fdbbe97b7145e6aff04b953b8e7b491d91b8058089fcc223ca2280e181665cbd",
          "name" : "stdout",
          "events" : {
            "out" : 2,
            "in" : 2,
            "duration_in_millis" : 67
          }
        } ]
      },

On 7.8.1 (also on 7.11):

  "version" : "7.8.1",
 ...
  "pipelines" : {
    "main" : {
      "events" : {
        "duration_in_millis" : 78,
        "filtered" : 1,
        "out" : 1,
        "queue_push_duration_in_millis" : 0,
        "in" : 1
      },
      "plugins" : {
        "inputs" : [ {
          "id" : "039684c7658a0eab764c514e5518d938833ce316571e6f65ec0ac8e430a4f61f",
          "events" : {
            "out" : 1,
            "queue_push_duration_in_millis" : 0
          },
          "name" : "file"
        } ],
        "codecs" : [ {
          "id" : "json_7cbf8c0e-a623-4c8a-8218-75380fe0c11e",
          "encode" : {
            "duration_in_millis" : 0,
            "writes_in" : 0
          },
          "name" : "json",
          "decode" : {
            "duration_in_millis" : 0,
            "out" : 0,
            "writes_in" : 0
          }
        }, {
          "id" : "rubydebug_666758b6-1c6b-42f6-9412-aabc3ffa568a",
          "encode" : {
            "duration_in_millis" : 66,
            "writes_in" : 2
          },
          "name" : "rubydebug",
          "decode" : {
            "duration_in_millis" : 0,
            "out" : 0,
            "writes_in" : 0
          }
        } ],
        "filters" : [ {
          "id" : "60c63f53dd63c8ae590f214ac4882eae753766d259d33aa10981c13345f4cb09",
          "events" : {
            "duration_in_millis" : 17,
            "out" : 2,
            "in" : 1
          },
          "name" : "split"
        } ],
        "outputs" : [ {
          "id" : "b1a176e5a63533c0b6675c591a4202fd808f87d76da3f3116a2d01718abb7b40",
          "events" : {
            "duration_in_millis" : 74,
            "out" : 2,
            "in" : 2
          },
          "name" : "stdout"
        } ]
      },

On 6.x: main.events.out = 3 On 7.x: main.events.out = 1

yaauie commented 3 years ago

🤔 This appears to be a difference between the legacy Ruby Execution Engine (default in 6.x; in 7.x use --java-execution false) and the modern Java Execution Engine (default in 7.x; in 6.x use --experimental-java-execution true).

In the docs, we direct people to determine lag in their pipelines by subtracting events.out from events.in. For pipelines with memory queues this will never be more than pipeline.workers * queue.batch.size, but with persistent queue this number can be exceptionally valuable at determining whether a pipeline is generally keeping up or falling behind.

In simple pipelines, these two numbers can be equivalent, but in complex pipelines that add or cancel events, or conditionally route those events to outputs, I believe that the modern JEE implementation provides a more actionable metric. We do provide concrete numbers for each of the pipeline outputs to indicate how many non-cancelled events it has completed so if a user is looking for that data, it is available.

I'm not sure what the best solution is here.