slow performance pipeline

avdhoot commented 7 years ago

Problem description

on basis of hostname i am routing messages to stream. Attached 'default' stream to 'stream decider' pipeline. which have below rules. currently graylog processing 800msg/sec. Input/output buffer 0% and process buffer is 100%. if detach default stream from pipeline graylog processing 8000msg/sec avg.

Note: I do not see any error message in GS/ES logs.

rule "aws cloud trail logs"
when
  contains(to_string($message.source), "aws-cloudtrail")
then
  route_to_stream(name: "aws cloudtrail [somencprod]");
end

rule "logs from abc.xvz.com"
when
  regex(pattern: "((n|p|e)\\d+-c\\d+|(bg|cm|nw)\\d+)-abc.xvz.com", value: to_string($message.source)).matches == true
then
    route_to_stream(name: "some prod logs [abc.xvz.com]");
end

rule "logs from cstaging env"
when
  regex(pattern: "((n|p|e)\\d+-c\\d+|(bg|cm|nw)\\d+)-(eval.*|cstaging.*).xvz.com", value: to_string($message.source)).matches == true
then
    route_to_stream(name: "some cstaging logs [cstaging2, evals]");
end

Environment

Graylog Version: Graylog 2.1.0-beta.3+b817da4/docker
Pipeline Processor plugin version: 1.1.0-beta.3
Operating System: ubuntu 14.04

tokred commented 7 years ago

I am not a developer, so just some ideas irrespective of potential (performance) bugs:

I would suggest to run a simulation via "Simulate processing". After loading an example message, select "More results" - "Simulation trace" on the right side of "Simulation results". There you can see which processing steps consume how much time.

Maybe your regex pattern are inefficient. Example (without knowing your exact string format):

Original: ((n|p|e)\d+-c\d+|(bg|cm|nw)\d+)-(eval.*|cstaging.*).xvz.com
Improved: ^((n|p|e)\d+-c\d+|(bg|cm|nw)\d+)-(eval[^.]*|cstaging[^.]*)\.xvz\.com

Try your patterns on regex101.com where you can see the required execution steps. To improve regex performance, think about your input set and which criteria really "make a difference", i.e. enable to make a decision.

Another idea: After looking at the route_to_stream source code, I think that using stream IDs instead of names should be faster for larger numbers of streams.

Regards

kroepke commented 7 years ago

Yes, the stream name lookup isn't terribly efficient at the moment. We are aware of it, but time constraints prevented us addressing the performance issues in this release. As @tokred said, first thing I'd do is to optimize the regexes, and replace stream names with their IDs, as painful as that currently is.

In general though, the pipelines are currently interpreted, which in itself is slower and the stream router was heavily optimized so it will be faster for this kind of matching in 2.0 and 2.1.

For 2.2 we plan to aggressively address bugs and performance of the pipelines, so unless you run into stream rule conditions that cannot be expressed in the current stream router, I'd advise to avoid switching to pipeline rules for routing just yet.

That being said, we do appreciate all feedback on the rules, because they will become the underlying engine for computation over the next few releases.

Thanks!

avdhoot commented 7 years ago

Thanks @tokred @kroepke for quick reply. As suggested I changed below things.

use stream ID: using stream id instead of name increase msg processing from 800 to 1500msg/sec.
Rgex changes: after changing regx pattern I do not seen any visible performance improvement. But as suggested regex pattern taking less steps compared to original one in regex101.com test.

Below are the result from 'Simulate processing in stream'. in that 'Execute Rule' taking most of time.

1 μs
Starting message processing
22 μs
Message 36c51c55-6469-11e6-abcf-0242ac120002 running [Pipeline 'stream decider' (57ab34dc24aa9a0001431376)] for default stream
67 μs
Enter Stage 0
74 μs
Evaluate Rule 'aws cloud trail logs' (57ab384024aa9a000143170b) in Pipeline 'stream decider' (57ab34dc24aa9a0001431376)
83 μs
Evaluation not satisfied Rule 'aws cloud trail logs' (57ab384024aa9a000143170b) in Pipeline 'stream decider' (57ab34dc24aa9a0001431376)
86 μs
Evaluate Rule 'logs from cstaging env' (57ab5e7524aa9a0001433f3a) in Pipeline 'stream decider' (57ab34dc24aa9a0001431376)
97 μs
Evaluation not satisfied Rule 'logs from cstaging env' (57ab5e7524aa9a0001433f3a) in Pipeline 'stream decider' (57ab34dc24aa9a0001431376)
100 μs
Evaluate Rule 'logs from xyz.abc.com' (57ab5ad524aa9a0001433b67) in Pipeline 'stream decider' (57ab34dc24aa9a0001431376)
113 μs
Evaluation satisfied Rule 'logs from xyz.abc.com' (57ab5ad524aa9a0001433b67) in Pipeline 'stream decider' (57ab34dc24aa9a0001431376)
118 μs
Execute Rule 'logs from xyz.abc.com' (57ab5ad524aa9a0001433b67) in Pipeline 'stream decider' (57ab34dc24aa9a0001431376)
4,958 μs  <<<-------
Completed Stage 0 for Pipeline 'stream decider' (57ab34dc24aa9a0001431376), continuing to next stage
4,962 μs
Exit Stage 0
4,977 μs
Message 36c51c55-6469-11e6-abcf-0242ac120002 running [] for streams []
4,984 μs
Finished message processing

kroepke commented 7 years ago

@avdhoot Thanks for the update and the numbers!

This looks like loading the stream definition is still too expensive. I don't think we'll be able to get a fix into 2.1, but I'll take a look at it tomorrow anyway, maybe there's an easy win.

kroepke commented 7 years ago

For future reference, we either need a cache for the name -> stream mapping or a faster service that listens for changes to streams without going to the database.

liukd commented 7 years ago

I met a problem, the hope can help to solve it. I want to build a development environment but execute the following command, and I was not successful

git clone https://github.com/Graylog2/graylog2-server.git cd graylog2-server/graylog2-web-interface ln -s $YOURPLUGIN plugin/ npm install && npm start

gruselglatz commented 7 years ago

i've had some simmilar issues with filled up processing buffer in combination with pipelines and so on. But my real issue was that my jvm heap was way too small. After i changed it to 15GB I am processing 3k to 5k Messages through my pipeline.

My sysload is: load1: 1,32, load5: 1,45, load15: 1,56 Graylog is running very fast on 24 CPU cores at the moment and is using only 16GB to 18GB of Memory.

We also have seen that having a dedicated ES-Cluster is way more important than tuning Graylog itself. ES gets very Hardware heavy over the time.

Androider03 commented 7 years ago

Hi, I'm experiencing a very similar issue. Graylog is running on a 8 cores, 16 GB RAM machine. After a flood of about 500,000 messages, the process buffer is full (size=65536), while the input and output buffers are empty. Metrics associated to the process buffer:

outgoing messages about 10-20 events/second
incoming messages about 4-10 events/second (I stopped the source that sends logs since 2,5 hours ago) I increased JVM's minimum heap size to 3 GB and its maximum size to 4 GB. Elastic search has 6 GB of heap memory. Also, I checked all the pipelines I have and each one of them performs just a string matching on a single attribute (no regex). 4 cores are running at 100% but the logs are still getting processed too slowly.

Graylog2 / graylog-plugin-pipeline-processor

slow performance pipeline #87

Problem description

Environment