elastic/integrations

Optimize the parsing of Elasticsearch gc logs #6173

Open · joegallo opened this issue 1 year ago

joegallo commented 1 year ago

The grok processor that makes up the meat of the "Pipeline for parsing Elasticsearch JVM garbage collection logs" is a bit expensive.

Here's an example from a cluster I ran into that uses a version of that pipeline (~1 billion documents processed over the last two weeks).

```
Top Processors:
===============
                            index       count  time_in_millis  time_in_nanos  percent
pipeline         processor
some-gc-pipeline grok           1  1154824594       102744728          88970    71.7%
some-gc-pipeline date           3  1154823898        13141368          11380     9.2%
some-gc-pipeline set            0  1154824601         5974460           5174     4.2%
[...]
```

71.7% of the time that this cluster spends in any processor at all is spent in that single processor. (Which is a nice headline, but it's not like this cluster does all that much with ingest pipelines besides process gc logs, so I don't want it to be misinterpreted.)

Anyway, this seems like something somebody could choose to reimplement with higher performance in mind. Perhaps some combination of dissect and/or script processors could accomplish the same goal without the performance cost?

/cc @elastic/infra-monitoring-ui due to CODEOWNERS
/cc @elastic/cloud-observability because this might be relevant to your interests

miltonhultgren commented 1 year ago

@joegallo Can you share some more information about how you produced those performance numbers?

Some quick thoughts: the current Grok processor has 6 patterns (some of which have a sub-pattern for the JVM version used), though perhaps some could be collapsed. It might help to split the pipeline into two pipelines, one for JVM 8 and one for JVM 9.
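
For illustration only (the pipeline names and the version check below are assumptions, not what the integration actually ships), that split could be driven by `pipeline` processors with conditions:

```
PUT _ingest/pipeline/elasticsearch-gc-router
{
  "processors": [
    {
      "pipeline": {
        "description": "JVM 9+ unified logging lines start with a bracketed decorator",
        "if": "ctx.message != null && ctx.message.startsWith('[')",
        "name": "elasticsearch-gc-jvm9"
      }
    },
    {
      "pipeline": {
        "description": "Everything else is assumed to be the JVM 8 format",
        "if": "ctx.message != null && !ctx.message.startsWith('[')",
        "name": "elasticsearch-gc-jvm8"
      }
    }
  ]
}
```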

One way forward could be to replace each pattern with a conditional dissect processor. Each dissect pattern would likely be simpler and faster, but the challenge will lie in finding the right condition without making the cost of the cascading conditions more expensive than the original Grok pattern.

If the first processor is the one for the most commonly shaped log line, we could make the dissect write to a special marking field. The next dissect processor can then first check if that field is already filled out and hopefully short-circuit the rest of the condition so it falls through (the same would apply for the following patterns). That way, we'd only take the hit of evaluating all the more costly conditions for the rarer shapes of log lines.

There is a catch-all pattern at the end of the Grok processor, so we'd have to make sure the multi-processor dissect path also has one like that at the end.
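
As a rough sketch of that cascade (the pipeline name, field names, and dissect patterns below are invented for illustration; the presence of an extracted field stands in for the marking field here, though a dedicated `set` processor would work just as well):

```
PUT _ingest/pipeline/some-gc-pipeline-dissect
{
  "processors": [
    {
      "dissect": {
        "description": "Most common line shape first (made-up pattern)",
        "field": "message",
        "pattern": "[%{gc_timestamp}][%{level->}][%{tags}] %{msg}",
        "ignore_failure": true
      }
    },
    {
      "dissect": {
        "description": "Rarer shape; only attempted when nothing has matched yet",
        "if": "ctx.msg == null",
        "field": "message",
        "pattern": "%{gc_timestamp}: %{msg}",
        "ignore_failure": true
      }
    },
    {
      "grok": {
        "description": "Catch-all at the end, mirroring the fallback in the current Grok processor",
        "if": "ctx.msg == null",
        "field": "message",
        "patterns": ["%{GREEDYDATA:msg}"]
      }
    }
  ]
}
```

The point being that once the first dissect matches, every later `if` is a cheap null check, so the common case pays for one dissect plus a couple of trivial conditions.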

felixbarny commented 1 year ago

> One way forward could be to replace each pattern with a conditional dissect processor. Each dissect pattern would likely be simpler and faster, but the challenge will lie in finding the right condition without making the cost of the cascading conditions more expensive than the original Grok pattern.

If all logs start with the same prefix, you could also do a two-stage process of dissecting the prefix first and then having conditional dissect processors for each variant to parse the rest.
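
A loose sketch of that two-stage idea, again with made-up field names and patterns rather than the integration's real ones:

```
PUT _ingest/pipeline/some-gc-pipeline-two-stage
{
  "processors": [
    {
      "dissect": {
        "description": "Stage 1: peel off the shared prefix, leaving the variant-specific remainder in 'rest'",
        "field": "message",
        "pattern": "[%{gc_timestamp}][%{level->}][%{tags}] %{rest}"
      }
    },
    {
      "dissect": {
        "description": "Stage 2: one conditional dissect per variant, working only on the remainder",
        "if": "ctx.rest != null && ctx.rest.startsWith('GC(')",
        "field": "rest",
        "pattern": "GC(%{gc_id}) %{gc_description}"
      }
    }
  ]
}
```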

If adding support for multiple dissect expressions would help here, I'm sure that wouldn't be too hard to add.

@joegallo To replace grok expressions like `\[%{DATA:log.level}%{SPACE}\]` with a dissect expression like `[%{log.level->}]`, it would also be useful if the dissect processor had a trim option. For reference, the dissect processor in Filebeat has that option.

joegallo commented 1 year ago

> @joegallo Can you share some more information about how you produced those performance numbers?

It's just the numbers from `GET _nodes/stats` for the cluster in question (link). For each processor in every pipeline running on every node, we report the number of invocations for that processor and the total time spent (reported in milliseconds). I just ran the math to summarize it across the cluster (total invocations and total time spent in this one processor on any node at all) and then compared it with the total time spent for all processors on all nodes.
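
For anyone who wants to reproduce that kind of summary, the raw per-processor counters can be pulled with something like the following (the `filter_path` just trims the response down to the per-pipeline ingest stats):

```
GET _nodes/stats/ingest?filter_path=nodes.*.ingest.pipelines
```

Each processor entry in that response reports `count` and `time_in_millis` (plus `current` and `failed`) per node, which is what the table above was derived from.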

mohibrahmani commented 1 year ago

Hi all, this issue was raised as an important case for our very strategic customer, who has been in executive escalation for the past 4 months dealing with a number of performance issues around their Stratosphere project. While on the call, Joe G. (cc'd) heard first hand the level of performance issues they are seeing, and the node stats were shared with Joe afterwards. I'd like to request prioritization for this case if possible, along with a loose roadmap plan that could be communicated to the customer. Appreciate the partnership.

tylerperk commented 1 year ago

> @joegallo To replace grok expressions like `\[%{DATA:log.level}%{SPACE}\]` with a dissect expression like `[%{log.level->}]`, it would also be useful if the dissect processor had a trim option. For reference, the dissect processor in Filebeat has that option.

@felixbarny Do you know if there's an open issue for adding trim to the dissect ingest processor? I didn't find one.

felixbarny commented 1 year ago

I don't think there is. At least I'm not aware of an open issue for that.