elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[Logs+] Enable JSON parsing for logs by default #96651

Open felixbarny opened 1 year ago

felixbarny commented 1 year ago

In https://github.com/elastic/elasticsearch/pull/96083, we've added a JSON parsing pipeline that is currently not enabled by default.

We didn't enable JSON parsing by default because it comes with a risk of mapping explosions and object/scalar conflicts.
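For context, here's a minimal sketch of what such a conditional parsing pipeline can look like, built on the stock `json` processor. The pipeline name and options are assumptions for illustration; the actual pipeline from #96083 may differ, though it does guard on the message starting with `{` (see the benchmarking discussion below).

```json
PUT _ingest/pipeline/logs-json-message-sketch
{
  "description": "Sketch: parse the message field as JSON only when it looks like a JSON object",
  "processors": [
    {
      "json": {
        "field": "message",
        "add_to_root": true,
        "if": "ctx.message instanceof String && ctx.message.startsWith('{')"
      }
    }
  ]
}
```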

This issue tracks the blockers for enabling JSON parsing by default:

- [ ] https://github.com/elastic/elasticsearch/issues/96648
- [ ] https://github.com/elastic/elasticsearch/issues/88934
- [ ] https://github.com/elastic/logs-dev/issues/23
- [ ] #96235

Example scenario: users are onboarding k8s logs into a single data stream. This data stream currently has a relatively fixed schema, but once we start parsing JSON out of the message field, we no longer have control over the schema. A single service may consume the entire field limit and impact ingestion from other services, as illustrated below.
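To make that concrete, consider this hypothetical document (made up for illustration, not from the issue):

```json
{
  "@timestamp": "2023-06-06T12:00:00.000Z",
  "kubernetes": { "container": { "name": "checkout" } },
  "message": "{\"user_id\": 42, \"feature_flags\": {\"exp_1234\": true, \"exp_1235\": false}}"
}
```

Once the message is parsed into the document root, `user_id`, `feature_flags.exp_1234`, and `feature_flags.exp_1235` are dynamically mapped in the shared data stream. If keys like `exp_*` vary per request, every new key adds another field mapper until `index.mapping.total_fields.limit` (1000 by default) is exhausted for everyone writing to that data stream.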

While we could send the data from different containers to different data streams, we can't do that without the user opting in (changing the data stream would be a breaking change), and there may be performance regressions because many more data streams get created. And while routing to different data streams helps limit the blast radius of mapping explosions, enabling JSON parsing by default is still risky, as it can lead to document loss within a single service.
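For reference, such per-container routing could be expressed with the `reroute` processor; this is just a sketch of the opt-in variant discussed above, and using the container name as the dataset is an assumption:

```json
{
  "processors": [
    {
      "reroute": {
        "if": "ctx.kubernetes?.container?.name != null",
        "dataset": "{{kubernetes.container.name}}"
      }
    }
  ]
}
```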

cc @eyalkoren

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-data-management (Team:Data Management)

dakrone commented 1 year ago

I think we should add another blocking task for this: benchmarking the performance impact of enabling JSON parsing by default for both non-JSON logs (the impact should be negligible, though we should double-check) and JSON logs.

It may also be useful, if we plan to make use of this in many integrations, to establish a baseline in nightly benchmarks so that we can track its performance over time.

felixbarny commented 1 year ago

> non-JSON (should be negligible though we should double check)

As the pipeline contains a `startsWith('{')` check, the overhead for non-JSON logs should indeed be negligible. However, as @jpountz noted, when comparing against a baseline that doesn't contain an ingest pipeline at all, the overhead is probably noticeable. That's because the ingested doc needs to be deserialized into an IngestDocument and, after processing, serialized back so it can be transferred to the node that's responsible for indexing it (where it will be deserialized again).

tonyghiani commented 1 month ago

@felixbarny what is the current status of this issue? I believe it is closely related to the logs structuring project; we have a PoC in place that also includes detecting/recommending JSON parsing on datasets.

There is a quick demo of the PoC in this slide (to be presented this week) that shows what the detection looks like and lets the user apply the JSON pipeline on demand.

Let me know if you think this might benefit this task too, or if the main objective is still to set the pipeline by default, which would remove the need for this detection/recommendation in the structuring flow.

felixbarny commented 1 month ago

I'm not currently working on this, and I'm also not sure it's a great idea to parse any JSON by default. Even with some of the improvements that have been introduced recently to lessen the impact of exceeding the field limit, ignore malformed values, and avoid object/scalar issues, creating a field mapper for each individual JSON field is not always a good idea. It creates overhead, and while the impact of hitting the field limit is lowered, it can still prevent other, more useful fields in the same data stream from being mapped.
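For illustration, assuming the field-limit and malformed-value improvements mentioned above refer to the corresponding index settings, they could be applied like this (template name and pattern are made up):

```json
PUT _index_template/logs-json-sketch
{
  "index_patterns": ["logs-json-sketch-*"],
  "data_stream": {},
  "priority": 200,
  "template": {
    "settings": {
      "index.mapping.total_fields.ignore_dynamic_beyond_limit": true,
      "index.mapping.ignore_malformed": true
    }
  }
}
```

Even with both settings on, each parsed JSON key still costs a field mapper until the limit is reached, which is the overhead described above.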

So the approach to detect and recommend JSON parsing makes sense to me.

In the context of OpenTelemetry, there's a similar question about how the structured body field should be mapped. So far, my conclusion has been to map it to the flattened field type, which prevents arbitrary fields from adding to the mapping overhead while still allowing queries on those fields. Having said that, there are some limitations with that field type, such as proper support in Kibana (https://github.com/elastic/kibana/issues/25820) and proper support for numeric fields (https://github.com/elastic/elasticsearch/issues/61550).
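As a sketch of that approach (the concrete field name for the OTel log body is an assumption here):

```json
PUT _component_template/otel-body-flattened-sketch
{
  "template": {
    "mappings": {
      "properties": {
        "body": { "type": "flattened" }
      }
    }
  }
}
```

A flattened field indexes all leaf values of the object under a single mapped field, so arbitrary keys don't grow the mapping; the trade-off is that every value is effectively treated as a keyword, which is the root of the numeric-field limitation linked above.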