Using ECS fields in Elasticsearch indices / data streams

ruflin commented 2 years ago

The number of ECS fields has grown over the last few years and now exceeds the 1024 limit for fields in Elasticsearch. This causes issues in several places. In addition, in Beats, for example, we started importing all ECS fields by default, which means the Beats templates and mappings define many fields that are likely never used. This issue is meant to discuss the problem in more detail, along with potential solutions.

Some of the guidelines for a solution I have in mind:

- Minimal mapping: The resulting mapping of indices / data streams should only contain the fields that are actually used. This keeps the mappings compact and ensures that field recommendations only cover fields that actually exist.
- Assume 100+ data streams using ECS mappings: A large number of data streams with these ECS mappings and templates will exist. We need to think about what this means for Elasticsearch, the cluster state, etc.
- ECS compatible: A user should be able to index any fields as long as they do not conflict with ECS. For example, host can only be an object, not a keyword.

Potential solutions

To kick things off, here is a list of potential solutions, though I think none of them is ideal.

Data streams with ECS enabled

A data stream gets a setting such as ecs: true. With this flag set, the data stream has the mappings for the ECS fields available by default. No templates would have to be set up, but the fields could be extended through templates. How exactly this would work, I don't know :-)

Use dynamic templates

Instead of specifying the fields directly, dynamic templates are used. This has the benefit that fields don't show up in the mappings until they are actually used. This ECS template could exist in Elasticsearch as a component template. I wonder what the effect would be if 100 templates use this ECS template? @jpountz Does it mean it exists only once in the cluster state or 100 times?

Dynamic template example for `log.level`:

```
DELETE /test?ignore_unavailable
PUT /test
{
  "mappings": {
    "dynamic_templates": [
      {
        "log_level": {
          "path_match": "log.level",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ]
  }
}
```
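To illustrate why fields matched by a dynamic template stay out of the mapping until a document uses them, here is a minimal Python sketch. It is a simplified model, not Elasticsearch code: the helper names `flatten` and `apply_dynamic_templates` are made up for this example, only `path_match` is considered, and `fnmatchcase` stands in for Elasticsearch's pattern matching.

```python
from fnmatch import fnmatchcase

# Dynamic templates in the same shape as the Elasticsearch mapping above.
DYNAMIC_TEMPLATES = [
    {"log_level": {"path_match": "log.level", "mapping": {"type": "keyword"}}},
]


def flatten(doc, prefix=""):
    """Yield dotted field paths for every leaf value in a nested document."""
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten(value, prefix=f"{path}.")
        else:
            yield path


def apply_dynamic_templates(mapping, doc, templates=DYNAMIC_TEMPLATES):
    """Add a mapping entry for each new field whose path matches a template.

    Simplified model: only path_match is considered and the first matching
    template wins. Unmatched fields are left out here, whereas Elasticsearch
    would fall back to its default dynamic mapping for them.
    """
    for path in flatten(doc):
        if path in mapping:
            continue
        for template in templates:
            (rule,) = template.values()
            if fnmatchcase(path, rule["path_match"]):
                mapping[path] = dict(rule["mapping"])
                break
    return mapping


mapping = {}
apply_dynamic_templates(mapping, {"message": "started"})
print(mapping)  # log.level is not part of the mapping yet
apply_dynamic_templates(mapping, {"log": {"level": "info"}})
print(mapping)  # log.level shows up only after a document used it
```

The point of the sketch is the second call: the template has existed all along, but the `log.level` entry is only created once a document contains that field.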

Use dynamic templates to match most of the fields

Most ECS fields are of type keyword. Dynamic template rules can be used to map the majority of the fields correctly, similar to https://github.com/elastic/elasticsearch/blob/feature/apm-integration/x-pack/plugin/core/src/main/resources/data-streams-mappings.json. This has the advantage that the mapping stays compact and only enforces mappings for fields that are likely to conflict, like host or error. The downside is that not all of ECS is enforced, so someone can ingest fields that conflict with ECS.
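A rough sketch of what such a compact mapping could look like, written as Python that builds the request body. The rule name `strings_as_keyword`, the `ignore_above` value, and the short `CONFLICT_PRONE` list are assumptions for illustration; the real list of explicitly mapped fields would have to come from the ECS definitions.

```python
import json

# Hypothetical short list of fields that are likely to conflict with ECS
# when left to dynamic mapping (taken from the examples in the text).
CONFLICT_PRONE = {
    "host": {"type": "object"},
    "error": {"type": "object"},
}


def build_mappings(conflict_prone=CONFLICT_PRONE):
    """Return a mappings body: one catch-all rule plus a few explicit fields."""
    return {
        "dynamic_templates": [
            {
                # Map every new string field to keyword, since most ECS
                # fields are keywords.
                "strings_as_keyword": {
                    "match_mapping_type": "string",
                    "mapping": {"type": "keyword", "ignore_above": 1024},
                }
            }
        ],
        # Explicit entries only where a wrong type would conflict with ECS.
        "properties": dict(conflict_prone),
    }


print(json.dumps({"mappings": build_mappings()}, indent=2))
```

With this shape, a document sending host as a plain string would be rejected because host is pinned to an object, while any other ECS keyword field is picked up by the catch-all rule without being listed.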

Define on ingest time

If I remember correctly, Elasticsearch supports defining the type of a field at ingest time. If we are in control of both data creation and the data shipper, the shipper could take over the role of enforcing and shipping ECS. This might work for cases like alerts inside Kibana, but it does not work for use cases where we don't control the data.

Heavily use runtime fields in data views

Instead of specifying all the mappings, rely heavily on runtime fields for ECS. Instead of having ECS in the Elasticsearch mapping, offer it as a checkbox or similar in data views. This way, users can query all the ECS fields, but we don't enforce ECS at ingest time.
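As a sketch of the runtime field idea at the search level (data views would do something similar on the Kibana side), the following Python builds a `runtime_mappings` section for ECS fields that are missing from the index mapping. `ECS_TYPES` is a hypothetical three-field excerpt and the `runtime_mappings` helper is made up for this example; a runtime field declared without a script reads the same-named value from `_source` at query time.

```python
import json

# Hypothetical excerpt of ECS field types; not the full schema.
ECS_TYPES = {
    "process.pid": "long",
    "log.level": "keyword",
    "source.ip": "ip",
}


def runtime_mappings(ecs_types, mapped_fields):
    """Build runtime field definitions for ECS fields not in the index mapping."""
    return {
        name: {"type": ecs_type}
        for name, ecs_type in ecs_types.items()
        if name not in mapped_fields
    }


# Search request body: log.level is already mapped, so only the other two
# fields are declared as runtime fields for this query.
search_body = {
    "runtime_mappings": runtime_mappings(ECS_TYPES, mapped_fields={"log.level"}),
    "query": {"term": {"process.pid": 4242}},
}
print(json.dumps(search_body, indent=2))
```

The trade-off is the one described above: nothing is enforced at ingest time, and the cost of interpreting the field moves to query time.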

Split up ECS in multiple layers

One of the core issues is that ECS keeps growing, and this is very likely not going to stop. In the early days of ECS we discussed having layers of ECS, something like core, base, and extended. There are just very few fields which everyone should be aware of, then there are base fields which are very common, and then we have extended with multiple groups / use cases.

Having such a grouping would make it possible, for example, to include only core or base in our templates instead of ALL fields.
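A small sketch of how layered field definitions could drive template generation. The `layer` attribute and the assignment of fields to core / base / extended below are invented for illustration and do not reflect an actual ECS grouping.

```python
# Hypothetical field definitions; the layer assignments are made up.
FIELDS = [
    {"name": "@timestamp", "type": "date", "layer": "core"},
    {"name": "message", "type": "match_only_text", "layer": "core"},
    {"name": "host.name", "type": "keyword", "layer": "base"},
    {"name": "log.level", "type": "keyword", "layer": "base"},
    {"name": "process.thread.name", "type": "keyword", "layer": "extended"},
]


def properties_for(fields, layers):
    """Build a nested mapping "properties" tree from the fields in the given layers."""
    properties = {}
    for field in fields:
        if field["layer"] not in layers:
            continue
        node = properties
        *parents, leaf = field["name"].split(".")
        # Walk (and create) the object hierarchy for dotted field names.
        for parent in parents:
            node = node.setdefault(parent, {"properties": {}})["properties"]
        node[leaf] = {"type": field["type"]}
    return properties


core_only = properties_for(FIELDS, {"core"})
core_and_base = properties_for(FIELDS, {"core", "base"})
print(sorted(core_only))      # top-level names when only core is included
print(sorted(core_and_base))  # top-level names when base is added
```

A template generator built this way would let each data stream pick the layers it needs instead of importing every field.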

ruflin commented 2 years ago

@jpountz @felixbarny @andrewkroh @kobelb @Mpdreamz Please include anyone else that might have opinions here.

ruflin commented 2 years ago

I added one more potential solution called "Split up ECS in multiple layers".

jpountz commented 2 years ago

> Does it mean it exists only once in the cluster state or 100 times?

It would exist 100 times.

At first sight the option I like the most is the first one of somehow packaging ECS with Elasticsearch so that Elasticsearch could optionally make dynamic mapping rules ECS-compliant. We'd naturally have ECS mappings in a single place this way without polluting the cluster state, and indices could always get the latest ECS conventions for newly introduced fields, even if they were created a while ago.

I wonder if it would also make it easier for integrations to not include optional fields, so that they wouldn't be suggested in Kibana even though these fields are populated in none of the documents, like we saw on the netflow integration.

felixbarny commented 2 years ago

That would be fantastic!

Would that be relatively easy or rather challenging to implement in Elasticsearch?

jpountz commented 2 years ago

Haha, you're asking too much of me, I don't know. :) I opened an issue on the Elasticsearch repository to start gathering feedback: https://github.com/elastic/elasticsearch/issues/85692.

ruflin commented 2 years ago

Having the magic in ES would be fantastic indeed. I expect it would solve all the various problems we have.

javanna commented 2 years ago

We have had a couple of rounds of discussion about the linked Elasticsearch issue. Some aspects around using dynamic templates have come up, similar to how they were mentioned above. We had some additional questions to get a better understanding of the problem that we are solving and of possible solutions. It may be quicker to have a high-bandwidth discussion and get everybody on the same page.

djptek commented 2 years ago

@javanna can you link me in on that one, OK @ebeahan?

javanna commented 2 years ago

We discussed the different options with @felixbarny and @ruflin today and agreed that the primary goal should be to make it easier for users to use ECS mappings. That would be the number one reason to integrate ECS within Elasticsearch and automatically apply its mappings to e.g. logs data.

We discussed how dynamic templates should help in mapping only fields that appear in documents, so that we can reduce the size of mappings.

The plan is to come up with a core set of fields that we would like to integrate within Elasticsearch, see how their mappings differ from the default dynamic mappings, and evaluate how to move forward with that limited set of fields to start with.

Mpdreamz commented 2 years ago

Is the end goal still to include all ECS fieldsets by default?

E.g., a variation of the component templates defined here: https://github.com/elastic/ecs/tree/main/generated/elasticsearch/composable/component that sets up the dynamic templates for each fieldset.

When a data stream is set up, it should be super easy for users to opt in to ECS, e.g. through an ecs: true flag. Alternatively, an even better user experience would be an ecs: all for all fieldsets and an opt-in for more tailored use cases: ecs: ['process', 'faas', 'cloud', 'service']

If we only do it for certain fields, we put cognitive load on users to work out the difference between what Elasticsearch ships with and what their use case needs.

felixbarny commented 2 years ago

The question is whether we want to advocate for indexing all ECS fields by default. I suppose we don't necessarily want to do that and instead only index fields that are frequently used in searches and aggregations to save disk space and CPU cycles on ingest.

For fields that aren't indexed by default, I think that we don't need to enforce mappings or types as runtime fields are much more lenient when it comes to type mismatches.

Mpdreamz commented 2 years ago

> I suppose we don't necessarily want to do that and instead only index fields that are frequently used in searches and aggregations to save disk space and CPU cycles on ingest.

The power of ECS is knowing I can query data from different producers and correlate them. I am not necessarily advocating for indexing all fields, but for creating a sane default experience where deviations and mapping overrides are the exception and not the norm. If this means the component templates default to index: false for most fields beyond the ones we know are often used in queries, I'd be all for it.

> The question is whether we want to advocate for indexing all ECS fields by default.

It wouldn't be indexing/creating mappings for ALL ECS fields either though, only the ones the user/solution actually uses. This is already a massive improvement over today's guidelines around setting up ECS component template mappings. It would also simplify Elastic Package integration development, as keeping ECS-compatible fields in sync with the spec becomes a non-issue.

> For fields that aren't indexed by default, I think that we don't need to enforce mappings or types as runtime fields are much more lenient when it comes to type mismatches.

ECS fields are well understood, and enforcing structure/type checks is a feature IMO, not something we would look to smooth over with runtime fields.

felixbarny commented 2 years ago

> ECS fields are well understood, and enforcing structure/type checks is a feature IMO, not something we would look to smooth over with runtime fields.

I would push back on that part. Example: someone ingests a document like "process.pid": "foo", while ECS specifies process.pid as a number. Surely, the doc should be marked as containing malformed data. But the doc might be "mostly good" otherwise, and we shouldn't reject the doc or prevent the good fields from being indexed in the requested data stream.

If we agree that this ought to be the desired behavior, we only need to explicitly map ECS fields that should be indexed by default.
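To make the "mostly good" document case concrete, here is a small Python model of lenient indexing. It is an assumption on my side that this would be expressed with something like the existing `ignore_malformed` mapping parameter, which skips the malformed field and records its name in `_ignored`. The names `PROCESS_PID_MAPPING`, `PARSERS`, and `index_leniently` are made up for this sketch, and the model only handles flat documents.

```python
# Mapping fragment using the ignore_malformed parameter (assumed approach).
PROCESS_PID_MAPPING = {"process.pid": {"type": "long", "ignore_malformed": True}}

# Simplified stand-in for type parsing: process.pid must parse as an integer.
PARSERS = {"process.pid": int}


def index_leniently(doc, parsers=PARSERS):
    """Return (indexed_fields, ignored_field_names) for a flat document.

    Fields that fail to parse are skipped and reported instead of causing
    the whole document to be rejected.
    """
    indexed, ignored = {}, []
    for field, value in doc.items():
        parser = parsers.get(field, lambda v: v)
        try:
            indexed[field] = parser(value)
        except (TypeError, ValueError):
            ignored.append(field)
    return indexed, ignored


indexed, ignored = index_leniently({"process.pid": "foo", "message": "ok"})
print(indexed)  # the good field is kept
print(ignored)  # the malformed field is reported, the document is not rejected
```

In this model the document with "process.pid": "foo" still gets its message indexed, and the malformed field can be surfaced to the user afterwards.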

> It wouldn't be indexing/creating mappings for ALL ECS fields either though, only the ones the user/solution actually uses.

I think there's a difference between "a solution uses a field" vs "the field is used in searches or aggregations frequently". For example, while APM uses error.stacktrace, it might be a field we wouldn't want to be indexed by default. There's probably just a handful of fields that need to be indexed. These should be in the index template for <type>-*-*. For logs, I have currently identified these fields as a good start: https://github.com/elastic/elasticsearch/pull/88181/files#diff-0fba05a9236d9aa3866db886f0b6c24759be01a775023e1663b8f44f91951e62.

Mpdreamz commented 2 years ago

> I would push back on that part. Example: someone ingests a document like "process.pid": "foo", while ECS specifies process.pid as a number. Surely, the doc should be marked as containing malformed data. But the doc might be "mostly good" otherwise, and we shouldn't reject the doc or prevent the good fields from being indexed in the requested data stream.

This is a slippery slope IMO. We should very clearly delineate responsibilities:

Explicit target datastreams {type}-{dataset}-{namespace} should have explicit mappings.
- It should be dead simple to be ECS compliant and opt in to all or some field sets.
- Maybe the mechanism that supports opting into ECS fields can specify the default for index: true|false.
- This dataset should override the process.pid mapping to be more lenient if data can be sent as a string. Map the exceptions.
- Adding runtime fields here should be to get out of a bind.

Fallback datastreams {type}-generic-namespace should have a very minimal mapping and almost always accept documents.

felixbarny commented 2 years ago

I don't think we should have different mappings for the default and non-default data streams. At least for logs, I would argue that both specific (such as logs-foo-bar) and default (logs-generic-default) data streams should be lenient and accept any documents, even if a document has a field with a type mismatch. The default data stream logs-generic-default is just a regular data stream without special semantics. It just happens to be the default if no specific value is provided.

Other data stream types, such as traces might have a more strict schema and disallow dynamic fields. In this case, we'll always have to explicitly specify which (ECS) fields should be part of the mapping.
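For the strict case, a minimal sketch of a mapping with `"dynamic": "strict"`, where every allowed field has to be listed explicitly. The two fields shown are placeholders, not a proposed traces schema, and the helpers `mapped_paths` and `unmapped_fields` are hypothetical names that only model which fields would be rejected.

```python
# Hypothetical strict mapping for a traces-like data stream.
STRICT_MAPPINGS = {
    "dynamic": "strict",
    "properties": {
        "@timestamp": {"type": "date"},
        "trace": {"properties": {"id": {"type": "keyword"}}},
    },
}


def mapped_paths(properties, prefix=""):
    """Return the dotted paths of all leaf fields defined in a properties tree."""
    paths = set()
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if "properties" in spec:
            paths |= mapped_paths(spec["properties"], prefix=f"{path}.")
        else:
            paths.add(path)
    return paths


def unmapped_fields(flat_doc, mappings=STRICT_MAPPINGS):
    """List fields of a flat (dotted-path) document that the strict mapping lacks."""
    return sorted(set(flat_doc) - mapped_paths(mappings["properties"]))


doc = {"@timestamp": "2022-09-01T00:00:00Z", "trace.id": "abc", "custom.field": 1}
print(unmapped_fields(doc))  # fields that would make a strict mapping reject the doc
```

With a strict mapping there is no dynamic fallback, so the ECS fields a data stream relies on always have to be part of the explicit mapping.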

ruflin commented 2 years ago

I have opened https://github.com/elastic/elasticsearch/pull/89743 as a draft PR to discuss potential default mappings in detail. Please have a look at the PR description, and especially at the comments on each field.
