Resolve discrepancy between text subfield handling for `*.name` fields in `ecs@mappings`

elastic / ecs

Elastic Common Schema

https://www.elastic.co/what-is/ecs

Apache License 2.0

996 stars 413 forks source link

Resolve discrepancy between text subfield handling for `*.name` fields in `ecs@mappings` #2353

Open felixbarny opened 1 month ago

felixbarny commented 1 month ago

The ecs@mappings component template that ships with Elasticsearch by default has a dynamic template with path match on *.name that adds a .text subfield. However, in actual ECS, not all *.name fields have a .text subfield.

The reasons why we still added the *.name rule in https://github.com/elastic/elasticsearch/pull/96171 included making the component template smaller, more consistent, more generic, and more forward compatible, in the sense that we don't need to constantly add new field definitions for new *.name fields.

However, in effect, the ecs@mappings component template isn't technically ECS compliant, which leads to issues like elastic-package reporting errors when integrations rely on ecs@mappings: https://github.com/elastic/elastic-package/issues/1971. It also seems like whether or not a field should have a text subfield is a decision we should make at the ECS level rather than being a side-effect of how ecs@mappings is implemented.

In total, there are 150 ECS fields that end with .name. Out of these, 41 have a .text sub-field and 109 don't.

There are multiple options to move forward from here:

Do nothing: define that having additional sub-fields is not a violation of ECS. We'll update the validation logic for elastic-package accordingly. However, ECS compatibility has less strong guarantees this way. It also means that ecs@mappings has a less efficient mapping compared to "proper ECS".
Align ecs@mappings with the current definition of ECS. We'll probably implement this by listing fields that have a .text subfield as there are fewer of them. In other words, *.name fields won't have a .text subfield by default. We'll need to expect that changes to ecs@mappings are going to be a bit more frequent and that the mapping is less forwards compatible.
Align ECS with ecs@mappings to make sure all *.name fields are mapped consistently. It'll be a bit easier for users to reason about what type of queries they can expect to work on *.name fields. It would also bring us closer to a place where ECS is built around naming conventions rather than one-off per field decisions. However, this would add a bit more storage overhead compared to what we have today.

cc'ing a couple of folks that may have thoughts on this: @ruflin @eyalkoren @jsoriano @zmoog @andrewkroh @P1llus

P1llus commented 1 month ago

I am not too involved with ECS after its sync with otel, so this is more a personal preference than anything else.

I feel that option 1 is most feasible, that elastic package validation should be modified to allow multi-fields to be mapped to "at least" one of the multiple types defined in a multi field.

I do not know the impact on the work with syncing up ecs with otel, but even though option 1 is most feasible, I would have liked to see all .name to be changed to only keyword mapping types, though would be good to see if we take advantage of multi-fields on .name anywhere in our stack (UI, Siem/monitoring rule, prebuilt ml jobs etc).

eyalkoren commented 1 month ago

I don't agree that the current state of ecs@mappings makes it non ECS-compliant. If you index a document that has the user_agent.name field and you are familiar with ECS, you expect to have this field mapped to a keyword. Having an additional field - user_agent.name.text doesn't make it non ECS-compliant.

So IMHO, if there's any reason to consider option 2, it would be the storage savings. Since I don't know enough to say anything about the actual storage overhead in practice or the entire consequences of changing ECS, I don't have a good advice on which of the options is the best.

felixbarny commented 1 month ago

I don't agree that the current state of ecs@mappings makes it non ECS-compliant.

I think you can argue both ways. The issue is that there's no clear definition of whether or not having additional fields is ECS compliant, which leads to some assuming it is compliant and some it's not (like the elastic-package validation).

One valid outcome of this is that we're declaring having additional subfields is ECS compliant, which is option 1.

I don't have a strong opinion but I'm leaning towards either option 2 or 3, for this reason that I mentioned in the description:

whether or not a field should have a text subfield is a decision we should make at the ECS level rather than being a side-effect of how ecs@mappings is implemented.

Adding a text sub-field requires more storage but enables richer query capabilities. Whether or not that tradeoff is worth it should be decided in the process of adding a field to ECS. We may decide that all *.name fields should have a text subfields, for example, to be consistent. But it should be a conscious decision rather than decided by an implementation detail.

felixbarny commented 1 month ago

I do not know the impact on the work with syncing up ecs with otel

This probably deserves a separate discussion. Bin in short, I think we should have a process that adds all (maybe only stable?) Semantic Convention fields to ECS, at which time we'll decide on the Elasticsearch field type. Ideally, this should be guided by naming conventions (like *.ip -> ip field type) as much as possible. Maybe we should also introduce the concept of more types (like ip and geo location) or type hints (still use string for compatibility reasons but add a hint for a more specific type) to OTLP. However, there will always be the need of defining Elasticsearch-specific field types, such as match_only_text, semantic_text, etc.

ECS would then become a superset of Semantic Conventions plus a mapping to Elasticsearch field types.

felixbarny commented 1 month ago

Having an additional field - user_agent.name.text doesn't make it non ECS-compliant.

Thinking about this again, I think an integration should still be considered ECS compliant if it adds additional sub-fields that aren't part of ECS but are relevant to that specific integration (like additional .caseless fields). It does seem overly strict that elastic-package would classify such integrations as non ECS compliant and failing the tests.

Having said that, I think the point still stands that whether or not a field should have a .text subfield by default is a decision that should be made when adding the field to ECS.

jsoriano commented 1 month ago

I think it is not so much a question about ECS compliance, but about what to expect when using ecs@mappings.

ecs@mappings is in principle actually compliant with ECS, but by adding some convenient subfields, it is adding more "fields" to documents. So it is a matter of knowing what to expect when using ecs@mappings. Does it add only ECS fields, or it adds more data beyond that? Where can we find a reference of the additional data it may be adding?

elastic-package is neither testing ECS compliance. elastic-package checks documents to verify that their fields have mappings and that they are documented in some place. When ecs@mappings is used we are assuming that there may be fields that are not defined or documented in the package itself, but that they are defined and documented as part of ECS. In that regard, when ecs@mappings adds data beyond ECS, elastic-package cannot know if this field is expected or not, and the only solution seems to be to disable validations for multifields. I guess similar questions might appear in other cases.

Regarding the options, if we think that it is a good idea that all .name fields have a .text subfield, then I wonder why we wouldn't do option 3. If we have doubts on that, then I think ecs@mappings shouldn't be adding these subfields, and we should do option 2. If we don't do neither 2 nor 3, I think we should have some reference docs with the expected differences (maybe we already have it?).

In the meantime, for packages and elastic-package, I guess we only have two options: package developers need to add mappings for the fields with additional subfields that their packages use, or elastic-package should disable or relax validations on subfields.

ruflin commented 1 month ago

Having said that, I think the point still stands that whether or not a field should have a .text subfield by default is a decision that should be made when adding the field to ECS.

That is the point I keep thinking about. There are 3 different approaches: Index as little as possible, find the perfect balance or index everything. Historically we opted for index everything as that is how the system is fast. Find the perfect balance is not possible, because the same data is used for different use case (o11y, security) and even inside use cases, there are different scenarios. Index as little as possible has the downside, that some "default" queries might not be fast.

What if we would have ecs-light as the default, and context would decide if there is also an ecs-heavy template used with caseless, text etc. A user installs an integration in o11y, by default caseless is not used. Install the same integration in security, it is there. In o11y, a user can click a setting to also enable "heavy" if they want. It might even be a global setting so the same logic applies to all data.

There are fields that potentially always have .text to, but it is a small subset for light. Which one is used in which scenarios now is not a problem of ECS itself but the solutions and user can change it. What ECS needs to provide is the compontent templates / blocks to put together the different scenarios.

felixbarny commented 1 month ago

elastic-package cannot know if this field is expected or not, and the only solution seems to be to disable validations for multifields. In the meantime, for packages and elastic-package, I guess we only have two options: package developers need to add mappings for the fields with additional subfields that their packages use, or elastic-package should disable or relax validations on subfields.

I don't think we need to entirely disable the validation on multi-fields. But I think it makes sense to be lenient when there are additional multi-fields that were not expected.

Index as little as possible has the downside, that some "default" queries might not be fast.

In this case, it's not really about whether or not to index or whether queries are fast or slow. It's what capabilities are supported in search (exact matches with keyword vs match queries with match_only_text). But I think I see where you're going with that in general.

Some kind of solution or use-case specific extension to ECS does make sense to me. Ideally, that would also be based around consistent naming conventions. Not sure if it's a choice the user needs to make, though. For certain workflows or curated experiences, we may have expectations around how a field gets indexed. In that case, we should enable the additional mappings by default.

felixbarny commented 3 weeks ago

Proposal on a way forward:

We're modifying ecs@mappings to only add .text subfields, as specified by ECS. Rationale is that whether the storage overhead of these sub-fields are worth it should be a decision that's made in ECS.
We'll open an issue in ECS to discuss adding .text subfields to all *.name fields. This may lead to a bit of back-and-forth. But I think that's better compared to a discrepancy between ecs@mappings ECS that's not resolved for a long time.

Not sure whether we should change something in the behavior of the elastic-package validation. On the one hand, it seems quite strict. But I now get that it's not validating ECS compliance but whether or not a field is documented. The way it does that is getting the mappings, removing all fields that are defined by ECS and in the package. If there are any fields left, it fails the validation. Do I get that right, @jsoriano? But if that's true, aren't the validations for these packages failing regardless of whether or not LogsDB/synthetic _source is used? Or is it that we're validating on fields when synthetic source is used and use _source when not? If that's the case, why does validating on _source not work with synthetic _source?

jsoriano commented 3 weeks ago

Proposal SGTM.

Not sure whether we should change something in the behavior of the elastic-package validation. On the one hand, it seems quite strict.

I think we need to change the behavior at least to support versions of ecs@mappings that have been already released with the additional multifields (from 8.13 to 8.15). https://github.com/elastic/elastic-package/pull/2035 should be enough in that regard.

The way it does that is getting the mappings, removing all fields that are defined by ECS and in the package. If there are any fields left, it fails the validation. Do I get that right, @jsoriano?

Yeah, it is more or less like that. But it doesn't remove any field, it checks that any field in a document has a definition in the package itself or in ECS. It also checks that the values of the fields match with the defined type.

But if that's true, aren't the validations for these packages failing regardless of whether or not LogsDB/synthetic _source is used? Or is it that we're validating on fields when synthetic source is used and use _source when not? If that's the case, why does validating on _source not work with synthetic _source?

Yes, elastic-package uses fields in some cases where there are fields that are not in the source, as is the case of synthetic source. And the discrepancies in test results come from there, the sets of fields are sometimes different when checking fields or when checking _source.

Historically (since the times of beats) we have been relying on _source because it used to be enough, but it is not anymore with the many options we have now to control what is stored and how. We have to work on solving these discrepancies. https://github.com/elastic/elastic-package/issues/2016 goes in this line.

felixbarny commented 2 weeks ago

We'll open an issue in ECS to discuss adding .text subfields to all *.name fields. This may lead to a bit of back-and-forth. But I think that's better compared to a discrepancy between ecs@mappings ECS that's not resolved for a long time.

I just realized that such a thread already exists: https://github.com/elastic/ecs/issues/2118. However, there's no conclusion as of yet.