Open felixbarny opened 1 month ago
I am not too involved with ECS after its sync with otel, so this is more a personal preference than anything else.
I feel that option 1 is most feasible, that elastic package validation should be modified to allow multi-fields to be mapped to "at least" one of the multiple types defined in a multi field.
I do not know the impact on the work with syncing up ecs with otel, but even though option 1 is most feasible, I would have liked to see all .name to be changed to only keyword mapping types, though would be good to see if we take advantage of multi-fields on .name anywhere in our stack (UI, Siem/monitoring rule, prebuilt ml jobs etc).
I don't agree that the current state of ecs@mappings
makes it non ECS-compliant.
If you index a document that has the user_agent.name
field and you are familiar with ECS, you expect to have this field mapped to a keyword
. Having an additional field - user_agent.name.text
doesn't make it non ECS-compliant.
So IMHO, if there's any reason to consider option 2, it would be the storage savings. Since I don't know enough to say anything about the actual storage overhead in practice or the entire consequences of changing ECS, I don't have a good advice on which of the options is the best.
I don't agree that the current state of
ecs@mappings
makes it non ECS-compliant.
I think you can argue both ways. The issue is that there's no clear definition of whether or not having additional fields is ECS compliant, which leads to some assuming it is compliant and some it's not (like the elastic-package validation).
One valid outcome of this is that we're declaring having additional subfields is ECS compliant, which is option 1.
I don't have a strong opinion but I'm leaning towards either option 2 or 3, for this reason that I mentioned in the description:
whether or not a field should have a text subfield is a decision we should make at the ECS level rather than being a side-effect of how
ecs@mappings
is implemented.
Adding a text sub-field requires more storage but enables richer query capabilities. Whether or not that tradeoff is worth it should be decided in the process of adding a field to ECS. We may decide that all *.name
fields should have a text subfields, for example, to be consistent. But it should be a conscious decision rather than decided by an implementation detail.
I do not know the impact on the work with syncing up ecs with otel
This probably deserves a separate discussion. Bin in short, I think we should have a process that adds all (maybe only stable?) Semantic Convention fields to ECS, at which time we'll decide on the Elasticsearch field type. Ideally, this should be guided by naming conventions (like *.ip
-> ip
field type) as much as possible. Maybe we should also introduce the concept of more types (like ip and geo location) or type hints (still use string for compatibility reasons but add a hint for a more specific type) to OTLP. However, there will always be the need of defining Elasticsearch-specific field types, such as match_only_text
, semantic_text
, etc.
ECS would then become a superset of Semantic Conventions plus a mapping to Elasticsearch field types.
Having an additional field -
user_agent.name.text
doesn't make it non ECS-compliant.
Thinking about this again, I think an integration should still be considered ECS compliant if it adds additional sub-fields that aren't part of ECS but are relevant to that specific integration (like additional .caseless
fields). It does seem overly strict that elastic-package
would classify such integrations as non ECS compliant and failing the tests.
Having said that, I think the point still stands that whether or not a field should have a .text
subfield by default is a decision that should be made when adding the field to ECS.
I think it is not so much a question about ECS compliance, but about what to expect when using ecs@mappings
.
ecs@mappings
is in principle actually compliant with ECS, but by adding some convenient subfields, it is adding more "fields" to documents. So it is a matter of knowing what to expect when using ecs@mappings
. Does it add only ECS fields, or it adds more data beyond that? Where can we find a reference of the additional data it may be adding?
elastic-package
is neither testing ECS compliance. elastic-package
checks documents to verify that their fields have mappings and that they are documented in some place. When ecs@mappings
is used we are assuming that there may be fields that are not defined or documented in the package itself, but that they are defined and documented as part of ECS. In that regard, when ecs@mappings
adds data beyond ECS, elastic-package cannot know if this field is expected or not, and the only solution seems to be to disable validations for multifields. I guess similar questions might appear in other cases.
Regarding the options, if we think that it is a good idea that all .name
fields have a .text
subfield, then I wonder why we wouldn't do option 3. If we have doubts on that, then I think ecs@mappings
shouldn't be adding these subfields, and we should do option 2. If we don't do neither 2 nor 3, I think we should have some reference docs with the expected differences (maybe we already have it?).
In the meantime, for packages and elastic-package
, I guess we only have two options: package developers need to add mappings for the fields with additional subfields that their packages use, or elastic-package
should disable or relax validations on subfields.
Having said that, I think the point still stands that whether or not a field should have a .text subfield by default is a decision that should be made when adding the field to ECS.
That is the point I keep thinking about. There are 3 different approaches: Index as little as possible, find the perfect balance or index everything. Historically we opted for index everything as that is how the system is fast. Find the perfect balance is not possible, because the same data is used for different use case (o11y, security) and even inside use cases, there are different scenarios. Index as little as possible has the downside, that some "default" queries might not be fast.
What if we would have ecs-light
as the default, and context would decide if there is also an ecs-heavy
template used with caseless, text etc. A user installs an integration in o11y, by default caseless is not used. Install the same integration in security, it is there. In o11y, a user can click a setting to also enable "heavy" if they want. It might even be a global setting so the same logic applies to all data.
There are fields that potentially always have .text
to, but it is a small subset for light
. Which one is used in which scenarios now is not a problem of ECS itself but the solutions and user can change it. What ECS needs to provide is the compontent templates / blocks to put together the different scenarios.
elastic-package cannot know if this field is expected or not, and the only solution seems to be to disable validations for multifields. In the meantime, for packages and
elastic-package
, I guess we only have two options: package developers need to add mappings for the fields with additional subfields that their packages use, orelastic-package
should disable or relax validations on subfields.
I don't think we need to entirely disable the validation on multi-fields. But I think it makes sense to be lenient when there are additional multi-fields that were not expected.
Index as little as possible has the downside, that some "default" queries might not be fast.
In this case, it's not really about whether or not to index or whether queries are fast or slow. It's what capabilities are supported in search (exact matches with keyword vs match queries with match_only_text). But I think I see where you're going with that in general.
Some kind of solution or use-case specific extension to ECS does make sense to me. Ideally, that would also be based around consistent naming conventions. Not sure if it's a choice the user needs to make, though. For certain workflows or curated experiences, we may have expectations around how a field gets indexed. In that case, we should enable the additional mappings by default.
Proposal on a way forward:
ecs@mappings
to only add .text
subfields, as specified by ECS. Rationale is that whether the storage overhead of these sub-fields are worth it should be a decision that's made in ECS..text
subfields to all *.name
fields. This may lead to a bit of back-and-forth. But I think that's better compared to a discrepancy between ecs@mappings
ECS that's not resolved for a long time.Not sure whether we should change something in the behavior of the elastic-package
validation. On the one hand, it seems quite strict. But I now get that it's not validating ECS compliance but whether or not a field is documented. The way it does that is getting the mappings, removing all fields that are defined by ECS and in the package. If there are any fields left, it fails the validation. Do I get that right, @jsoriano?
But if that's true, aren't the validations for these packages failing regardless of whether or not LogsDB/synthetic _source is used? Or is it that we're validating on fields
when synthetic source is used and use _source
when not? If that's the case, why does validating on _source
not work with synthetic _source?
Proposal SGTM.
Not sure whether we should change something in the behavior of the
elastic-package
validation. On the one hand, it seems quite strict.
I think we need to change the behavior at least to support versions of ecs@mappings
that have been already released with the additional multifields (from 8.13 to 8.15). https://github.com/elastic/elastic-package/pull/2035 should be enough in that regard.
The way it does that is getting the mappings, removing all fields that are defined by ECS and in the package. If there are any fields left, it fails the validation. Do I get that right, @jsoriano?
Yeah, it is more or less like that. But it doesn't remove any field, it checks that any field in a document has a definition in the package itself or in ECS. It also checks that the values of the fields match with the defined type.
But if that's true, aren't the validations for these packages failing regardless of whether or not LogsDB/synthetic _source is used? Or is it that we're validating on
fields
when synthetic source is used and use_source
when not? If that's the case, why does validating on_source
not work with synthetic _source?
Yes, elastic-package uses fields
in some cases where there are fields that are not in the source, as is the case of synthetic source. And the discrepancies in test results come from there, the sets of fields are sometimes different when checking fields
or when checking _source
.
Historically (since the times of beats) we have been relying on _source
because it used to be enough, but it is not anymore with the many options we have now to control what is stored and how. We have to work on solving these discrepancies. https://github.com/elastic/elastic-package/issues/2016 goes in this line.
We'll open an issue in ECS to discuss adding .text subfields to all *.name fields. This may lead to a bit of back-and-forth. But I think that's better compared to a discrepancy between ecs@mappings ECS that's not resolved for a long time.
I just realized that such a thread already exists: https://github.com/elastic/ecs/issues/2118. However, there's no conclusion as of yet.
The
ecs@mappings
component template that ships with Elasticsearch by default has a dynamic template with path match on*.name
that adds a.text
subfield. However, in actual ECS, not all*.name
fields have a.text
subfield.The reasons why we still added the
*.name
rule in https://github.com/elastic/elasticsearch/pull/96171 included making the component template smaller, more consistent, more generic, and more forward compatible, in the sense that we don't need to constantly add new field definitions for new*.name
fields.However, in effect, the
ecs@mappings
component template isn't technically ECS compliant, which leads to issues likeelastic-package
reporting errors when integrations rely onecs@mappings
: https://github.com/elastic/elastic-package/issues/1971. It also seems like whether or not a field should have a text subfield is a decision we should make at the ECS level rather than being a side-effect of howecs@mappings
is implemented.In total, there are 150 ECS fields that end with
.name
. Out of these, 41 have a.text
sub-field and 109 don't.There are multiple options to move forward from here:
ecs@mappings
has a less efficient mapping compared to "proper ECS".ecs@mappings
with the current definition of ECS. We'll probably implement this by listing fields that have a.text
subfield as there are fewer of them. In other words,*.name
fields won't have a.text
subfield by default. We'll need to expect that changes toecs@mappings
are going to be a bit more frequent and that the mapping is less forwards compatible.ecs@mappings
to make sure all*.name
fields are mapped consistently. It'll be a bit easier for users to reason about what type of queries they can expect to work on*.name
fields. It would also bring us closer to a place where ECS is built around naming conventions rather than one-off per field decisions. However, this would add a bit more storage overhead compared to what we have today.cc'ing a couple of folks that may have thoughts on this: @ruflin @eyalkoren @jsoriano @zmoog @andrewkroh @P1llus