Open herrBez opened 6 months ago
Thanks for the detailed write up @herrBez . For me the target is, that all data going into logs-*-*
will have the correct default mappings, no matter if it is an integration or not. As you describe above, there are 2 problems:
Having all .name
fields a s a multi field with keyword / text came in as there were more and more scenarios where both were needed and we wanted to simplify things. For users that don't use it, it is a waist of resources but if you do, it is very powerful as it provides you full text search.
Few thoughts:
The k8s fields are not aligned, as ECS does not have k8s fields but orchestrator fields. Also the orchestrator fields do not cover the full list of fields from k8s (fields can be found here). That we end up with different mappings for these fields is not acceptable.
It seems the list of fields that causes issues is a very limited set of fields and we need to pick one of the two behaviours. Is there a use of non keyword fields for labels and annotations? If not, keyword seems like the right default.
There is also a problem in data views in Kibana. Even though in the above scenario, different mappings should not exist, there are cases where in all logs logs-*-*
it is valid to have different mapping types for the same field and not an error. For Elasticsearch it is not an error so it shouldn't be one for Kibana either. This issue has been raised here: https://github.com/elastic/kibana/issues/130242
Side note: The dedotting is not required anymore as we have introduced subobject false. This is a very recent addition.
The k8s fields are not aligned, as ECS does not have k8s fields but orchestrator fields. Also the orchestrator fields do not cover the full list of fields from k8s (fields can be found here). That we end up with different mappings for these fields is not acceptable.
I never noticed it but the change is pretty huge 😅 . Having labels and annotations in the form of key:value
instead of a separate fields is pretty different. (Side note: I don't understand why we are using singular orchestrator.label
instead of plural orchestrator.labels
). Further, it is different from how we manage container.labels
(or in general labels) as of now.
I kind of like the idea of the orchestrator field because:
kubectl
. For example, for kubectl get pod -L foo=bar
, we could write orchestrator.label: "foo:bar"
instead of kuberntes.labels.foo: bar
, which feels more k8s-like (but less Kibana like).It seems the list of fields that causes issues is a very limited set of fields and we need to pick one of the two behaviours. Is there a use of non keyword fields for labels and annotations? If not, keyword seems like the right default.
TLDR: I agree, keyword is the right call!
Long Answer: According to Kubernetes Documentation annotations value needs to strings. For labels we don't have a clear warning like for annotations, but I feel like they should be string as well. This issue seems to confirm it: https://github.com/kubernetes/kubernetes/issues/57509. Trying to create an object with a non-string labels will fails. Even if it would be allowed, labels are user-selectors and it may also happen that different teams would potentially use different types for the same labels. Opening space for additional problems. So, I sing keyword in any case is the right thing to do for the fields (and it is also what we are doing now with the kubernetes integration).
There is also a problem in data views in Kibana. Even though in the above scenario, different mappings should not exist, there are cases where in all logs logs-- it is valid to have different mapping types for the same field and not an error. For Elasticsearch it is not an error so it shouldn't be one for Kibana either. This issue has been raised here: https://github.com/elastic/kibana/issues/130242
While I appreciate the fact that having more precise datasets will avoid this kind of issues, I somehow feel it's kind of a betrayal of the original promise of ECS: be allowed to correlate different sources
Side note: The dedotting is not required anymore as we have introduced subobject false. This is a very recent addition.
Is it also in-use by the different integrations? Last time I checked it was not enabled
+1 to the orchestrator fields.
About
Change UI: Users are using it today, because the UI auto suggest it and it looks like it is the one you should use. If a user would just write an Elasticsearch query without auto complete, most users I assume would not add it because they don't know it is there in the first place. Would adjustments to the UI behaviour help here?
While possible, this doesn't seem like a good approach to me - if the .text
field is there, it is actually quite helpful to query. If there isn't a use case, let's fix it on the mapping level.
While I appreciate the fact that having more precise datasets will avoid this kind of issues, I somehow feel it's kind of a betrayal of the original promise of ECS: be allowed to correlate different sources
I'm talking here about fields which are not part of ECS. ECS is the foundation but users always have their own fields. It is ok if these are different in different datasets. Kibana should only complain, if the ECS fields are not the same.
if the .text field is there, it is actually quite helpful to query.
I wonder if most users know the difference between querying / filtering on a text vs keyword field? In quite a few cases, both are needed and we should support users in making the right choice.
What are our next steps here? I expect the kubernetes fields problem to stick around for a long time even if we recommend moving to orchestrator fields. One idea: Have a special mapping for labels to ensure these are ALWAYS keywords (and only keywords) in the ecs template?
What are our next steps here? I expect the kubernetes fields problem to stick around for a long time even if we recommend moving to orchestrator fields. One idea: Have a special mapping for labels to ensure these are ALWAYS keywords (and only keywords) in the ecs template?
You mean keyword-only for *.annotations
? That would still leave the question of what to do with pod.name
- switching *.name
to keyword-only is probably going too far.
Another idea would be to always add logs@custom
to the stack of component templates. This way, the user could specify these mappings in a central place, which doesn't completely automate the case, but at least makes it easier to manage.
You mean keyword-only for *.annotations?
Pick the fields where it is required to have keyword only and ensure either flattened or not flattened is aligned with kubernetes integrations or swap it.
For .text
, I would not change it but rather think how users could opt out or as described above, could change the UI slightly.
About logs@custom, I assume it is currently there for all non integration datasets but for integrations, it is not included? I agree, it should be there not only for this case.
About logs@custom, I assume it is currently there for all non integration datasets but for integrations, it is not included? I agree, it should be there not only for this case.
It is in the main logs template (non-integration logs streams in DSNS), but not for integrations:
Should be a simple change though
@kpollich ^
I assume priority wise it would sit after the @package
template?
I assume priority wise it would sit after the @package template?
This makes sense to me, the new composed_of
order of component templates for logs data streams would then be
logs@mappings
logs@settings
logs-{integration.name}.{dataset}@package
logs@custom
logs-{integration.name}.{dataset}@custom
ecs@mappings
.fleet-global-1
.fleet_agent_id_verification-1
General methodology here as I understand it is "stack-defined mappings" -> "package defined mappings" -> "user defined mappings" -> "ecs mappings" -> "fleet metadata". Am I understanding the intent correctly here?
the twist around ecs-mappings
is that these are all dynamic. So if a more specific template existed before, it will still win (at least that is my understanding).
I took a look at the code in Kibana to see if this would be a quick change, and it's a little more involved than I initially thought. We need to do a bit of a refactor because the @package
and @custom
templates are coupled, so inserting another template between them requires some reworking of the code.
Problem Statement
It is not uncommon to read logs from Kubernetes and send them into different integrations (e.g., if I am reading the data from an Nginx Pod I want to use the NGINX integration etc.). To accomplish this goal we have at least two alternatives:
Autodiscoery works perfectly fine for Filebeat, where all kubernetes metadata fields are specified in the template, in the integration case it can lead to unwanted consequences, i.e., creation of unexpected fields and potentially mapping differences between (kubernetes.container_logs and other integration datasets).
Indeed, since the kubernetes dynamic template rules and static fields are not defined, we may end up in having different mappings in different datastreams. A couple of examples:
kubernetes.pod.name
: it has always been a keyword. However, if the data is rerouted to another integration it will result in the creation ofkubernetes.pod.name.text
field because of the dynamic template ruleecs_path_match_keyword_and_match_only_text
The
kubernetes.container_logs
integration containskubernetes.node.labels
,kubernetes.node.annotations
,kubernetes.namespace_labels
,kubernetes.labels
,kubernetes.namespace_annotations
andkubernetes.selectors
dynamic template rules that match every type and map them as keyword (note that nested fields likekuberntes.labels.test.foo
are not accepted. The fields are dedoted by Beats https://www.elastic.co/guide/en/fleet/current/kubernetes-provider.html).Potential Impact
Creating unnecessary fields is a potential waste of resources and can slow down ingestion
If users starts accidentally use the field
kubernetes.pod.name.text
to define their alerts or to query data they may unwillingly filter out the kubernetes.container_logsIf for some reasons we have a label of type integer (not sure it can actually happen) we may have different mappings between NGINX and Kubernetes integration. Labels are used as identifier for Kubernetes and can be used to group together all logs for a given application. Having different mappings can make a field not usable in Kibana or not viable for certain types of queries.
This will be mapped as keyword
POST logs-kubernetes.container_logs-testissue/_doc?op_type=create { "@timestamp": "2024-01-01", "kubernetes": { "annotations": { "foo": 34 } } }
kubernetes.annotations.foo will be indexed as an integer
POST logs-nginx.access-testissue/_doc?op_type=create { "@timestamp": "2024-01-01", "kubernetes": { "annotations": { "foo": { "bar": 34 } } } }
kubernetes.annotations.foo will not be indexed (the document will be rejected in 8.13.3)
POST logs-kubernetes.container_logs-testissue/_doc?op_type=create { "@timestamp": "2024-01-01", "kubernetes": { "annotations": { "foo": { "bar": 34 } } } }