elastic / integrations

Elastic Integrations
https://www.elastic.co/integrations
Other
220 stars 453 forks source link

[Discussion] Every (log?) integration should deal with Kubernetes fields like the Kubernetes integration (or vice-versa) #9808

Open herrBez opened 7 months ago

herrBez commented 7 months ago

🚩 In the following only the logs use-case have been considered. It is not excluded we may have similar issues with metrics.

Problem Statement

It is not uncommon to read logs from Kubernetes and send them into different integrations (e.g., if I am reading the data from an Nginx Pod I want to use the NGINX integration etc.). To accomplish this goal we have at least two alternatives:

Autodiscoery works perfectly fine for Filebeat, where all kubernetes metadata fields are specified in the template, in the integration case it can lead to unwanted consequences, i.e., creation of unexpected fields and potentially mapping differences between (kubernetes.container_logs and other integration datasets).

Indeed, since the kubernetes dynamic template rules and static fields are not defined, we may end up in having different mappings in different datastreams. A couple of examples:

  1. kubernetes.pod.name: it has always been a keyword. However, if the data is rerouted to another integration it will result in the creation of kubernetes.pod.name.text field because of the dynamic template rule ecs_path_match_keyword_and_match_only_text

  2. The kubernetes.container_logs integration contains kubernetes.node.labels, kubernetes.node.annotations, kubernetes.namespace_labels, kubernetes.labels, kubernetes.namespace_annotations and kubernetes.selectors dynamic template rules that match every type and map them as keyword (note that nested fields like kuberntes.labels.test.foo are not accepted. The fields are dedoted by Beats https://www.elastic.co/guide/en/fleet/current/kubernetes-provider.html).

Potential Impact

  1. Creating unnecessary fields is a potential waste of resources and can slow down ingestion

  2. If users starts accidentally use the field kubernetes.pod.name.text to define their alerts or to query data they may unwillingly filter out the kubernetes.container_logs

  3. If for some reasons we have a label of type integer (not sure it can actually happen) we may have different mappings between NGINX and Kubernetes integration. Labels are used as identifier for Kubernetes and can be used to group together all logs for a given application. Having different mappings can make a field not usable in Kibana or not viable for certain types of queries.

    
    # This will be mapped as long
    POST logs-nginx.access-testissue/_doc?op_type=create
    {
    "@timestamp": "2024-01-01",
    "kubernetes": {
    "annotations": {
      "foo": 34
    }
    }
    }

This will be mapped as keyword

POST logs-kubernetes.container_logs-testissue/_doc?op_type=create { "@timestamp": "2024-01-01", "kubernetes": { "annotations": { "foo": 34 } } }


4. If for some reasons (it should not happen if using beats) one of the labels contain a subfield it may lead to the fact that the nginx integration will accept the field, while the kubernetes integration not.
Example:

kubernetes.annotations.foo will be indexed as an integer

POST logs-nginx.access-testissue/_doc?op_type=create { "@timestamp": "2024-01-01", "kubernetes": { "annotations": { "foo": { "bar": 34 } } } }

kubernetes.annotations.foo will not be indexed (the document will be rejected in 8.13.3)

POST logs-kubernetes.container_logs-testissue/_doc?op_type=create { "@timestamp": "2024-01-01", "kubernetes": { "annotations": { "foo": { "bar": 34 } } } }



**How similar fields are treated elsewhere**

If we take the example of `container` and `cloud` metadata, these are defined in every single integration. Because in the end these are metadata to identify the source object sending data. 

**Potential Solutions Identified so far**

- Add the kubernetes dynamic template rules and core kubernetes fields like `kubernetes.pod.name`, `kubernetes.namespace` in the template of each and every integration like `container` and `cloud` metadata

- Rely everywhere on the ecs@mapping component template to avoid mismatches between kubernetes and the other datasets (and hopefully find a way to map the kubernetes.pod.name and similar fields to keyword). As of now the ecs@mapping component template would map also host.name as text and keyword.

- Add an `infra` component template containing all infrastructure fields and their dynamic template rules counterparts (e.g., `host`, `cloud`, `kubernetes`) 

- Somehow reuse/nest the `logs-kubernetes.container_logs` into another component template

**Current Workaround**

The current workaround would be to define in each and every integration `@custom` template the fields. The workaround is viable if the amount of integration is small. Ideally, having a `global@custom`, (and especially) `logs@custom`, `logs-dataset@custom` component templates as described https://github.com/elastic/kibana/issues/149484 , would make the workaround viable even when using many integrations. 

CC @ruflin, @philippkahr, @flash1293 . Please chime-in if I forgot something.
ruflin commented 7 months ago

Thanks for the detailed write up @herrBez . For me the target is, that all data going into logs-*-* will have the correct default mappings, no matter if it is an integration or not. As you describe above, there are 2 problems:

Keyword with keyword.text

Having all .name fields a s a multi field with keyword / text came in as there were more and more scenarios where both were needed and we wanted to simplify things. For users that don't use it, it is a waist of resources but if you do, it is very powerful as it provides you full text search.

Few thoughts:

Kubernetes fields

The k8s fields are not aligned, as ECS does not have k8s fields but orchestrator fields. Also the orchestrator fields do not cover the full list of fields from k8s (fields can be found here). That we end up with different mappings for these fields is not acceptable.

It seems the list of fields that causes issues is a very limited set of fields and we need to pick one of the two behaviours. Is there a use of non keyword fields for labels and annotations? If not, keyword seems like the right default.

There is also a problem in data views in Kibana. Even though in the above scenario, different mappings should not exist, there are cases where in all logs logs-*-* it is valid to have different mapping types for the same field and not an error. For Elasticsearch it is not an error so it shouldn't be one for Kibana either. This issue has been raised here: https://github.com/elastic/kibana/issues/130242

Side note: The dedotting is not required anymore as we have introduced subobject false. This is a very recent addition.

herrBez commented 7 months ago

The k8s fields are not aligned, as ECS does not have k8s fields but orchestrator fields. Also the orchestrator fields do not cover the full list of fields from k8s (fields can be found here). That we end up with different mappings for these fields is not acceptable.

I never noticed it but the change is pretty huge 😅 . Having labels and annotations in the form of key:value instead of a separate fields is pretty different. (Side note: I don't understand why we are using singular orchestrator.label instead of plural orchestrator.labels). Further, it is different from how we manage container.labels (or in general labels) as of now.

I kind of like the idea of the orchestrator field because:

  1. the mapping will surely be smaller
  2. It is more in-line with how we use selectors with kubectl. For example, for kubectl get pod -L foo=bar, we could write orchestrator.label: "foo:bar" instead of kuberntes.labels.foo: bar, which feels more k8s-like (but less Kibana like).

It seems the list of fields that causes issues is a very limited set of fields and we need to pick one of the two behaviours. Is there a use of non keyword fields for labels and annotations? If not, keyword seems like the right default.

TLDR: I agree, keyword is the right call!

Long Answer: According to Kubernetes Documentation annotations value needs to strings. For labels we don't have a clear warning like for annotations, but I feel like they should be string as well. This issue seems to confirm it: https://github.com/kubernetes/kubernetes/issues/57509. Trying to create an object with a non-string labels will fails. Even if it would be allowed, labels are user-selectors and it may also happen that different teams would potentially use different types for the same labels. Opening space for additional problems. So, I sing keyword in any case is the right thing to do for the fields (and it is also what we are doing now with the kubernetes integration).

There is also a problem in data views in Kibana. Even though in the above scenario, different mappings should not exist, there are cases where in all logs logs-- it is valid to have different mapping types for the same field and not an error. For Elasticsearch it is not an error so it shouldn't be one for Kibana either. This issue has been raised here: https://github.com/elastic/kibana/issues/130242

While I appreciate the fact that having more precise datasets will avoid this kind of issues, I somehow feel it's kind of a betrayal of the original promise of ECS: be allowed to correlate different sources

Side note: The dedotting is not required anymore as we have introduced subobject false. This is a very recent addition.

Is it also in-use by the different integrations? Last time I checked it was not enabled

flash1293 commented 7 months ago

+1 to the orchestrator fields.

About

Change UI: Users are using it today, because the UI auto suggest it and it looks like it is the one you should use. If a user would just write an Elasticsearch query without auto complete, most users I assume would not add it because they don't know it is there in the first place. Would adjustments to the UI behaviour help here?

While possible, this doesn't seem like a good approach to me - if the .text field is there, it is actually quite helpful to query. If there isn't a use case, let's fix it on the mapping level.

ruflin commented 6 months ago

While I appreciate the fact that having more precise datasets will avoid this kind of issues, I somehow feel it's kind of a betrayal of the original promise of ECS: be allowed to correlate different sources

I'm talking here about fields which are not part of ECS. ECS is the foundation but users always have their own fields. It is ok if these are different in different datasets. Kibana should only complain, if the ECS fields are not the same.

if the .text field is there, it is actually quite helpful to query.

I wonder if most users know the difference between querying / filtering on a text vs keyword field? In quite a few cases, both are needed and we should support users in making the right choice.

What are our next steps here? I expect the kubernetes fields problem to stick around for a long time even if we recommend moving to orchestrator fields. One idea: Have a special mapping for labels to ensure these are ALWAYS keywords (and only keywords) in the ecs template?

flash1293 commented 6 months ago

What are our next steps here? I expect the kubernetes fields problem to stick around for a long time even if we recommend moving to orchestrator fields. One idea: Have a special mapping for labels to ensure these are ALWAYS keywords (and only keywords) in the ecs template?

You mean keyword-only for *.annotations? That would still leave the question of what to do with pod.name - switching *.name to keyword-only is probably going too far.

Another idea would be to always add logs@custom to the stack of component templates. This way, the user could specify these mappings in a central place, which doesn't completely automate the case, but at least makes it easier to manage.

ruflin commented 6 months ago

You mean keyword-only for *.annotations?

Pick the fields where it is required to have keyword only and ensure either flattened or not flattened is aligned with kubernetes integrations or swap it.

For .text, I would not change it but rather think how users could opt out or as described above, could change the UI slightly.

About logs@custom, I assume it is currently there for all non integration datasets but for integrations, it is not included? I agree, it should be there not only for this case.

flash1293 commented 6 months ago

About logs@custom, I assume it is currently there for all non integration datasets but for integrations, it is not included? I agree, it should be there not only for this case.

It is in the main logs template (non-integration logs streams in DSNS), but not for integrations:

Screenshot 2024-05-14 at 14 12 45

Should be a simple change though

ruflin commented 6 months ago

@kpollich ^

I assume priority wise it would sit after the @package template?

kpollich commented 6 months ago

I assume priority wise it would sit after the @package template?

This makes sense to me, the new composed_of order of component templates for logs data streams would then be

logs@mappings
logs@settings
logs-{integration.name}.{dataset}@package
logs@custom
logs-{integration.name}.{dataset}@custom
ecs@mappings
.fleet-global-1
.fleet_agent_id_verification-1

General methodology here as I understand it is "stack-defined mappings" -> "package defined mappings" -> "user defined mappings" -> "ecs mappings" -> "fleet metadata". Am I understanding the intent correctly here?

ruflin commented 6 months ago

the twist around ecs-mappings is that these are all dynamic. So if a more specific template existed before, it will still win (at least that is my understanding).

kpollich commented 6 months ago

I took a look at the code in Kibana to see if this would be a quick change, and it's a little more involved than I initially thought. We need to do a bit of a refactor because the @package and @custom templates are coupled, so inserting another template between them requires some reworking of the code.

https://github.com/elastic/kibana/blob/e76c5edb2a80c14f91b82bf02f29c6d9175ce5e8/x-pack/plugins/fleet/server/services/epm/elasticsearch/template/template.ts#L117-L125