elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Duplicate data collection between generic inputs and specific integrations #274

Open ChrsMark opened 2 years ago

ChrsMark commented 2 years ago

Today, we have generic inputs like Container Logs or Custom Logs. Based on dynamic variables and various providers, we can also set these generic inputs to collect logs from all of the log files in a specific path. For example, in Kubernetes (or similarly in Docker) we can set the path to /var/log/containers/*${kubernetes.container.id}.log and the input will be configured for every container based on its container.id value.

In addition we can specify

```yaml
- name: nginx
  type: nginx/logs
  use_output: default
  data_stream:
    namespace: default
  streams:
    - data_stream:
        dataset: nginx.access
        type: logs
      paths:
        - '/var/log/containers/*${kubernetes.container.id}.log'
      condition: ${kubernetes.labels.app} == 'nginx'
```

which is an nginx-specific input guarded by a condition. In the end this Pod's logs will be collected twice: once by the generic Container Logs input and once by the nginx-specific one.
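To make the overlap concrete, here is a toy Python sketch (not Elastic Agent code; the container id and file name are made up) showing that, after variable substitution, both inputs' path globs match the same log file:

```python
# Toy illustration: after ${kubernetes.container.id} substitution, the
# generic Container Logs input and the nginx-specific input end up with
# globs that match the same file, so the file is read twice.
from fnmatch import fnmatch

container_id = "abc123"  # hypothetical kubernetes.container.id
log_file = f"/var/log/containers/nginx-pod_default_nginx-{container_id}.log"

# Resolved path patterns of the two inputs:
generic_input_glob = f"/var/log/containers/*{container_id}.log"
nginx_input_glob = f"/var/log/containers/*{container_id}.log"

inputs_matching = [
    name
    for name, glob in [("container_logs", generic_input_glob),
                       ("nginx/logs", nginx_input_glob)]
    if fnmatch(log_file, glob)
]
print(inputs_matching)  # both inputs collect the same file
```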

cc: @mukeshelastic

ph commented 2 years ago

I see duplication of logs because we have two inputs collecting the same files, and because Agent (or even Beats) does not track reads of the same file in a unique registry.

But what would be the correct behavior or what do you think Elastic Agent should do in this case?

Is this something that should be solved by making sure the data is collected once from the container logs, with the necessary fields attached to each event so that a routing pipeline can make the right decisions? @ruflin ?

ruflin commented 2 years ago

The ideal solution from my perspective is the routing pipeline. What we should not do is try to detect in Elastic Agent that we collect a log file twice, as in the case above. The reason is that we should keep inputs independent of each other.

At the same time, the above is a problem we should have a solution for also in the short term. Any ideas on how we could do this?

ph commented 2 years ago

@ruflin Actually, with the changes to filestream and IDs, reading the same files multiple times is now a lot easier; I should say possible. Before moving to filestream it wasn't possible to read the same file multiple times; I believe we were logging an error.

The only thing the agent or the input logic could do is consolidate log paths, but I suspect this would be fragile and harder to maintain. Also, which options would take precedence? This makes me follow the same path as @ruflin: inputs are independent, and the agent that creates them should consider them as independent.

Should we invest instead in making sure the events have the appropriate fields, and investigate the routing pipeline?
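The routing idea under discussion can be sketched in Python (a hedged toy, not an actual Elasticsearch ingest pipeline; the data stream names are illustrative): collect each file once, attach Kubernetes metadata to every event, and let a routing step pick the destination from those fields:

```python
# Toy sketch of field-based routing: each event carries Kubernetes metadata,
# and a routing step chooses the destination data stream from it.
def route(event):
    labels = event.get("kubernetes", {}).get("labels", {})
    if labels.get("app") == "nginx":
        # route to the integration-specific data stream (illustrative name)
        return "logs-nginx.access-default"
    # everything else stays in the generic container logs stream
    return "logs-kubernetes.container_logs-default"

print(route({"kubernetes": {"labels": {"app": "nginx"}}}))
```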

I'll keep the issue open as a discussion and remove the bug label.

@mukeshelastic @ChrsMark I would be interested to read your perspective on this.

ChrsMark commented 2 years ago

Routing pipeline sounds promising to me.

However, I would take a step back and think again about the UX in this specific case. So what's the problem that we want to solve here? "We want to collect all containers' logs by default, and for some of them we want to handle them using integrations". In Beats we do this quite easily by using condition-based autodiscovery, falling back to the default container input+path when the conditions do not match. For example:

```yaml
filebeat.autodiscover:
  providers:
    - type: kubernetes
      templates:
        - condition:
            equals:
              kubernetes.container.image: "redis"
          config:
            - module: redis
              log:
                input:
                  type: container
                  paths:
                    - /var/log/containers/*-${data.kubernetes.container.id}.log
```

This would mean that a Pod matching the kubernetes.container.image: "redis" condition will be handled by the redis module, and everything else will be ignored, unless we set a generic condition as a fallback to the default container input+path (or add this fallback as a feature, similar to what we have for hints defaults). The idea is better illustrated for hints: https://github.com/elastic/beats/blob/main/deploy/kubernetes/filebeat-kubernetes.yaml#L23-L31
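The fallback behaviour described above can be modelled with a short Python toy (function and field names are illustrative, not Beats internals): the first template whose condition matches wins, and anything that matches no template falls back to the default container input:

```python
# Toy model of condition-based autodiscovery with a fallback: templates are
# (condition, config) pairs; the first matching condition wins, otherwise
# the generic container input over the container's own log file applies.
def pick_config(container, templates):
    for condition, config in templates:
        if condition(container):
            return config
    # fallback: default container input+path
    return {"type": "container",
            "paths": [f"/var/log/containers/*-{container['id']}.log"]}

templates = [
    (lambda c: c["image"] == "redis", {"module": "redis"}),
]

print(pick_config({"image": "redis", "id": "a1"}, templates))
print(pick_config({"image": "nginx", "id": "b2"}, templates))
```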

In Agent+Fleet I would expect something similar, like: "Enable the generic container input for all Pods/containers, and be able to define specific integrations based on conditions". Could that be defined as part of the container_logs data_stream?

ChrsMark commented 2 years ago

I think this issue might be solved by hints based autodiscovery: https://github.com/elastic/elastic-agent/issues/662#issuecomment-1173827225.

Something like the following would indicate, via hints, to use the specific integration's log data streams, and otherwise fall back to the generic container_logs input:

```yaml
annotations:
  co.elastic.hints/package: redis
  co.elastic.hints/logs: enable
  co.elastic.hints/data_streams: info, key
  co.elastic.hints/host: '${kubernetes.pod.ip}:6379'
  co.elastic.hints/info.period: 1m
  co.elastic.hints/key.period: 10m
```

An annotation like co.elastic.hints/logs: enable indicates that logs should be collected using the package: either the specific data_stream(s) if defined, or the default log data_stream if the package defines one. If a user defines this, they should make sure the specific package has support for logs; otherwise the logs of this container will be skipped completely.
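A simplified Python sketch of that hints resolution (not agent code; the returned data stream names and the default log data stream are assumptions):

```python
# Simplified hints resolution: logs are collected via the package only when
# co.elastic.hints/logs is enabled; otherwise we fall back to the generic
# container_logs input. data_streams handling is collapsed for brevity.
def resolve_log_streams(annotations):
    package = annotations.get("co.elastic.hints/package")
    if package and annotations.get("co.elastic.hints/logs") == "enable":
        streams = annotations.get("co.elastic.hints/data_streams")
        if streams:
            return [f"{package}.{s.strip()}" for s in streams.split(",")]
        return [f"{package}.log"]  # assumed default log data stream
    return ["kubernetes.container_logs"]

print(resolve_log_streams({"co.elastic.hints/package": "redis",
                           "co.elastic.hints/logs": "enable"}))
```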

ChrsMark commented 2 years ago

Without hints we can still handle it:

In standalone mode, when a user adds an input block with a condition like condition: ${kubernetes.labels.app} == 'nginx', they would also need to add condition: ${kubernetes.labels.app} != 'nginx' in the container_logs input section.

In managed mode, Fleet UI could apply such "smart" logic: whenever a user adds an integration with a condition like condition: ${kubernetes.labels.app} == 'nginx', Fleet UI can detect it and automatically add condition: ${kubernetes.labels.app} != 'nginx' to the container_logs section in the same policy (if container_logs exists). Fleet UI has access to every policy, so such smart automations could take place. I'm not sure how many "AND"s a condition can support, but I guess we could handle it accordingly.
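The suggested automation could be sketched like this in Python (purely illustrative; the naive negate helper only handles simple equality conditions, and a real implementation would need to parse the condition syntax):

```python
# Sketch of the "smart" Fleet-side logic: when an integration input carries
# a condition, AND its negation onto the generic container_logs input so the
# two inputs never overlap.
from typing import Optional

def negate(condition: str) -> str:
    # naive negation: only flips the first simple equality
    return condition.replace("==", "!=", 1)

def exclude_from_container_logs(existing: Optional[str],
                                integration_condition: str) -> str:
    negated = negate(integration_condition)
    return f"({existing}) and ({negated})" if existing else negated

print(exclude_from_container_logs(None, "${kubernetes.labels.app} == 'nginx'"))
# prints: ${kubernetes.labels.app} != 'nginx'
```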

cc: @ph @ruflin @mukeshelastic