[Fleet] Duplicate stream id with kubernetes provider #129851

Closed. nchaulet closed this issue 2 years ago.

nchaulet commented 2 years ago

Description

The Kubernetes provider dynamically creates a new stream for each container. For example, with the following config, every resulting stream will have the same id:

streams:
      - id: filestream-kubernetes.container_logs
        data_stream:
          dataset: kubernetes.container_logs
          type: logs
        prospector.scanner.symlinks: true
        paths:
          - '/var/log/containers/*${kubernetes.container.id}.log'
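For illustration, a rendered policy for two discovered containers would then contain two streams whose ids collide (the container ids below are invented):

streams:
      - id: filestream-kubernetes.container_logs # duplicate id
        paths:
          - '/var/log/containers/*0c54af11.log'
      - id: filestream-kubernetes.container_logs # duplicate id
        paths:
          - '/var/log/containers/*9b1d22e7.log'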

In a standalone (non-managed) policy it is possible to work around the problem by providing an id that includes a dynamic variable, like this:

streams:
      - id: >-
          filestream-kubernetes.container_logs-${kubernetes.container.id}
        data_stream:
          dataset: kubernetes.container_logs
          type: logs
        prospector.scanner.symlinks: true
        paths:
          - '/var/log/containers/*${kubernetes.container.id}.log'

How should we solve this for managed policies?

Potential solutions

Fleet allows configuring the stream id

We use the id in Fleet to match streams, so we could provide a new id_suffix option (or something similar) to keep the base id stable while making the final id unique; see the sketch below.
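A minimal sketch of what that could look like, assuming a hypothetical id_suffix key (this option does not exist today):

streams:
      - id: filestream-kubernetes.container_logs
        id_suffix: '${kubernetes.container.id}' # hypothetical option, resolved per container
        data_stream:
          dataset: kubernetes.container_logs
          type: logs
        prospector.scanner.symlinks: true
        paths:
          - '/var/log/containers/*${kubernetes.container.id}.log'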

The Kubernetes provider creates a dynamic id

I think this could be a more user-friendly solution.

elasticmachine commented 2 years ago

Pinging @elastic/fleet (Team:Fleet)

jlind23 commented 2 years ago

@nchaulet As discussed, I do think this should be fixed in the integration itself rather than handing this feature to the user. If we allow users to set an ID manually, we may end up in the same situation if that ID is never updated. @ChrsMark how hard would it be to fix it on your side rather than through the UI?

cc @ph @joshdover

ChrsMark commented 2 years ago

Can we make the proposal of "fixing it on our side" clearer, please?

In addition, what we're missing here is that this is not a problem for the Kubernetes integration only, but for every integration/package that uses the "dynamic variable resolution" + "provider" features to populate inputs. For example, at the moment this can happen with the docker provider too, or any other provider like env, local dynamic, etc. So even if we fix it at the Kubernetes integration level, we still miss the fix for other integrations that exist now or will exist in the future.

In this regard, I believe we should avoid pushing the fix up to integration developers and rather fix it in the core. In the end, this id is just a technicality; it would be quite confusing to expose it and make it configurable at higher levels.

Let me know what you think.

jlind23 commented 2 years ago

> ...and rather fix it in the core.

Does "the core" here stand for Beats? Elastic Agent and Beats are "dumb": they take a configuration from the Fleet UI and pass it on to each and every underlying process currently running. I think it should be fixed in the integration itself.

> "dynamic variable resolution" + "provider"

This implies an issue as soon as you are ingesting multiple files from the same pod. Couldn't we just add an extra parameter to this ID to solve the problem? If it has to be fixed for every integration doing this, then we should do it.

ChrsMark commented 2 years ago

> Does "the core" here stand for Beats? Elastic Agent and Beats are "dumb": they take a configuration from the Fleet UI and pass it on to each and every underlying process currently running. I think it should be fixed in the integration itself.

If I am not mistaken, the generation of this ID happens in Fleet and is just a hidden implementation detail that flows down through filestream to Beats. In that case, how would it be solved at the integration level? Can you make the proposal more specific so we can better evaluate it?

> This implies an issue as soon as you are ingesting multiple files from the same pod. Couldn't we just add an extra parameter to this ID to solve the problem? If it has to be fixed for every integration doing this, then we should do it.

To my mind this is hacky and does not scale. How could we prevent users and developers from forgetting to take care of this detail in the future? At the moment a user can even break the custom_logs integration by adding a reference to a dynamic variable in the paths section. So I don't think we can have a clear and robust solution at the integration level. Fleet should be able to detect that an input references dynamic variables and add the suffix to the ID accordingly (or something similar).

jlind23 commented 2 years ago

@nchaulet @joshdover could you please chime in here?

> If I am not mistaken, the generation of this ID happens in Fleet and is just a hidden implementation detail that flows down through filestream to Beats. In that case, how would it be solved at the integration level? Can you make the proposal more specific so we can better evaluate it?

Sorry, I thought the "dynamic variable resolution" + "provider" step was done by the integration itself and not by Fleet.

mlunadia commented 2 years ago

@ruflin this feels like a technical leadership decision about where such logic should live. As mentioned by @ChrsMark, I also support aiming for a future-proof approach where this logic is not part of the integration. WDYT?

ruflin commented 2 years ago

I don't think we should expose this to the user in Fleet. It is hard to explain to the user why this is needed in the first place; we need to fix it for them. The same is true in the standalone case: it is tricky for the user to get right.

This hints at fixing it in Elastic Agent, but more importantly, let's focus on the potential solutions before deciding where the fix must live.

mlunadia commented 2 years ago

100% agree on not exposing it to the user. I also think integrations should ideally not be massaging data, to avoid breaking issues down the line that can be harder to track or debug; but if there's a critical reason for it, let's chat about it.

belimawr commented 2 years ago

> This hints at fixing it in Elastic Agent, but more importantly, let's focus on the potential solutions before deciding where the fix must live.

++ to that.

What we really need is to ensure that Filebeat never receives multiple inputs with the same ID (we're focusing on filestream at the moment, but we should probably generalise this concept to avoid similar issues in the future), and that IDs stay stable across restarts and re-configurations.

It seems any autodiscovery could lead to this situation, where a policy is used as a template by Elastic Agent to configure Filebeat. One possible solution is to take an intrinsic identifier of the entity being discovered (the container ID in the Kubernetes case) and append it to the input ID.

Given that, Elastic Agent seems to be the component holding all the information necessary to implement this.

belimawr commented 2 years ago

I also totally agree with not exposing it to the user. It's going to be pretty hard for users to get right in all situations.

jlind23 commented 2 years ago

If we put the logic at the agent level, it means we will have to store it somewhere and make it persistent. Otherwise, as soon as the agent is restarted, the input ID will either be lost or generated again.

But it means the agent will have to parse the entire policy and then apply some magic to the ID sent from Fleet, am I correct?

What happens if the integration config is modified? If we change the path we are reading logs from, does the agent have to regenerate the ID?

cc @ph @blakerouse as this conversation may have an impact on the Elastic Agent

belimawr commented 2 years ago

> If we put the logic at the agent level, it means we will have to store it somewhere and make it persistent. Otherwise, as soon as the agent is restarted, the input ID will either be lost or generated again.

I don't believe so. Kubernetes is a good example: if we use the container ID, the agent does not need to store it; it just needs to know how to use it for Kubernetes. The same goes for Docker containers. I'm assuming any autodiscover is capable of providing an intrinsic and stable ID for the thing it's discovering (like the container ID for Kubernetes/Docker).

> But it means the agent will have to parse the entire policy and then apply some magic to the ID sent from Fleet, am I correct?

Yes and no. It does not need to be the whole policy, only the bits that are used as a template. As far as I understand, for Kubernetes autodiscover the agent is the one running the autodiscover; for each new container it takes the "template" set in the policy, resolves the variables there (like the container ID in the file path), and sends the result to Filebeat. Those are the cases where the agent needs to modify the "policy" before sending it to Filebeat.

> What happens if the integration config is modified? If we change the path we are reading logs from, does the agent have to regenerate the ID?

No, I expect the final ID to stay the same, something like: <whatever is in the policy>-<agent-added suffix based on autodiscover type>.

So for the current Kubernetes integration the policy ID is filestream-kubernetes.container_logs (see the first comment), and the agent would add the suffix, making the final ID filestream-kubernetes.container_logs-${kubernetes.container.id}. There is no need for the agent to store any information about those inputs; it can always regenerate the same ID in a stateless and stable manner. A sketch of the rendered result is below.
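For one discovered container, the rendered stream under this proposal might look like the following (the container id value is invented for illustration):

streams:
      - id: filestream-kubernetes.container_logs-0c54af11 # base id + agent-added container id
        data_stream:
          dataset: kubernetes.container_logs
          type: logs
        prospector.scanner.symlinks: true
        paths:
          - '/var/log/containers/*0c54af11.log' # variable already resolved by the agent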

ph commented 2 years ago

@belimawr If I follow your last paragraph, nothing is required from Elastic Agent; it would be up to the author of the integration (or template) to ensure that the ID used is unique enough. Or are you expecting the agent to create it automatically?

belimawr commented 2 years ago

Yes, both approaches (doing it in the integration or in the agent) solve the problem; however, how can we ensure all integrations using the filestream input set unique and stable IDs? That's where solving it on the agent side has an advantage.

However, I agree it's an ugly solution to have the agent fiddling with the policy/template.

cmacknz commented 2 years ago

Yes, the ideal solution solves this in every case (Fleet and standalone), with or without autodiscover involved.

We have the problem that we need stable and unique IDs for filestream to function as desired, but we have no way of generating IDs automatically and remembering them. Right now we expect the user to do this and store the ID in the config, but they often don't, and forcing them to by refusing to start or apply changes would be a breaking change.

Is there something we could use or build that would let us automatically assign IDs to inputs, and store them in Elasticsearch to look up later?

ph commented 2 years ago

Could we express the ID as a mathematical function and generate it from that? We might be able to create a definition that is stable based on the input data; this would work well with dynamic providers. Something like i(x), where x is the input definition:

i(x) = container.id + p(x.paths) + d(x.data_stream.type, x.data_stream.dataset, x.data_stream.namespace) + other_settings(x.other_things)

ChrsMark commented 2 years ago

Can someone answer the question of which component is generating the ID in the current implementation? Is Fleet generating it when creating the policy?

belimawr commented 2 years ago

> Could we express the ID as a mathematical function and generate it from that? We might be able to create a definition that is stable based on the input data; this would work well with dynamic providers. Something like i(x), where x is the input definition:
>
> i(x) = container.id + p(x.paths) + d(x.data_stream.type, x.data_stream.dataset, x.data_stream.namespace) + other_settings(x.other_things)

That's not quite possible. The ID is what allows an input to know which entries in the registry belong to it (i.e., the file offsets), so adding or removing files must not affect the ID. Ideally, changing "other settings" would not affect it either. If the ID changes, all files will be read from the beginning.

The current data duplication happens during input start-up, when the input looks in the registry for entries belonging to it for files it is no longer harvesting (i.e., files that have been removed from the config). It probably also happens in some registry clean-up that runs on a schedule, but I have never tested that case.

Another solution would be for the agent to always disable the old-files clean-up (set clean_removed: false), but I'm not quite sure how this would affect the performance and size of the registry in the long run. A sketch is below.
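As a rough sketch, that would mean the agent rendering every filestream stream with the existing clean_removed option turned off, something like:

streams:
      - id: filestream-kubernetes.container_logs
        clean_removed: false # keep registry state for files that drop out of the config
        data_stream:
          dataset: kubernetes.container_logs
          type: logs
        paths:
          - '/var/log/containers/*${kubernetes.container.id}.log'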

Thinking strictly about the Elastic Agent use case of Filebeat, this clean-up feature might not be doing much aside from duplicating data. I need to investigate the codebase more to better understand how the clean-up after input initialisation works.

If Filebeat knows it's running under Elastic Agent, and it knows a filestream config is coming from the agent, then we could disable this clean-up within Filebeat. With input IDs as ephemeral as the data source (containers and their logs), disabling the clean-up might not be too bad (or not bad at all). I can give this solution a shot. It won't solve the problem for standalone Filebeat, but that's a discussion for another thread.

kvch commented 2 years ago

> With input IDs as ephemeral as the data source (containers and their logs), disabling the clean-up might not be too bad (or not bad at all).

These states have to be cleaned up. We have seen many issues in Filebeat with the log input: the input cannot clean up old container states, and the registry grows too big. As you mentioned, the states are ephemeral; Filebeat constantly encounters more and more containers. We cannot let the registry fill up with stale garbage states.

kvch commented 2 years ago

Putting it here as well:

In the filestream input, you can provide a custom suffix to input IDs based on the configuration. We could add a custom suffix to the input ID for dynamic inputs: https://github.com/elastic/beats/blob/main/filebeat/input/filestream/prospector_creator.go#L44

I am not yet 100% sure, but I think we can pass the ${kubernetes.container.id} variable to the state store so that the states of dynamic inputs are separated. Then there is no need for extra heuristics or guessing about which input a file belongs to.

jlind23 commented 2 years ago

@kvch it means the fix will live in the input, right?

jlind23 commented 2 years ago

After discussing with @ph and @cmacknz, closing this in favour of https://github.com/elastic/beats/issues/31512, as the Elastic Agent Data Plane team will own the fix.