elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats
Other
112 stars 4.93k forks source link

Use fingerprint file identity by default and migrate all existing filestream inputs to it #40197

Open belimawr opened 4 months ago

belimawr commented 4 months ago

We started seeing cases where the filestream-monitoring will log an entry stating the Elastic-Agent log file was truncated:

{
  "log.level": "info",
  "@timestamp": "2024-07-05T15:42:27.135Z",
  "message": "File was truncated. Reading file from offset 0. Path=/var/lib/elastic-agent/data/elastic-agent-8.14.1-1348b9/logs/elastic-agent-20240705-25.ndjson",
  "component": {
    "binary": "filebeat",
    "dataset": "elastic_agent.filebeat",
    "id": "filestream-monitoring",
    "type": "filestream"
  },
  "log": {
    "source": "filestream-monitoring"
  },
  "service.name": "filebeat",
  "id": "filestream-monitoring-agent",
  "path": "/var/lib/elastic-agent/data/elastic-agent-8.14.1-1348b9/logs/elastic-agent-20240705-25.ndjson",
  "ecs.version": "1.6.0",
  "log.logger": "input.filestream",
  "log.origin": {
    "file.line": 300,
    "file.name": "filestream/input.go",
    "function": "github.com/elastic/beats/v7/filebeat/input/filestream.(*filestream).openFile"
  },
  "source_file": "filestream::filestream-monitoring-agent::native::524773-66308",
  "state-id": "native::524423-66428"
}

We know the Elastic-Agent does not truncate files, the only reason for Filebeat to detect a truncation is if inodes are re-used.

The best way to prevent inode reuse from affecting the monitoring logs is to use a file identity that won't be re-used, the best option for that is the fingerprint file identity.

We should either start using the fingerprint file identity or at least make it configurable so users facing this issue can work around the inode reuse.

One very important thing to bear in mind is that if we change the file identity, all existing files will be considered new and re-ingested.

cmacknz commented 4 months ago

Is there a reason we can't change the file identity starting from a specific point in the registry at a conceptual level?

Can the file identity used for a registry entry be included in the entry itself, so we can deal with a transition period? That is the configured file identity becomes used for new events, but we can still deal with old events?

This would let us switch to fingerprint everywhere without concern about data duplication.

CC @rdner

belimawr commented 4 months ago

Is there a reason we can't change the file identity starting from a specific point in the registry at a conceptual level?

At a conceptual level, no there isn't a reason.

The file identity is part of the key we use in the registry, we have some abstractions to generate the key from a "source", in simple terms it concatenates: input type, input ID, file identity and the actual identifier for the file, e.g: filestream::filestream-monitoring-agent::native::524042-42000. The parts of the code responsible for generating registry key and fetching data from the registry are not really connected. When fetching an entry from the registry we need to know the file identity that was used to generate the key.

Because the input type and the input ID works as a sort of namespace in the registry, an input can fetch all of its entries, check whether the file identity is different, if it is we open the file, calculate the new file identity and save it in the registry as a new entry. This will even work for the autodiscover case where a input instance is created for each file.

After the migration we can proceed with a normal execution.

Can the file identity used for a registry entry be included in the entry itself, so we can deal with a transition period?

Not without re-writing the whole registry and interactions with it, but I believe the migration I mentioned above will work well.

ycombinator commented 4 months ago

Because the input type and the input ID works as a sort of namespace in the registry, an input can fetch all of its entries, check whether the file identity is different, if it is we open the file, calculate the new file identity and save it in the registry as a new entry. This will even work for the autodiscover case where a input instance is created for each file.

@belimawr Please correct me if I'm wrong but the above seems to be the key migration work that needs to be implemented in Beats to resolve this issue? If so, we can transfer this issue to the Beats repository.

cmacknz commented 4 months ago

I transferred it, and also renamed the issue since I think the value of migrating everything to fingerprint by default is much higher than the original title specific to the filestream-monitoring instance.

elasticmachine commented 4 months ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

belimawr commented 4 months ago

I transferred it, and also renamed the issue since I think the value of migrating everything to fingerprint by default is much higher than the original title specific to the filestream-monitoring instance.

What would be "all filestream inputs" here? All started by the Elastic-Agent? By integrations? Or it becomes more focused on implementing the option to switch file identity without losing the state?

cmacknz commented 4 months ago

It means all of them regardless of whether they run in Beats or Elastic Agent, unless there is a reason we think non-fingerprint file identities are superior in some cases.

This would perhaps be better titled, "Use fingerprint file identity by default and migrate all existing filestream inputs to it".

jlind23 commented 2 months ago

@pierrehilbert I think you can plan this one as soon as possible in order to make it land in 9.0 and give it enough time to soak in test environment.

jlind23 commented 6 days ago

@belimawr looks like this is already under implementation, is there a PR we can link to this issue?

belimawr commented 6 days ago

@belimawr looks like this is already under implementation, is there a PR we can link to this issue?

Yes, it's in draft because I just pushed code that works, but still needs work and tests. Honestly, I only pushed it to run the whole test suit and see where it breaks.

I'll update the title and add this issue in the related issue section today.