Filestream Fingerprint Mode: fingerprint should include filename

kbujold commented 4 months ago

Hi, we wanted to switch to fingerprint mode with filestream, as described in "Introducing Filestream fingerprint mode".

We have found an issue with ID creation. If two files have exactly the same content but different filenames, fingerprint mode treats them as having the same file ID. Not all logs on a system contain a timestamp, so this is problematic.

2024-02-13T20:56:24.39360761Z stderr F {"log.level":"warn","@timestamp":"2024-02-13T20:56:24.393Z","log.logger":"scanner","log.origin":{"file.name":"filestream/fswatch.go","file.line":394},"message":"\"/var/log/test2.log\" points to an already known ingest target \"/var/log/test1.log\" [0e42b7661026c2eb37c8597817e38366eef19794fb5bb60e143823c800658fb9==0e42b7661026c2eb37c8597817e38366eef19794fb5bb60e143823c800658fb9]. Skipping","service.name":"filebeat","ecs.version":"1.6.0"}

For example

cat /var/log/k8s-account-creation-script.log
Token already created, skiping account and token file creation.
Token already created, skiping account and token file creation.

Would it be possible to have the filename be part of the ID creation?
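To make the collision concrete, here is a minimal sketch of the idea, not Filebeat's actual code. It assumes the fingerprint is a SHA-256 hash over a fixed byte range of the file; the function names and the `length=64` value are hypothetical and chosen only for this illustration. The second function shows what mixing the path into the hash might look like.

```python
import hashlib
import os
import tempfile


def content_fingerprint(path, offset=0, length=64):
    """SHA-256 over a fixed byte range of the file; the name is ignored."""
    with open(path, "rb") as f:
        f.seek(offset)
        return hashlib.sha256(f.read(length)).hexdigest()


def name_aware_fingerprint(path, offset=0, length=64):
    """Hypothetical variant: hash the absolute path together with the bytes."""
    h = hashlib.sha256()
    h.update(os.path.abspath(path).encode("utf-8"))
    h.update(b"\0")  # separator between the path and the content bytes
    with open(path, "rb") as f:
        f.seek(offset)
        h.update(f.read(length))
    return h.hexdigest()


def demo():
    """Write two identically filled files and compare both kinds of ID."""
    line = b"Token already created, skiping account and token file creation.\n"
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "test1.log")
        b = os.path.join(d, "test2.log")
        for p in (a, b):
            with open(p, "wb") as f:
                f.write(line * 2)
        same_by_content = content_fingerprint(a) == content_fingerprint(b)
        same_by_name = name_aware_fingerprint(a) == name_aware_fingerprint(b)
        return same_by_content, same_by_name
```

In this sketch `demo()` reports that the content-only IDs match while the name-aware IDs differ. The tradeoff is that a path-dependent ID changes whenever the file is renamed or rotated.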

Thank you, Kris

strawgate commented 3 months ago

Hi,

The reason we fingerprint the file is that the path and name are not reliable ways to track a cursor for an input file.

Can you use an alternative identification method like the path or inode_marker methods?

daveneeley commented 1 month ago

Having also looked into fingerprint mode, there just isn't a perfect file identity method. Each method has tradeoffs. If it were possible, I would combine methods to get closer to uniqueness, but Filebeat does not support this.

The scenario could be quite common in a large Kubernetes environment where multiple pods have the same startup routine (and no timestamps). Following that logic, it's in the realm of possibility that pods in different namespaces for different tenants (but running on the same node) could have the same fingerprint. Crazy things happen. :)

The filepath or inode has to be involved at some point, right? Once the contents of two log files that share the same fingerprint have diverged, which one continues to be tracked? After a filebeat pod is restarted or replaced with a new generation, which file gets tracked now?
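A small sketch of why divergence alone may not resolve the ambiguity, assuming the fingerprint only hashes a fixed byte range at the start of the file. The names and the 64-byte range are hypothetical, picked for illustration; this is not Filebeat's implementation.

```python
import hashlib


def range_fingerprint(data: bytes, offset: int = 0, length: int = 64) -> str:
    """SHA-256 over data[offset:offset+length] only; later bytes are ignored."""
    return hashlib.sha256(data[offset:offset + length]).hexdigest()


# Two pods with the same startup routine (96 shared bytes), then different output.
SHARED = b"starting up\n" * 8
POD_A = SHARED + b"tenant A request\n"
POD_B = SHARED + b"tenant B request\n"


def demo_divergence() -> bool:
    """True if the two logs still share an ID after their contents diverge."""
    return range_fingerprint(POD_A) == range_fingerprint(POD_B)
```

Under these assumptions the two logs keep the same ID after they diverge, because the differing bytes lie outside the hashed range. Widening the range (for example `length=200` here) separates them, but only for files whose shared prefix is shorter than the range.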

It's a complicated problem. Being able to say "I have two files with the same fingerprint, but their initial paths are different (not their inodes), so I will treat them as different" would add value.

elastic / beats

Filestream Fingerprint Mode: fingerprint should include filename #38003