Prevent duplicate _id events from reaching the replay queue

elastic / elastic-serverless-forwarder

Elastic Serverless Forwarder

Prevent duplicate _id events from reaching the replay queue #729

Open emilioalvap opened 3 weeks ago

emilioalvap commented 3 weeks ago

What does this PR do?

Fixes #677.

Checks the status codes in _bulk request responses to detect _id collisions and keeps the affected events out of the replay queue.
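For illustration only, a minimal sketch of that kind of check, not the code in this PR. It assumes the standard _bulk response shape (an "items" list whose entries are keyed by action name and carry a "status") and that an _id collision is reported with HTTP status 409. The function name split_bulk_failures is hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Assumption: Elasticsearch reports a version conflict (for example, a
# "create" action on an _id that already exists) with status 409.
CONFLICT_STATUS = 409


def split_bulk_failures(
    bulk_response: Dict[str, Any],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Return (to_replay, duplicates) for the failed items of a _bulk response."""
    to_replay: List[Dict[str, Any]] = []
    duplicates: List[Dict[str, Any]] = []

    for item in bulk_response.get("items", []):
        # Each item is a one-key dict: {"create": {...}}, {"index": {...}}, etc.
        for result in item.values():
            status = result.get("status", 0)
            if status < 300:
                continue  # the action succeeded, nothing to do
            if status == CONFLICT_STATUS:
                duplicates.append(result)  # already indexed: do not replay
            else:
                to_replay.append(result)  # any other failure may be retried

    return to_replay, duplicates
```

With this split, only the first list would be handed to the replay queue; the duplicates could simply be logged and dropped.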

Why is it important?

Checklist

[x] My code follows the style guidelines of this project
[x] I have commented my code, particularly in hard-to-understand areas
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding changes to the default configuration files
[x] I have added tests that prove my fix is effective or that my feature works
[ ] I have added an entry in CHANGELOG.md

emilioalvap commented 3 weeks ago

cc @bturquet

constanca-m commented 6 days ago

From my understanding of this PR, if ESF receives a document that was already sent to ES, it will skip it and continue execution, as opposed to stopping and returning an error as it does now. Can you confirm this? This seems OK to me.

Additionally, this PR needs to increase the ESF version and add an entry to the CHANGELOG.

gizas commented 5 days ago

From the implementation side, everything looks OK. The key point is the comment: "events that were so close to each other that they were given the same timestamp".

If we can verify that this does not happen, and we can guarantee that the timestamps created here are unique, then I think we are OK.

What is the timestamp's precision? I mean, we include milliseconds, right, as in here?
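To make the concern concrete, here is a small sketch of how millisecond precision can give two distinct events the same timestamp. It is a generic illustration, not how ESF builds its timestamps; the helper to_epoch_millis is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(ts: datetime) -> int:
    """Truncate a timezone-aware datetime to whole milliseconds since the epoch."""
    # Integer division of timedeltas avoids floating-point rounding.
    return (ts - EPOCH) // timedelta(milliseconds=1)


# Two events 400 microseconds apart fall inside the same millisecond.
first = datetime(2024, 1, 1, 12, 0, 0, 123_100, tzinfo=timezone.utc)
second = first + timedelta(microseconds=400)

same_timestamp = to_epoch_millis(first) == to_epoch_millis(second)
```

If an _id were derived only from such a timestamp, these two events would collide even though they are different documents.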

axw commented 4 days ago

If the timestamp is non-unique, then we would need to update how the _id field is computed. An option for that would be to hash the entire document, using something fast and non-cryptographic, like xxhash or murmur3.
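A rough sketch of the hashing option, under stated assumptions: the document is a JSON-serializable dict, it is serialized canonically (sorted keys) so that equal documents always hash the same, and the third-party xxhash package is used when installed. The fallback to hashlib.blake2b is only there to keep the sketch runnable with the standard library; it is a cryptographic hash, not the fast non-cryptographic kind suggested above. The function name document_id is hypothetical.

```python
import hashlib
import json
from typing import Any, Dict

try:
    import xxhash  # optional third-party dependency
except ImportError:
    xxhash = None


def document_id(document: Dict[str, Any]) -> str:
    """Derive a deterministic _id from the full content of a document."""
    # Canonical form: sorted keys and fixed separators, so key order and
    # whitespace do not change the resulting hash.
    canonical = json.dumps(
        document, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")

    if xxhash is not None:
        return xxhash.xxh3_128(canonical).hexdigest()
    # Standard-library stand-in (assumption, see the note above).
    return hashlib.blake2b(canonical, digest_size=16).hexdigest()
```

With an _id computed this way, two events that share a timestamp only collide when their entire content is identical, which is the case where skipping the duplicate is the desired outcome.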