elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

[aws] feat: aws-s3 input registry cleanup for untracked s3 objects #41694

Open Kavindu-Dodan opened 2 days ago

Kavindu-Dodan commented 2 days ago

Proposed commit message

This PR partially addresses https://github.com/elastic/beats/issues/39116 by introducing a registry cleanup strategy for the aws-s3 input.

The cleanup implemented here removes registry entries for S3 objects that are no longer returned (that is, no longer tracked) when the bucket is listed during polling. Untracked objects are removed from both the local state and the internal store (backed by the registry) to reduce memory usage.
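To illustrate the idea, here is a minimal sketch of the cleanup step. It is not the code in this PR: the `state` and `registry` types and the `cleanUntracked` method are hypothetical, simplified stand-ins, and the listing is assumed to arrive as a set of object keys.

```go
package main

import "fmt"

// state is a hypothetical, simplified registry entry for one S3 object.
type state struct {
	Key    string
	Stored bool
}

// registry is a hypothetical in-memory stand-in for the input's state store.
type registry struct {
	states map[string]state
}

// cleanUntracked deletes every entry whose key is missing from the latest
// bucket listing and returns how many entries were removed.
func (r *registry) cleanUntracked(listed map[string]struct{}) int {
	removed := 0
	for key := range r.states {
		if _, ok := listed[key]; !ok {
			// Deleting from a map while ranging over it is safe in Go.
			delete(r.states, key)
			removed++
		}
	}
	return removed
}

func main() {
	r := &registry{states: map[string]state{
		"logs/a.json": {Key: "logs/a.json", Stored: true},
		"logs/b.json": {Key: "logs/b.json", Stored: true},
		"logs/c.json": {Key: "logs/c.json", Stored: true},
	}}

	// Only b.json is still present in the bucket listing.
	listed := map[string]struct{}{"logs/b.json": {}}

	removed := r.cleanUntracked(listed)
	fmt.Println("removed:", removed, "remaining:", len(r.states))
}
```

In this sketch the listing is collected into a set first so that each registry entry is checked with a single map lookup.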

Note that this only helps when S3 objects are removed (e.g., by a lifecycle policy) and are no longer available. A follow-up should cover cases where such removal is not done at the bucket. For example, this could be done by,

Checklist

How to test this PR locally

Screenshots

Given below are pprof analysis comparisons for ~4K objects in the registry, and after they were cleaned up by removing the S3 objects (emptying the bucket).

image

image

Related issues

elasticmachine commented 2 days ago

Pinging @elastic/obs-ds-hosted-services (Team:obs-ds-hosted-services)

Kavindu-Dodan commented 1 day ago

> Nice work @Kavindu-Dodan! One quick question first: should we also introduce a clean_removed parameter, just to match the filestream input?

For the s3 input, if clean_removed were set to false, we would skip registry cleanup even for S3 objects that are no longer visible when listing through the ListObjectsV2 API. This could again overload the in-memory state store. Besides, there is a core difference between the filestream input and s3. In the filestream input, we can continuously get new events as a file gets updated, whereas in s3 we mark an object as processed as soon as we have processed it and received acks for all events generated from it. So, personally, I do not think we need to introduce clean_removed for the s3 input unless we see another use for it. Let me know your view.