Open Kavindu-Dodan opened 2 days ago
Pinging @elastic/obs-ds-hosted-services (Team:obs-ds-hosted-services)
Nice work @Kavindu-Dodan! One quick question first: should we also introduce a `clean_removed` parameter, just to match the filestream input?
For the s3 input, if `clean_removed` is set to false, we would skip registry cleanup even for S3 objects that are no longer visible when listing through the ListObjectsV2 API. This can again overload the in-memory state store. Besides, there is a core difference between the filestream input and s3. With filestream, we can continuously get new events as files get updated, whereas in s3 we mark an object as processed as soon as we process it and get acks for all events generated from it. So, personally, I do not think we need to introduce `clean_removed` for the s3 input unless we see another use for it. Let me know your view.
Proposed commit message
This PR partially addresses https://github.com/elastic/beats/issues/39116 by introducing a registry cleanup strategy for aws-s3 input.
The cleanup implemented here removes registry entries if the S3 object is no longer available (that is, no longer tracked) when listing inside the polling lookup. The cleanup removes objects that are not tracked from both the local state and the internal store (backed by the registry) to reduce memory usage.
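The cleanup described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the `registry` type, its `local` and `persisted` maps, and the `cleanUp` method are hypothetical names standing in for the input's local state and its registry-backed store.

```go
package main

import (
	"fmt"
	"sort"
)

// registry is a hypothetical stand-in for the aws-s3 input state.
// local models the in-memory state; persisted models the internal
// store backed by the registry. Both are keyed by S3 object key.
type registry struct {
	local     map[string]struct{}
	persisted map[string]struct{}
}

// newRegistry builds a registry that already tracks the given keys.
func newRegistry(keys ...string) *registry {
	r := &registry{
		local:     make(map[string]struct{}, len(keys)),
		persisted: make(map[string]struct{}, len(keys)),
	}
	for _, k := range keys {
		r.local[k] = struct{}{}
		r.persisted[k] = struct{}{}
	}
	return r
}

// cleanUp drops every tracked key that did not appear in the latest
// bucket listing, from both maps, and returns the removed keys sorted.
func (r *registry) cleanUp(listed []string) []string {
	seen := make(map[string]struct{}, len(listed))
	for _, k := range listed {
		seen[k] = struct{}{}
	}

	var removed []string
	for k := range r.local {
		if _, ok := seen[k]; !ok {
			// Deleting while ranging over a map is safe in Go.
			delete(r.local, k)
			delete(r.persisted, k)
			removed = append(removed, k)
		}
	}
	sort.Strings(removed)
	return removed
}

func main() {
	r := newRegistry("logs/a.json", "logs/b.json", "logs/c.json")
	// Assume logs/b.json was deleted from the bucket before this poll.
	removed := r.cleanUp([]string{"logs/a.json", "logs/c.json"})
	fmt.Println("removed:", removed, "still tracked:", len(r.local))
}
```

Building the `seen` set once per listing keeps the pass linear in the number of listed plus tracked objects, which matters when the registry holds thousands of entries.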
Note that this only helps when S3 objects are removed (for example, by a lifecycle policy) and are no longer available. There should be a follow-up for cases where such removal is not done at the bucket. For example, this could be done by,
Checklist
`CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.
How to test this PR locally
Screenshots
Given below are pprof analysis comparisons for ~4K objects in the registry and after they were cleaned up by removing the S3 objects (emptying the bucket).
Related issues