Open showuon opened 2 months ago
Thanks @showuon! This may be related to KIP-1018. We have this ticket to track the fix and document it: https://github.com/Aiven-Open/tiered-storage-for-apache-kafka/issues/483
Could you try the new configuration on the broker side? Setting remote.fetch.max.wait.ms to a few seconds should be enough to avoid the interruptions.
@showuon I ran into the same issue when remote.fetch.max.wait.ms was too low, but that might not be the only reason.
Another possible contributor that we noticed in other places in the code is the use of ForkJoinPool with default concurrency (availableCores). You might run into a concurrency issue where all the threads are monopolised by parallel requests and latency starts increasing exponentially due to head-of-line blocking (we ran into the same issue with ChunkCache).
A temporary solution could be to make the fork-join pool maximum size configurable, but I believe the long-term fix is to switch to async calls all the way up to the plugin boundaries. WDYT @jeqo ?
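To make the temporary option concrete, here is a minimal sketch of a dedicated pool whose size comes from a setting instead of the core count. It is not the plugin's code: the class name BoundedFetchPool and its constructor parameter are hypothetical, and only standard java.util.concurrent calls are used.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ForkJoinPool;
import java.util.function.Supplier;

// Hypothetical wrapper, for illustration only: a ForkJoinPool whose
// parallelism is supplied by the caller (e.g. read from a config value)
// rather than defaulting to the number of available cores.
public class BoundedFetchPool {
    private final ForkJoinPool pool;

    public BoundedFetchPool(int parallelism) {
        if (parallelism <= 0) {
            throw new IllegalArgumentException("parallelism must be positive");
        }
        this.pool = new ForkJoinPool(parallelism);
    }

    // Run the task on the dedicated pool instead of the common pool,
    // so slow remote fetches cannot starve unrelated work.
    public <T> CompletableFuture<T> submit(Supplier<T> task) {
        return CompletableFuture.supplyAsync(task, pool);
    }

    public int parallelism() {
        return pool.getParallelism();
    }

    public void close() {
        pool.shutdown();
    }
}
```

This only bounds and isolates the threads; blocking calls inside the tasks would still queue behind each other, which is why async calls end to end remain the longer-term fix.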
@dopuskh3 good catch. Yes, the ForkJoinPools used by the fetch caching are created with default settings (there are 3 of them: segment chunks, indexes, and manifest). We could start by adding some monitoring along with the configurable size, so we have some evidence on whether this is the main bottleneck. I've drafted something here https://github.com/Aiven-Open/tiered-storage-for-apache-kafka/pull/593 but plan to merge some refactoring before opening it for review.
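As a rough idea of what such monitoring could read, the sketch below snapshots the counters that ForkJoinPool already exposes. The PoolStats class and the metric names are assumptions made for illustration and are not taken from the draft PR; a consistently non-zero queue with all threads active would point to the pool as the bottleneck.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ForkJoinPool;

// Hypothetical helper, for illustration only: collects point-in-time
// values from a ForkJoinPool so they can be reported as gauges.
public class PoolStats {
    public static Map<String, Long> snapshot(ForkJoinPool pool) {
        Map<String, Long> stats = new LinkedHashMap<>();
        stats.put("parallelism", (long) pool.getParallelism());
        stats.put("pool-size", (long) pool.getPoolSize());
        stats.put("active-threads", (long) pool.getActiveThreadCount());
        stats.put("running-threads", (long) pool.getRunningThreadCount());
        // Tasks submitted from outside the pool that have not started yet.
        stats.put("queued-submissions", (long) pool.getQueuedSubmissionCount());
        // Tasks waiting in worker queues (an estimate, per the JDK docs).
        stats.put("queued-tasks", pool.getQueuedTaskCount());
        stats.put("steal-count", pool.getStealCount());
        return stats;
    }
}
```

Calling PoolStats.snapshot once per pool (segment chunks, indexes, manifest) at a fixed interval would be enough to compare them.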
What happened?
When fetching data from AWS S3, I saw many error messages caused by an interrupt exception. Although the data is fetched successfully and is correct, I think we should try to fix the issue. If this is expected, maybe we should catch the exception and do some custom handling?
configs:
error logs: https://gist.github.com/showuon/8ff9e18b062c7392c6e2b189f045dc3f
What did you expect to happen?
No error logs.
What else do we need to know?
N/A