Aiven-Open / tiered-storage-for-apache-kafka

RemoteStorageManager for Apache Kafka® Tiered Storage
Apache License 2.0

many error logs output when fetching data from AWS S3 #584

Open showuon opened 2 weeks ago

showuon commented 2 weeks ago

What happened?

When fetching data from AWS S3, I see many error messages caused by an interrupt exception. Although the data is fetched successfully and is correct, I think we should try to fix this. If the interruption is expected, maybe we should catch the exception and do some custom handling?

configs:

[2024-08-30 15:04:52,250] INFO RemoteStorageManagerConfig values: 
    chunk.size = 4194304
    compression.enabled = false
    compression.heuristic.enabled = false
    custom.metadata.fields.include = []
    encryption.enabled = false
    key.prefix = 
    key.prefix.mask = false
    metrics.num.samples = 2
    metrics.recording.level = INFO
    metrics.sample.window.ms = 30000
    segment.manifest.cache.retention.ms = 3600000
    segment.manifest.cache.size = 1000
    storage.backend.class = class io.aiven.kafka.tieredstorage.storage.s3.S3Storage
 (org.apache.kafka.common.config.AbstractConfig)
[2024-08-30 15:04:52,253] INFO S3StorageConfig values: 
    aws.access.key.id = [hidden]
    aws.certificate.check.enabled = true
    aws.checksum.check.enabled = false
    aws.credentials.provider.class = null
    aws.secret.access.key = [hidden]
    s3.api.call.attempt.timeout = null
    s3.api.call.timeout = null
    s3.bucket.name = [hidden]
    s3.endpoint.url = null
    s3.multipart.upload.part.size = 5242880
    s3.path.style.access.enabled = null
    s3.region = ap-southeast-2
 (org.apache.kafka.common.config.AbstractConfig)

error logs: https://gist.github.com/showuon/8ff9e18b062c7392c6e2b189f045dc3f

What did you expect to happen?

No error logs.

What else do we need to know?

N/A

jeqo commented 2 weeks ago

Thanks @showuon! This may be related to KIP-1018. We have this ticket to track the fix and document it: https://github.com/Aiven-Open/tiered-storage-for-apache-kafka/issues/483 Could you try the new configuration on the broker side? Setting remote.fetch.max.wait.ms to a few seconds should be enough to avoid the interruptions.
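For reference, a minimal broker-side sketch of the setting mentioned above; the 5000 ms value is only an illustrative assumption, not a recommendation, and should be tuned to the observed remote read latency:

    # broker server.properties (illustrative value only)
    remote.fetch.max.wait.ms=5000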

dopuskh3 commented 1 week ago

@showuon I ran into the same issue when remote.fetch.max.wait.ms was too low, but that might not be the only reason.

Another possible contributor, which we noticed in other places in the code, is the use of ForkJoinPool with its default concurrency (the number of available cores). You can run into a concurrency issue where all the threads are monopolised by parallel requests and latency starts increasing exponentially due to head-of-line blocking (we ran into the same issue with the ChunkCache).

A temporary solution could be to make the ForkJoinPool's maximum size configurable, but I believe the long-term fix is to switch to async calls all the way up to the plugin boundaries. WDYT @jeqo ?
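For illustration only, a minimal Java sketch of a dedicated, size-configurable pool along the lines suggested above; the class name, constructor parameter, and usage below are hypothetical and are not part of the plugin's actual API:

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ForkJoinPool;
    import java.util.function.Supplier;

    // Hypothetical sketch: a dedicated pool with explicit parallelism instead of
    // ForkJoinPool.commonPool(), whose size defaults to the number of available cores.
    public class ConfigurableFetchPool implements AutoCloseable {
        private final ForkJoinPool pool;

        public ConfigurableFetchPool(final int parallelism) {
            this.pool = new ForkJoinPool(parallelism);
        }

        public <T> CompletableFuture<T> submit(final Supplier<T> task) {
            // Running fetch tasks on a dedicated pool keeps them from monopolising the
            // JVM-wide common pool and causing head-of-line blocking for other work.
            return CompletableFuture.supplyAsync(task, pool);
        }

        @Override
        public void close() {
            pool.shutdown();
        }
    }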

jeqo commented 4 days ago

@dopuskh3 Good catch. Yes, the ForkJoinPools used by the fetch caching are left at their defaults (there are three of them: segment chunks, indexes, and manifest). We could start by adding some monitoring alongside the configurable size, so we have some evidence that this is the main bottleneck. I've drafted something here https://github.com/Aiven-Open/tiered-storage-for-apache-kafka/pull/593 but plan to merge some refactoring before opening it for review.
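As a rough idea of what such monitoring could look like, ForkJoinPool already exposes its own statistics through JDK accessors; the helper below is only an illustrative sketch and says nothing about how the plugin would actually register these values as metrics:

    import java.util.concurrent.ForkJoinPool;

    // Illustrative sketch: sampling a ForkJoinPool's built-in statistics.
    // The accessor methods are real JDK APIs; the log format is made up for this example.
    public class ForkJoinPoolStats {
        public static void log(final String poolName, final ForkJoinPool pool) {
            System.out.printf(
                "%s: parallelism=%d poolSize=%d active=%d running=%d queuedSubmissions=%d queuedTasks=%d%n",
                poolName,
                pool.getParallelism(),        // configured target parallelism
                pool.getPoolSize(),           // worker threads currently started
                pool.getActiveThreadCount(),  // threads executing or stealing tasks
                pool.getRunningThreadCount(), // threads not blocked waiting on joins
                pool.getQueuedSubmissionCount(),
                pool.getQueuedTaskCount());
        }
    }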