Aiven-Open / tiered-storage-for-apache-kafka

RemoteStorageManager for Apache Kafka® Tiered Storage
Apache License 2.0

Some local partitions' segments don't get purged when over retention time #596

Open bingkunyangvungle opened 1 week ago

bingkunyangvungle commented 1 week ago

What happened?

We keep the local retention at 20% of the total retention; the configuration looks like this:

      config = {
        "retention.ms"          = 86400000 / 8     #  3 hours 
        "local.retention.ms"    = 86400000 / 8 / 5 # 20% local data
        "remote.storage.enable" = true
      }
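For reference, the equivalent per-topic settings could also be applied through Kafka's AdminClient. Below is a minimal sketch (the topic name and bootstrap address are placeholders, not from the report), using the millisecond values the expressions above resolve to:

    import java.util.Collection;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class ApplyRetentionConfig {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder bootstrap address.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (Admin admin = Admin.create(props)) {
                // 86400000 / 8     = 10800000 ms (3 hours) total retention
                // 86400000 / 8 / 5 =  2160000 ms (36 minutes) kept on local disk
                ConfigResource topic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "topic"); // placeholder name
                Collection<AlterConfigOp> ops = List.of(
                    new AlterConfigOp(new ConfigEntry("retention.ms", "10800000"),
                                      AlterConfigOp.OpType.SET),
                    new AlterConfigOp(new ConfigEntry("local.retention.ms", "2160000"),
                                      AlterConfigOp.OpType.SET),
                    new AlterConfigOp(new ConfigEntry("remote.storage.enable", "true"),
                                      AlterConfigOp.OpType.SET));
                admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
            }
        }
    }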

Normally each partition has about 9~11 segments stored locally, but sometimes, for a certain partition, the cluster seems to 'forget' to delete the local segments that are past the retention policy. As a result, the number of segments grows continuously and the broker's data size keeps increasing as well, causing high disk utilization. After we observed the issue, restarting the Kafka service on the broker that is the leader for the affected partition caused the out-of-retention segments to be purged.

This is what happened before and after restarting the leader for the partition: [image attachment]
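To spot affected partitions without waiting for disk alerts, one option is to poll per-replica on-disk sizes via the AdminClient. This is only a rough sketch; the broker IDs, bootstrap address, and size threshold are placeholders, not values from the report:

    import java.util.Map;
    import java.util.Properties;
    import java.util.Set;

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.LogDirDescription;
    import org.apache.kafka.clients.admin.ReplicaInfo;
    import org.apache.kafka.common.TopicPartition;

    public class LocalSizeCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder bootstrap address.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (Admin admin = Admin.create(props)) {
                // Broker IDs to inspect -- placeholders for the real cluster.
                Map<Integer, Map<String, LogDirDescription>> byBroker =
                    admin.describeLogDirs(Set.of(0, 1, 2)).allDescriptions().get();

                byBroker.forEach((broker, dirs) ->
                    dirs.forEach((dir, desc) -> {
                        for (Map.Entry<TopicPartition, ReplicaInfo> e :
                                desc.replicaInfos().entrySet()) {
                            long bytes = e.getValue().size();
                            // Flag replicas whose local footprint exceeds what
                            // local.retention should allow; 10 GiB is an
                            // arbitrary example threshold.
                            if (bytes > 10L * 1024 * 1024 * 1024) {
                                System.out.printf("%s broker=%d dir=%s size=%d%n",
                                    e.getKey(), broker, dir, bytes);
                            }
                        }
                    }));
            }
        }
    }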

Kafka version: 3.7.0

Tiered Storage version: 2024-04-02-1712056402

What did you expect to happen?

The out-of-retention segments would be purged automatically.

What else do we need to know?

Not sure whether this is an issue in Kafka itself or in the plugin, so this issue might be a good starting point for discussion.

bingkunyangvungle commented 1 week ago

In the logs before the broker restart, there are only Rolled new log segment entries:

[2024-09-21 21:50:07,077] INFO [ProducerStateManager partition=topic-125] Wrote producer snapshot at offset 63538937203 with 0 producer ids in 1 ms. (org.apache.kafka.storage.internals.log.ProducerStateManager)
[2024-09-21 21:52:48,432] INFO [LocalLog partition=topic-125, dir=/data/kafka] Rolled new log segment at offset 63541597236 in 1 ms. (kafka.log.LocalLog)
[2024-09-21 21:52:48,432] INFO [ProducerStateManager partition=topic-125] Wrote producer snapshot at offset 63541597236 with 0 producer ids in 1 ms. (org.apache.kafka.storage.internals.log.ProducerStateManager)
[2024-09-21 21:55:29,915] INFO [LocalLog partition=topic-125, dir=/data/kafka] Rolled new log segment at offset 63544256614 in 0 ms. (kafka.log.LocalLog)

And these were the only logs found for several hours before restarting the broker; no deletion logs were found.

jeqo commented 5 days ago

Sounds related to https://issues.apache.org/jira/browse/KAFKA-16511 -- could you try this on 3.7.1 and see if it's still an issue?