confluentinc / cp-helm-charts

The Confluent Platform Helm charts enable you to deploy Confluent Platform services on Kubernetes for development, test, and proof of concept environments.
https://cnfl.io/getting-started-kafka-kubernetes
Apache License 2.0

Kafka brokers run out of disk space after a few days #400

Open romanlv opened 4 years ago

romanlv commented 4 years ago

Running the standard configuration in Google Cloud with ksql and connect disabled.

It works fine for several days (3-4 days) with minimal usage (it's a dev environment), but eventually something occupies all available disk space:

ERROR Error while loading log dir /opt/kafka/data-0/logs (kafka.log.LogManager)
java.io.IOException: No space left on device
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
    at kafka.log.ProducerStateManager$.kafka$log$ProducerStateManager$$writeSnapshot(ProducerStateManager.scala:449)
    at kafka.log.ProducerStateManager.takeSnapshot(ProducerStateManager.scala:671)
    at kafka.log.Log.recoverSegment(Log.scala:652)
    at kafka.log.Log.recoverLog(Log.scala:788)
    at kafka.log.Log.$anonfun$loadSegments$3(Log.scala:724)
    at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
    at kafka.log.Log.retryOnOffsetOverflow(Log.scala:2346)
    at kafka.log.Log.loadSegments(Log.scala:724)
    at kafka.log.Log.<init>(Log.scala:298)
    at kafka.log.Log$.apply(Log.scala:2480)
    at kafka.log.LogManager.loadLog(LogManager.scala:283)
    at kafka.log.LogManager.$anonfun$loadLogs$12(LogManager.scala:353)
    at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:65)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
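
A quick way to see what is actually filling the volume is to exec into the broker pod and measure the data directory. A minimal sketch, assuming a Helm release named my-confluent and the chart's default data mount; the pod and container names are placeholders, adjust them to your install:

# List the broker pods, then check volume usage and the largest topic directories.
kubectl get pods -l app=cp-kafka
kubectl exec my-confluent-cp-kafka-0 -c cp-kafka-broker -- df -h /opt/kafka/data-0
kubectl exec my-confluent-cp-kafka-0 -c cp-kafka-broker -- \
  sh -c 'du -sm /opt/kafka/data-0/logs/* | sort -n | tail -20'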

It looks like the TRACE log level is active for the Kafka brokers, but I'm not sure how to change it. I tried with KAFKA_LOG4J_ROOT_LOGLEVEL:

cp-kafka:
  persistence:
    size: 10Gi
  customEnv:
    KAFKA_LOG4J_ROOT_LOGLEVEL: WARN

but it does not make any difference.

How can I change the log level or enable log rotation?
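
For what it's worth, the confluentinc/cp-kafka image also reads a KAFKA_LOG4J_LOGGERS variable for per-logger levels, so a variant of the snippet above could look like the sketch below (logger names and levels are illustrative). As far as I can tell the image sends its log4j output to the container's stdout, so broker application logging is normally not what fills the data volume; /opt/kafka/data-0/logs is the topic data directory.

cp-kafka:
  persistence:
    size: 10Gi
  customEnv:
    KAFKA_LOG4J_ROOT_LOGLEVEL: WARN
    KAFKA_LOG4J_LOGGERS: "kafka.controller=WARN,state.change.logger=WARN"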

romanlv commented 4 years ago

Looks like setting `log.retention` should help; will try:

cp-kafka:
  configurationOverrides:
    "log.retention.hours": 24

zulrang commented 3 years ago

The question is: why is it filling up space? This is happening with a default install, and nothing is actually using the cluster aside from itself.

BenMemi commented 3 years ago

Also experiencing this issue; the retention-hours override above doesn't seem to have worked.

zulrang commented 3 years ago

Yep. I even tried changing all the topics to a 1 GB retention as well, and it still fills up after a couple of days.
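
One hedged way to confirm whether topic-level overrides actually took effect is to describe the topic configuration from inside a broker pod; the pod and topic names here are placeholders:

kubectl exec my-confluent-cp-kafka-0 -c cp-kafka-broker -- \
  kafka-configs --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name my-topic --describe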

hextrim commented 3 years ago

I just deployed a cluster via the operator, which I guess would be the same thing as deploying it by charts. It ran out of disk immediately. I re-ran the deployment for Kafka only, like:

cat confluent-kafka-only.yml
apiVersion: platform.confluent.io/v1beta1
kind: Kafka
[...]
  configOverrides:
    server:

to change the retention to something smaller; however, this won't clean up the existing storage, which is exhausted anyway.

Can I get any help, like some guidance on how to clean up the filled-up log space?

I may end up redeploying the whole thing with the overrides above, but it seems like others have the same problem even with this flag enabled.

Any help much appreciated.

At pod boot time I get:

[ERROR] 2021-08-26 15:45:00,617 [pool-7-thread-1] kafka.server.LogDirFailureChannel error - Error while writing to checkpoint file /mnt/data/data0/logs/_confluent_balancer_broker_samples-13/leader-epoch-checkpoint
java.io.IOException: No space left on device
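
A hedged sketch of one way to reclaim space without deleting segment files by hand: temporarily lower retention on the largest topics and let the broker delete old segments, then remove the override. This assumes the broker is still reachable; the pod, container, and topic names below are only examples (the topic is taken from the error above):

# Temporarily shrink retention on a heavy topic.
kubectl exec kafka-0 -c kafka -- \
  kafka-configs --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name _confluent_balancer_broker_samples \
  --alter --add-config retention.ms=60000

# Wait for old segments to be deleted, then remove the override again.
kubectl exec kafka-0 -c kafka -- \
  kafka-configs --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name _confluent_balancer_broker_samples \
  --alter --delete-config retention.ms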

PiePra commented 2 years ago

We are facing the same issue with chart version 0.6.1. Setting log retention does not address it, and increasing the PVC size just delays the "no space left on device" error.

nlonginow commented 2 years ago

I'm still seeing this issue. When I exec into the Kafka broker pod, the file in question (/opt/kafka/data-0/...) does not even exist. Why does it say out of space when the file in question is not even there? BTW, I have all the log retention settings correct, and they show up in Confluent Control Center as expected (i.e. 1 hour retention, 1M size limit, etc.). It's like the Kafka log retention code is not working at all.
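
One hedged way to check whether the retention overrides actually reached the broker, rather than only showing up in Control Center, is to dump the broker's effective configuration. The pod name and broker id are placeholders, and --all needs a reasonably recent Kafka (without it only dynamic overrides are listed):

kubectl exec my-confluent-cp-kafka-0 -c cp-kafka-broker -- \
  kafka-configs --bootstrap-server localhost:9092 \
  --entity-type brokers --entity-name 0 --describe --all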

payneBrandon commented 2 years ago

Seeing this same issue; has anyone made progress on this? We've tried overriding the log retention using both time and byte size, with no luck.

BenMemi commented 2 years ago

What you want to do is change the log cleanup policy to delete. That fixes the issue. I can drop my config file here if needed.

payneBrandon commented 2 years ago

@BenM-Mycelium thanks for the response! Do you mean setting something like "log.cleanup.policy": "delete"?

BenMemi commented 2 years ago

Yes, correct.
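
For reference, a minimal sketch of how this might look in the chart values, combining the cleanup policy with the retention settings discussed above; the values are illustrative only:

cp-kafka:
  configurationOverrides:
    "log.cleanup.policy": "delete"
    "log.retention.hours": "24"
    "log.retention.bytes": "268435456"   # enforced per partition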

payneBrandon commented 2 years ago

Testing this out now, thanks again for the help! For anyone else peeking in, an additional setting we're using that I didn't fully understand at first is log.retention.bytes. When I looked at the documentation more closely, this limit is enforced at the partition level, not the topic level. For my project we're using 8 partitions (so 8x the limit I anticipated), which left my disk woefully undersized. I'll let this run for a bit to see if the delete policy functions as expected.
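
A rough worked example of that sizing, assuming the Kafka defaults of log.segment.bytes = 1 GiB and a replication factor of 1; retention only removes closed segments, so the active segment can add up to one segment size per partition:

disk ceiling for the topic ≈ partitions × replication × (log.retention.bytes + log.segment.bytes)
                           ≈ 8 × 1 × (1 GiB + 1 GiB)
                           = 16 GiB across the cluster

Spread over only a few brokers, that can easily exceed a 10Gi persistent volume like the one in the original report.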

BenMemi commented 2 years ago

How did you go, out of interest?

payneBrandon commented 2 years ago

> How did you go, out of interest?

Hey @BenM-Mycelium, I'm just now seeing this reply, sorry about that. I ended up boosting the disk size quite a bit and setting a short expiration (10 minutes) with a 0.25 GB log.retention.bytes setting. At this point things are up and running, and I can see the topics level off at an appropriate size.