Cruise Control metrics reporter failing on k8s cluster with cgroup v2

robinvanderstraeten-klarrio commented 1 year ago

Description

Cruise Control currently does not support running on a cluster with cgroup v2 when cruise.control.metrics.reporter.kubernetes.mode is set to true (see https://github.com/linkedin/cruise-control/issues/1873). Koperator always sets this to true (https://github.com/banzaicloud/koperator/blob/v0.25.1/pkg/resources/kafka/configmap.go#L105) and, AFAIK, there is currently no way to override this configuration.

Expected Behavior

The Cruise Control metrics collector should collect and publish metrics about the Kafka brokers.

Actual Behavior

The Cruise Control metrics collector crashes. The following appears once per minute in the logs of every broker:

[2023-08-16 14:28:31,040] WARN Failed reporting CPU util. (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
java.io.FileNotFoundException: /sys/fs/cgroup/cpu/cpu.cfs_quota_us (No such file or directory)
        at java.base/java.io.FileInputStream.open0(Native Method)
        at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
        at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
        at java.base/java.io.FileInputStream.<init>(FileInputStream.java:112)
        at com.linkedin.kafka.cruisecontrol.metricsreporter.metric.ContainerMetricUtils.readFile(ContainerMetricUtils.java:62)
        at com.linkedin.kafka.cruisecontrol.metricsreporter.metric.ContainerMetricUtils.getCpuQuota(ContainerMetricUtils.java:42)
        at com.linkedin.kafka.cruisecontrol.metricsreporter.metric.ContainerMetricUtils.getContainerProcessCpuLoad(ContainerMetricUtils.java:92)
        at com.linkedin.kafka.cruisecontrol.metricsreporter.metric.MetricsUtils.getCpuMetric(MetricsUtils.java:409)
        at com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter.reportCpuUtils(CruiseControlMetricsReporter.java:449)
        at com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter.run(CruiseControlMetricsReporter.java:367)
        at java.base/java.lang.Thread.run(Thread.java:829)

This also has a side effect: Cruise Control doesn't seem to be able to deal with the fact that it is not getting these metrics. Its memory usage grows until it is eventually OOM killed.
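
For context on the missing file in the trace above: on cgroup v1 the CPU quota and period live in /sys/fs/cgroup/cpu/cpu.cfs_quota_us and cpu.cfs_period_us, while on cgroup v2 both values are in a single /sys/fs/cgroup/cpu.max file ("<quota> <period>", with quota "max" meaning unlimited). The sketch below is not Cruise Control's code (which is Java); it is a minimal Python illustration of a lookup that tolerates both layouts. The function name read_cpu_limit and its cgroup_root parameter are made up for this example.

```python
from pathlib import Path
from typing import Optional, Tuple


def read_cpu_limit(cgroup_root: str = "/sys/fs/cgroup") -> Optional[Tuple[int, int]]:
    """Return (quota_us, period_us), or None if no CPU limit can be determined.

    Hypothetical helper for illustration only; not part of Cruise Control.
    """
    root = Path(cgroup_root)

    # cgroup v2: a single file holding "<quota> <period>" or "max <period>".
    v2_file = root / "cpu.max"
    if v2_file.is_file():
        quota, period = v2_file.read_text().split()
        if quota == "max":
            return None  # no CPU limit configured
        return int(quota), int(period)

    # cgroup v1: separate files under the cpu controller directory.
    v1_quota = root / "cpu" / "cpu.cfs_quota_us"
    v1_period = root / "cpu" / "cpu.cfs_period_us"
    if v1_quota.is_file() and v1_period.is_file():
        quota = int(v1_quota.read_text().strip())
        period = int(v1_period.read_text().strip())
        if quota < 0:
            return None  # -1 means unlimited on cgroup v1
        return quota, period

    # Neither layout found: report "unknown" instead of raising.
    return None
```

Returning None rather than raising when neither file exists avoids the FileNotFoundException-style failure shown in the log, at the cost of the caller having to handle the "no limit known" case.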

Affected Version

Seen on version 0.24.1, though this will be a problem on all versions where cruise.control.metrics.reporter.kubernetes.mode is set to true.

Steps to Reproduce

Deploy a Kubernetes cluster with nodes that have cgroup v2.
Deploy koperator.
Deploy a basic KafkaCluster. Any configuration that also causes Cruise Control to be deployed should work.
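
To confirm the first step, one rough way to check which cgroup version a node or container sees is to look for the cgroup.controllers file, which exists at the root of a cgroup v2 unified hierarchy. The snippet below is a heuristic sketch to run on the node or inside a broker pod; the function name cgroup_version is an assumption for this example, and hybrid setups are reported as v1.

```python
from pathlib import Path


def cgroup_version(cgroup_root: str = "/sys/fs/cgroup") -> str:
    """Heuristically report 'v2', 'v1', or 'unknown' for the given cgroup mount.

    Hypothetical helper for illustration only.
    """
    root = Path(cgroup_root)
    # The unified (v2) hierarchy exposes cgroup.controllers at its root.
    if (root / "cgroup.controllers").is_file():
        return "v2"
    # v1 mounts each controller in its own directory, e.g. .../cpu.
    if (root / "cpu").is_dir():
        return "v1"
    return "unknown"


if __name__ == "__main__":
    print(cgroup_version())
```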

Checklist

[x] I have read the contributing guidelines
[x] I have verified this does not duplicate an existing issue

panyuenlau commented 1 year ago

Thanks for reporting this, @robinvanderstraeten-klarrio! We've seen this behavior internally but didn't get the chance to create a dedicated GitHub issue.

robinvanderstraeten-klarrio commented 1 year ago

Reading through the Cruise Control issue, it seems that simply removing the cruise.control.metrics.reporter.kubernetes.mode setting would fix this, but I'm not too knowledgeable about Cruise Control in general or the impact this would have on a production deployment. If this would be a good solution, I'd be happy to contribute it.

panyuenlau commented 1 year ago

I don't think we should remove the cruise.control.metrics.reporter.kubernetes.mode configuration. It was added to resolve a CPU utilization reporting issue; see https://github.com/banzaicloud/koperator/issues/463

Perhaps the best way is to wait for upstream CC to fix their issue with cgroup v2 so we can adapt in Koperator.
