didi / KnowStreaming

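A one-stop cloud-native real-time streaming data platform that builds enterprise-grade Kafka services through a zero-intrusion, plugin-based design, greatly lowering the barrier to operating, storing, and managing real-time streaming data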
https://knowstreaming.com
GNU Affero General Public License v3.0
6.9k stars 1.28k forks

KafkaClientPool connection pool issue #384

Closed Huyueeer closed 2 years ago

Huyueeer commented 2 years ago

Digging through the logs, I found the following error:

2021-09-24 05:31:00.243 [Collect-Metrics-Thread-1-218] ERROR c.x.kafka.manager.service.cache.KafkaClientPool - borrow kafka consumer client failed, clusterDO:ClusterDO{id=11, clusterName='2-77', zookeeper='192.168.2.77:2181', bootstrapServers='192.168.2.77:9093', securityProperties='{
    "security.protocol": "SASL_PLAINTEXT",
    "sasl.mechanism": "PLAIN",
    "sasl.jaas.config": "org.apache.kafka.common.security.scram.ScramLoginModule required username=\"adc\" password=\"ccccccc\";"
}', jmxProperties='', status=1, gmtCreate=Sat May 08 05:54:38 UTC 2021, gmtModify=Mon Jul 26 01:10:35 UTC 2021}.
java.util.NoSuchElementException: Timeout waiting for idle object
    at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:439)
    at com.xiaojukeji.kafka.manager.service.cache.KafkaClientPool.borrowKafkaConsumerClient(KafkaClientPool.java:135)
    at com.xiaojukeji.kafka.manager.service.service.impl.TopicServiceImpl.getPartitionOffset(TopicServiceImpl.java:343)
    at com.xiaojukeji.kafka.manager.task.dispatch.metrics.collect.CollectAndPublishCGData.getTopicConsumerMetrics(CollectAndPublishCGData.java:121)
    at com.xiaojukeji.kafka.manager.task.dispatch.metrics.collect.CollectAndPublishCGData.access$000(CollectAndPublishCGData.java:35)
    at com.xiaojukeji.kafka.manager.task.dispatch.metrics.collect.CollectAndPublishCGData$1.call(CollectAndPublishCGData.java:82)
    at com.xiaojukeji.kafka.manager.task.dispatch.metrics.collect.CollectAndPublishCGData$1.call(CollectAndPublishCGData.java:79)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

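It looks like the connection pool broke down, which then affected TopicService and some cron tasks. During this period the Kafka cluster had some problems that caused it to restart (there were manual restarts as well), and afterwards LogiKM hit the error above. This also looks like a chronic problem that shows up after running for a long time, so my guess is that resources are not being released.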

ZQKC commented 2 years ago

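Based on the information you provided, another possibility is that KafkaConsumer clients are not being returned to the pool promptly. LogiKM currently uses the 0.10.2 client. When a cluster has problems, a KafkaConsumer call times out after 15 seconds, and it takes even longer in cases such as a Topic that does not exist. If KafkaConsumer clients are borrowed in many places, the pool ends up with no client available.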
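
To illustrate the failure mode described above, here is a minimal sketch of the borrow/return discipline. It is not LogiKM's code: `SimplePool` is a hypothetical, stdlib-only stand-in for a bounded pool such as commons-pool2's `GenericObjectPool`, and `withPooled`, the string "clients", and the timeouts are all assumptions made for the example. It shows that returning the client in a `finally` block keeps the pool usable even when the call fails, while a borrowed client that is never returned makes the next borrow time out with the same kind of `NoSuchElementException` seen in the log.

```java
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

public class BorrowReturnSketch {

    /** Hypothetical bounded pool; a simplified stand-in, not the commons-pool2 API. */
    static final class SimplePool<T> {
        private final BlockingQueue<T> idle;

        SimplePool(List<T> objects) {
            this.idle = new ArrayBlockingQueue<>(Math.max(1, objects.size()), false, objects);
        }

        /** Waits up to maxWaitMillis for an idle object, then gives up. */
        T borrow(long maxWaitMillis) {
            try {
                T obj = idle.poll(maxWaitMillis, TimeUnit.MILLISECONDS);
                if (obj == null) {
                    throw new NoSuchElementException("Timeout waiting for idle object");
                }
                return obj;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                throw new NoSuchElementException("Interrupted while waiting for idle object");
            }
        }

        void giveBack(T obj) {
            idle.offer(obj);
        }

        int idleCount() {
            return idle.size();
        }
    }

    /** Borrows a client, runs the work, and always returns the client to the pool. */
    static <T, R> R withPooled(SimplePool<T> pool, long maxWaitMillis, Function<T, R> work) {
        T client = pool.borrow(maxWaitMillis);
        try {
            return work.apply(client);
        } finally {
            pool.giveBack(client); // runs on success, exception, or timeout inside work
        }
    }

    public static void main(String[] args) {
        SimplePool<String> pool = new SimplePool<>(List.of("consumer-1"));

        // The call fails, but the client still goes back to the pool.
        try {
            withPooled(pool, 100, c -> { throw new IllegalStateException("broker unavailable"); });
        } catch (IllegalStateException e) {
            System.out.println("call failed, idle clients: " + pool.idleCount());
        }

        // A borrow without a matching return leaks the only client.
        String leaked = pool.borrow(100);
        try {
            pool.borrow(100);
        } catch (NoSuchElementException e) {
            System.out.println("pool exhausted: " + e.getMessage());
        }
    }
}
```

Even with correct returns, a pool can still run dry if each holder blocks for the full client timeout, so the pool size and borrow wait time would need to be chosen with the 15-second call timeout in mind.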

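The KafkaConsumer pool should already be isolated per cluster, so a problem in one cluster should not cause other clusters to report this error. The thread pool, however, is not isolated. It could be isolated later as well, which would make the isolation of metrics collection tasks between clusters more thorough. The Kafka client version used by LogiKM will also be upgraded later.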
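
One way the thread pool isolation mentioned above could look is sketched below. This is only an assumption about a possible design, not the planned LogiKM change: the class `PerClusterExecutors`, its methods, and the pool size are hypothetical. Each cluster ID gets its own small fixed-size executor, so collection tasks stuck on one cluster cannot occupy the threads that another cluster needs.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class PerClusterExecutors {
    // One executor per cluster ID, created lazily on first use.
    private final Map<Long, ExecutorService> executors = new ConcurrentHashMap<>();
    private final int threadsPerCluster;

    public PerClusterExecutors(int threadsPerCluster) {
        this.threadsPerCluster = threadsPerCluster;
    }

    /** Submits a task to the executor that belongs to the given cluster only. */
    public <T> Future<T> submit(long clusterId, Callable<T> task) {
        return executors
                .computeIfAbsent(clusterId, id -> Executors.newFixedThreadPool(threadsPerCluster))
                .submit(task);
    }

    public int poolCount() {
        return executors.size();
    }

    /** Interrupts running tasks and stops every per-cluster executor. */
    public void shutdown() {
        executors.values().forEach(ExecutorService::shutdownNow);
    }

    public static void main(String[] args) throws Exception {
        PerClusterExecutors pools = new PerClusterExecutors(1);

        // Cluster 11 is slow: its single worker thread is blocked.
        pools.submit(11L, () -> { Thread.sleep(5_000); return "slow"; });

        // Cluster 12 has its own worker, so its task is not queued behind cluster 11.
        Future<String> healthy = pools.submit(12L, () -> "offsets collected");
        System.out.println(healthy.get(1, TimeUnit.SECONDS));

        pools.shutdown();
    }
}
```

The trade-off is more threads overall (one pool per cluster), so the per-cluster size would need to stay small when many clusters are managed.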