didi / KnowStreaming

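A one-stop, cloud-native real-time streaming data platform that builds enterprise-grade Kafka services through a zero-intrusion, plugin-based design, greatly lowering the barrier to operating, storing, and managing real-time streaming data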
https://knowstreaming.com
GNU Affero General Public License v3.0
6.99k stars, 1.28k forks

Kafka JMX UnknownHostException #1009

Closed (menghe999 closed this issue 1 year ago)

menghe999 commented 1 year ago

Environment

Steps to reproduce

Add a Kafka cluster (with Kerberos and JMX already enabled)

image

The cluster's record in the database

image

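Expected result

I expect the platform to be able to monitor the detailed topic information of the Kafka cluster

Actual result

The JMX port is working normally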

image
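
A quick way to confirm that a JMX port accepts connections is a plain TCP connect from the machine running the platform. The sketch below is only an illustration: the helper name tcp_port_open is hypothetical, and the host in the usage comment is a placeholder, while port 9393 is the one that appears in the trace in this report. A successful connect is necessary but not sufficient, because JMX over RMI then returns a stub that carries the broker's own hostname (and possibly a second port), which the client must also be able to resolve and reach.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures
        return False


# Example (placeholder host, port taken from the trace in this report):
#   tcp_port_open("broker-host.example", 9393)
```

If this returns True but metrics are still missing, the problem is more likely in the RMI hostname the broker advertises, in authentication, or further downstream in the metrics pipeline.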

Topic details cannot be retrieved

image

If there is an exception, please attach the exception trace:

2023-05-08 09:32:08.367 [http-nio-8090-exec-7] ERROR class=c.x.know.streaming.km.common.jmx.JmxConnectorWrap||JMX connect exception, clientLogIdent:clusterPhyId: null host:null port:9393.
java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.ConfigurationException [Root exception is java.rmi.UnknownHostException: Unknown host: null; nested exception is: 
        java.net.UnknownHostException: null]
        at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:369)
        at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:270)
        at com.xiaojukeji.know.streaming.km.common.jmx.JmxConnectorWrap.createJmxConnector(JmxConnectorWrap.java:176)
        at com.xiaojukeji.know.streaming.km.common.jmx.JmxConnectorWrap.checkJmxConnectionAndInitIfNeed(JmxConnectorWrap.java:74)
        at com.xiaojukeji.know.streaming.km.persistence.jmx.impl.JmxDAOImpl.getJmxValue(JmxDAOImpl.java:30)
        at com.xiaojukeji.know.streaming.km.persistence.jmx.impl.JmxDAOImpl.getJmxValue(JmxDAOImpl.java:22)
        at com.xiaojukeji.know.streaming.km.persistence.jmx.impl.JmxDAOImpl$$FastClassBySpringCGLIB$$dbd70ca8.invoke(<generated>)
        at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:771)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:749)
        at org.springframework.dao.support.PersistenceExceptionTranslationInterceptor.invoke(PersistenceExceptionTranslationInterceptor.java:139)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:749)
        at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:691)
        at com.xiaojukeji.know.streaming.km.persistence.jmx.impl.JmxDAOImpl$$EnhancerBySpringCGLIB$$ed03b232.getJmxValue(<generated>)
        at com.xiaojukeji.know.streaming.km.core.service.cluster.impl.ClusterValidateServiceImpl.checkZKLegalAndTryGetInfo(ClusterValidateServiceImpl.java:181)
        at com.xiaojukeji.know.streaming.km.core.service.cluster.impl.ClusterValidateServiceImpl.getDataAndIgnoreCheckBSLegal(ClusterValidateServiceImpl.java:133)
        at com.xiaojukeji.know.streaming.km.core.service.cluster.impl.ClusterValidateServiceImpl.checkKafkaLegal(ClusterValidateServiceImpl.java:72)
        at com.xiaojukeji.know.streaming.km.rest.api.v3.util.UtilsController.validateKafka(UtilsController.java:43)

ZQKC commented 1 year ago

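Are there any other error logs? The error you posted is not the main problem; it only appears while a cluster is being added, and it has already been fixed on the master branch.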

menghe999 commented 1 year ago

# tailf log_error.log | grep 'clusterPhyId=1'
2023-05-08 12:08:06.464 [MetricCollect-Shard-1-9-thread-6] ERROR class=c.x.k.s.k.c.s.h.c.c.HealthCheckClusterService||method=checkClusterNoController||param=ClusterPhyParam(clusterPhyId=1)||config=HealthCompareValueConfig(value=1.0)||errMsg=get metrics from es failed, activeControllerCount is null
2023-05-08 12:08:06.511 [MetadataTaskTP-6-thread-11] ERROR class=c.x.k.s.k.t.k.metadata.SyncBrokerConfigDiffTask||method=processSubTask||clusterPhyId=1||data=BrokerConfigPO(clusterPhyId=1, brokerId=66, configName=listeners, configValue=SASL_PLAINTEXT://pa3:6868,, diffType=1)||errMsg=exception!

ZQKC commented 1 year ago

Apart from the UnknownHost error, are there any logs showing that the JMX connection failed?

menghe999 commented 1 year ago

tail -5000f log_error.log  | grep 'clusterPhyId=1' 
....
2023-05-11 15:48:34.479 [MetricCollect-Shard-2-10-thread-48] ERROR class=c.x.k.s.k.c.s.h.c.topic.HealthCheckTopicService||method=checkTopicUnderReplicatedPartition||param=TopicParam{clusterPhyId=1, topicName='producer-test-DefaultPartitioner-10'}||config=HealthDetectedInLatestMinutesConfig(latestMinutes=10, detectedTimes=8)||result=Result{message='失败', code=1, data=null}||errMsg=search metrics from es failed
2023-05-11 15:48:34.479 [MetricCollect-Shard-2-10-thread-22] ERROR class=c.x.k.s.k.c.s.h.c.topic.HealthCheckTopicService||method=checkTopicUnderReplicatedPartition||param=TopicParam{clusterPhyId=1, topicName='__consumer_offsets'}||config=HealthDetectedInLatestMinutesConfig(latestMinutes=10, detectedTimes=8)||result=Result{message='失败', code=1, data=null}||errMsg=search metrics from es failed
2023-05-11 15:49:04.636 [MetricCollect-Shard-1-9-thread-47] ERROR class=c.x.k.s.k.c.s.h.c.c.HealthCheckClusterService||method=checkClusterNoController||param=ClusterPhyParam(clusterPhyId=1)||config=HealthCompareValueConfig(value=1.0)||errMsg=get metrics from es failed, activeControllerCount is null

Almost all of them are exception logs printed by method=checkxxx.
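
One way to back up the observation that nearly every entry comes from a check method is to tally the log lines by method and errMsg. The sketch below is not part of KnowStreaming. It assumes the "||"-delimited key=value layout seen in the lines above, and parse_fields and tally are hypothetical helper names; the sample lines are copied from this comment.

```python
from collections import Counter

SAMPLE_LINES = [
    "2023-05-11 15:48:34.479 [MetricCollect-Shard-2-10-thread-48] ERROR class=c.x.k.s.k.c.s.h.c.topic.HealthCheckTopicService||method=checkTopicUnderReplicatedPartition||param=TopicParam{clusterPhyId=1, topicName='producer-test-DefaultPartitioner-10'}||config=HealthDetectedInLatestMinutesConfig(latestMinutes=10, detectedTimes=8)||result=Result{message='失败', code=1, data=null}||errMsg=search metrics from es failed",
    "2023-05-11 15:48:34.479 [MetricCollect-Shard-2-10-thread-22] ERROR class=c.x.k.s.k.c.s.h.c.topic.HealthCheckTopicService||method=checkTopicUnderReplicatedPartition||param=TopicParam{clusterPhyId=1, topicName='__consumer_offsets'}||config=HealthDetectedInLatestMinutesConfig(latestMinutes=10, detectedTimes=8)||result=Result{message='失败', code=1, data=null}||errMsg=search metrics from es failed",
    "2023-05-11 15:49:04.636 [MetricCollect-Shard-1-9-thread-47] ERROR class=c.x.k.s.k.c.s.h.c.c.HealthCheckClusterService||method=checkClusterNoController||param=ClusterPhyParam(clusterPhyId=1)||config=HealthCompareValueConfig(value=1.0)||errMsg=get metrics from es failed, activeControllerCount is null",
]


def parse_fields(line: str) -> dict:
    """Split one '||'-delimited log line into a dict of key=value fields."""
    fields = {}
    for part in line.split("||"):
        # Split at the first '=' only, so values may contain '=' themselves
        key, sep, value = part.partition("=")
        if sep and key.strip():
            # The first segment is prefixed by timestamp, thread and level;
            # the actual key is the last whitespace-separated token.
            fields[key.split()[-1]] = value
    return fields


def tally(lines) -> Counter:
    """Count log lines per (method, errMsg) pair."""
    counts = Counter()
    for line in lines:
        fields = parse_fields(line)
        if "errMsg" in fields:
            counts[(fields.get("method", "?"), fields["errMsg"].strip())] += 1
    return counts


for (method, message), count in tally(SAMPLE_LINES).items():
    print(count, method, message)
```

On the three sample lines, both distinct messages report a failed read from ES, which suggests looking at Elasticsearch rather than at JMX or the Kafka configuration.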

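I also registered this cluster with logikm v2.6, and topic traffic information is displayed correctly there. Does that confirm there is nothing wrong with the Kafka cluster configuration?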

image

szflfeiyu commented 1 year ago

While we are on the subject, a question of my own: I filled in the JMX port when configuring the brokers, and the value in the database table is correct, but metrics are never collected. What could be the cause? Fetching them individually always shows port 9099, so the configured port does not take effect.

2023-05-12 09:33:38.990 ERROR 2380 --- [kTP-5-thread-13] c.x.k.s.km.common.jmx.JmxConnectorWrap : JMX connect exception, clientLogIdent:clusterPhyId: 1 brokerId: 2 host:b-2.pre-spot-market.xvod6s.c4.kafka.ap-southeast-1.amazonaws.com port:9099.
        at com.xiaojukeji.know.streaming.km.common.jmx.JmxConnectorWrap.createJmxConnector(JmxConnectorWrap.java:176)
        at com.xiaojukeji.know.streaming.km.common.jmx.JmxConnectorWrap.checkJmxConnectionAndInitIfNeed(JmxConnectorWrap.java:74)
2023-05-12 09:33:38.990 ERROR 2380 --- [kTP-5-thread-13] c.x.k.s.k.p.kafka.KafkaJMXClient : method=getClientWithCheck||clusterPhyId=1||brokerId=2||msg=get jmx connector failed!
2023-05-12 09:33:48.974 ERROR 2380 --- [d-1-9-thread-26] c.x.k.s.km.common.jmx.JmxConnectorWrap : JMX connect exception, clientLogIdent:clusterPhyId: 1 brokerId: 1 host:b-1.pre-spot-market.xvod6s.c4.kafka.ap-southeast-1.amazonaws.com port:9099.
        at com.xiaojukeji.know.streaming.km.common.jmx.JmxConnectorWrap.createJmxConnector(JmxConnectorWrap.java:176)
        at com.xiaojukeji.know.streaming.km.common.jmx.JmxConnectorWrap.checkJmxConnectionAndInitIfNeed(JmxConnectorWrap.java:74)
2023-05-12 09:33:48.974 ERROR 2380 --- [d-1-9-thread-26] c.x.k.s.k.p.kafka.KafkaJMXClient : method=getClientWithCheck||clusterPhyId=1||brokerId=1||msg=get jmx connector failed!

ZQKC commented 1 year ago

Since the configuration is fine, take a look at es/es.log and check whether the ES queries are throwing any exceptions.

ZQKC commented 1 year ago

The cause is the same as in https://github.com/didi/KnowStreaming/issues/1007.

menghe999 commented 1 year ago

Problem solved. The cause was that the ES shards were full; see "Cluster shards full" (集群Shard满). Thanks @ZQKC
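
For readers who hit the same symptom, a rough headroom check can show whether an Elasticsearch cluster is at its shard limit. This is a minimal sketch under stated assumptions, not the procedure used in this thread: it assumes "shards full" refers to Elasticsearch's cluster.max_shards_per_node limit (default 1000 per data node since Elasticsearch 7.0), it approximates the open shard count from the _cluster/health response, and shard_headroom and fetch_health are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def shard_headroom(health: dict, max_shards_per_node: int = 1000) -> int:
    """Approximate how many more shards the cluster can open.

    health is the parsed JSON body of GET /_cluster/health.
    max_shards_per_node is an assumption: pass the actual value of
    cluster.max_shards_per_node if it has been changed.
    """
    limit = max_shards_per_node * health["number_of_data_nodes"]
    # Unassigned and initializing shards also count toward the limit
    open_shards = (
        health["active_shards"]
        + health["initializing_shards"]
        + health["unassigned_shards"]
    )
    return limit - open_shards


def fetch_health(base_url: str) -> dict:
    """Fetch /_cluster/health from an unauthenticated Elasticsearch endpoint."""
    with urlopen(f"{base_url}/_cluster/health", timeout=5) as resp:
        return json.load(resp)


# Example (hypothetical endpoint):
#   print(shard_headroom(fetch_health("http://localhost:9200")))
```

A result at or below zero means new indices can no longer be created, which would stop new metric documents from being written. Typical ways out are deleting old indices or raising the limit; which one was applied here is not stated in the thread.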