alibaba / nacos

an easy-to-use dynamic service discovery, configuration and service management platform for building cloud native applications.
https://nacos.io
Apache License 2.0
29.91k stars 12.77k forks

Nacos misbehaves after adding the SkyWalking agent #11798

Closed JustUse closed 5 months ago

JustUse commented 5 months ago

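After adding SkyWalking agent 8.16 to a standalone Nacos 2.2.3 server (deployed in a container), Spring applications can no longer connect to Nacos. protocol-raft.log shows many attempts to join a cluster: the server is running in standalone mode, yet it keeps trying to join one. Removing SkyWalking agent 8.16 makes everything work again. Is something in Nacos conflicting with SkyWalking?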

2024-03-05 17:26:25,793 INFO create raft group : naming_persistent_service_v2

2024-03-05 17:26:26,353 INFO create raft group : naming_instance_metadata

2024-03-05 17:26:26,430 INFO This Raft event changes : RaftEvent{groupId='naming_persistent_service_v2', leader='10.10.135.35:7848', term=1, raftClusterInfo=[10.10.135.35:7848]}

2024-03-05 17:26:26,497 INFO This Raft event changes : RaftEvent{groupId='naming_instance_metadata', leader='10.10.135.35:7848', term=1, raftClusterInfo=[10.10.135.35:7848]}

2024-03-05 17:26:26,508 INFO create raft group : naming_service_metadata

2024-03-05 17:26:26,629 INFO This Raft event changes : RaftEvent{groupId='naming_service_metadata', leader='10.10.135.35:7848', term=1, raftClusterInfo=[10.10.135.35:7848]}

2024-03-05 17:26:27,845 ERROR Failed to join the cluster, retry...

java.lang.IllegalStateException: Fail to get leader of group naming_instance_metadata
    at com.alipay.sofa.jraft.core.CliServiceImpl.getPeers(CliServiceImpl.java:605)
    at com.alipay.sofa.jraft.core.CliServiceImpl.getPeers(CliServiceImpl.java:498)
    at com.alibaba.nacos.core.distributed.raft.JRaftServer.registerSelfToCluster(JRaftServer.java:353)
    at com.alibaba.nacos.core.distributed.raft.JRaftServer.lambda$createMultiRaftGroup$0(JRaftServer.java:264)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

2024-03-05 17:26:27,855 ERROR Failed to join the cluster, retry...

java.lang.IllegalStateException: Fail to get leader of group naming_service_metadata
    at com.alipay.sofa.jraft.core.CliServiceImpl.getPeers(CliServiceImpl.java:605)
    at com.alipay.sofa.jraft.core.CliServiceImpl.getPeers(CliServiceImpl.java:498)
    at com.alibaba.nacos.core.distributed.raft.JRaftServer.registerSelfToCluster(JRaftServer.java:353)
    at com.alibaba.nacos.core.distributed.raft.JRaftServer.lambda$createMultiRaftGroup$0(JRaftServer.java:264)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

2024-03-05 17:26:27,855 ERROR Failed to join the cluster, retry...

java.lang.IllegalStateException: Fail to get leader of group naming_persistent_service_v2
    at com.alipay.sofa.jraft.core.CliServiceImpl.getPeers(CliServiceImpl.java:605)
    at com.alipay.sofa.jraft.core.CliServiceImpl.getPeers(CliServiceImpl.java:498)
    at com.alibaba.nacos.core.distributed.raft.JRaftServer.registerSelfToCluster(JRaftServer.java:353)
    at com.alibaba.nacos.core.distributed.raft.JRaftServer.lambda$createMultiRaftGroup$0(JRaftServer.java:264)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

KomachiSion commented 5 months ago

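In standalone mode the Raft group contains only this one node, and it normally elects itself leader. The leader-election step still runs, though, and the node usually ends up communicating with itself (the details depend on how sofa-jraft handles this case). Once the agent is attached, it is unclear which classes it has enhanced, so some of that logic may no longer behave as expected. You will need to investigate this yourself. Check whether any other logs can help.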
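Why a single node can elect itself follows from Raft's majority rule. The class below is a minimal sketch of that arithmetic only. `RaftQuorum` and its methods are hypothetical names for illustration, not sofa-jraft or Nacos APIs, and the real election in sofa-jraft involves terms, timeouts, and RPCs that are not modeled here.

```java
// Illustrative sketch of Raft majority arithmetic; not sofa-jraft's election code.
final class RaftQuorum {
    private RaftQuorum() {}

    /** Smallest number of votes that forms a strict majority of the group. */
    static int majority(int peerCount) {
        if (peerCount < 1) {
            throw new IllegalArgumentException("peerCount must be >= 1");
        }
        return peerCount / 2 + 1;
    }

    /** A candidate votes for itself, so it wins once self + granted votes reach the majority. */
    static boolean winsElection(int peerCount, int votesGrantedByOthers) {
        return 1 + votesGrantedByOthers >= majority(peerCount);
    }

    public static void main(String[] args) {
        // One-node group, as in raftClusterInfo=[10.10.135.35:7848]: the self-vote is enough.
        System.out.println(winsElection(1, 0)); // true
        // Three-node group: the self-vote alone is not a majority.
        System.out.println(winsElection(3, 0)); // false
    }
}
```

This is consistent with the log above, where each group reports the local address as leader at term=1 with itself as the only member of raftClusterInfo.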

JustUse commented 5 months ago

I found an earlier issue, https://github.com/alibaba/nacos/issues/11422. Upgrading jraft-core.version in nacos-server to 1.3.14 fixes the problem of clients being unable to reach nacos-server.

But now a new problem has appeared :(

2024-03-07 08:36:11,394 INFO Start the RaftGroupService successfully.

2024-03-07 08:36:11,398 INFO onLeaderStart: term=1.

2024-03-07 08:36:11,583 INFO Creating new channel to: 10.10.226.168:7848.

2024-03-07 08:36:11,709 INFO The channel 10.10.226.168:7848 is in state: CONNECTING.

2024-03-07 08:36:12,593 ERROR Fail to connect 10.10.226.168:7848, remoting exception: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 0.871615505s. [closed=[], open=[[buffered_nanos=873532247, waiting_for_connection]]].

2024-03-07 08:36:12,593 ERROR Fail to connect 10.10.226.168:7848, remoting exception: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 0.871614245s. [closed=[], open=[[buffered_nanos=877920062, waiting_for_connection]]].

2024-03-07 08:36:12,593 ERROR Fail to connect peer 10.10.226.168:7848 to get leader for group naming_service_metadata.

2024-03-07 08:36:12,593 ERROR Fail to connect peer 10.10.226.168:7848 to get leader for group naming_instance_metadata.

2024-03-07 08:36:12,613 INFO The channel 10.10.226.168:7848 is in state: READY.

2024-03-07 08:36:12,613 INFO The channel 10.10.226.168:7848 has successfully established.

2024-03-07 08:36:12,691 ERROR Fail to connect 10.10.226.168:7848, remoting exception: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 0.871613875s. [closed=[], open=[[buffered_nanos=886196276, remote_addr=10.10.226.168/10.10.226.168:7848]]].

2024-03-07 08:36:12,692 ERROR Fail to connect peer 10.10.226.168:7848 to get leader for group naming_persistent_service_v2.

2024-03-07 08:36:13,661 WARN [GRPC] failed to send response.

java.lang.IllegalArgumentException: ContextSnapshot can't be null.
    at org.apache.skywalking.apm.agent.core.context.ContextManager.continued(ContextManager.java:163)
    at org.apache.skywalking.apm.plugin.grpc.v1.server.TracingServerCall.close(TracingServerCall.java:71)
    at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onCompleted(ServerCalls.java:395)
    at com.alipay.sofa.jraft.rpc.impl.GrpcServer$1.sendResponse(GrpcServer.java:154)
    at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:55)
    at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:35)
    at com.alipay.sofa.jraft.rpc.impl.GrpcServer.lambda$null$1(GrpcServer.java:194)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

2024-03-07 08:36:13,661 WARN [GRPC] failed to send response.

java.lang.IllegalArgumentException: ContextSnapshot can't be null.
    at org.apache.skywalking.apm.agent.core.context.ContextManager.continued(ContextManager.java:163)
    at org.apache.skywalking.apm.plugin.grpc.v1.server.TracingServerCall.close(TracingServerCall.java:71)
    at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onCompleted(ServerCalls.java:395)
    at com.alipay.sofa.jraft.rpc.impl.GrpcServer$1.sendResponse(GrpcServer.java:154)
    at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:55)
    at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:35)
    at com.alipay.sofa.jraft.rpc.impl.GrpcServer.lambda$null$1(GrpcServer.java:194)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

2024-03-07 08:36:13,697 WARN [GRPC] failed to send response.

java.lang.IllegalArgumentException: ContextSnapshot can't be null.
    at org.apache.skywalking.apm.agent.core.context.ContextManager.continued(ContextManager.java:163)
    at org.apache.skywalking.apm.plugin.grpc.v1.server.TracingServerCall.close(TracingServerCall.java:71)
    at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onCompleted(ServerCalls.java:395)
    at com.alipay.sofa.jraft.rpc.impl.GrpcServer$1.sendResponse(GrpcServer.java:154)
    at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:55)
    at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:35)
    at com.alipay.sofa.jraft.rpc.impl.GrpcServer.lambda$null$1(GrpcServer.java:194)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

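How can I check whether the enhanced classes are the cause? Is there a direction or method for investigating this, or are particular logs needed? I looked through the logs under nacos-server's logs directory and found no leads. Could this be a gRPC conflict?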


KomachiSion commented 5 months ago

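Logs about the enhancement will not be in Nacos. Check whether skywalking-agent is configured with a log output location. If it is not, you will have to investigate from the stack traces you already have. As long as Nacos runs fine without the agent, Nacos's own code is working correctly.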
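One way to start from the existing stack traces is to list the frames that belong to the agent, since those show which plugin sits in the failing call path. The helper below is a rough sketch under assumptions: `AgentFrameFinder` and `framesWithPrefix` are hypothetical names, and the regular expression only targets the `at package.Class.method(File.java:N)` frame format seen in the logs above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper: pick out stack frames whose class name starts with a given package prefix.
final class AgentFrameFinder {
    // Matches "at fully.qualified.Class.method(Source.java:123)" in flattened or multi-line traces.
    private static final Pattern FRAME =
            Pattern.compile("\\bat\\s+([\\w.$]+)\\.([\\w$<>]+)\\(([^)]*)\\)");

    private AgentFrameFinder() {}

    /** Returns "Class.method" for every frame whose class name starts with the prefix, in trace order. */
    static List<String> framesWithPrefix(String trace, String prefix) {
        List<String> hits = new ArrayList<>();
        Matcher m = FRAME.matcher(trace);
        while (m.find()) {
            String className = m.group(1);
            if (className.startsWith(prefix)) {
                hits.add(className + "." + m.group(2));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Excerpt copied from the WARN entries earlier in this thread.
        String trace = "java.lang.IllegalArgumentException: ContextSnapshot can't be null."
                + " at org.apache.skywalking.apm.agent.core.context.ContextManager.continued(ContextManager.java:163)"
                + " at org.apache.skywalking.apm.plugin.grpc.v1.server.TracingServerCall.close(TracingServerCall.java:71)"
                + " at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onCompleted(ServerCalls.java:395)"
                + " at com.alipay.sofa.jraft.rpc.impl.GrpcServer$1.sendResponse(GrpcServer.java:154)";
        for (String frame : framesWithPrefix(trace, "org.apache.skywalking.")) {
            System.out.println(frame);
        }
    }
}
```

Run against the trace from this thread, the agent frames come from `org.apache.skywalking.apm.plugin.grpc.v1.server`, which suggests the agent's gRPC server plugin is wrapping the response that sofa-jraft's `GrpcServer` sends. That narrows the search but does not by itself prove the root cause.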

JustUse commented 5 months ago


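What I fear most is exactly this: the software itself is fine, and problems appear only after skywalking-agent enhances it :(
Just like before, when only upgrading jraft-core.version in nacos-server to 1.3.14 fixed clients being unable to connect to the server.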

KomachiSion commented 5 months ago


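If the enhancement is what broke, how would Nacos's native code know what was enhanced or which enhancement failed? Asking the Nacos community about this is expecting too much. The exception is when it is clear exactly what caused the enhancement to fail or not apply. In that case we can discuss a solution together.

Otherwise, if all you can say is that it breaks with the enhancement and works without it, how is the community supposed to solve it for you? The enhancement code is not even in Nacos...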

KomachiSion commented 5 months ago

No more response from author, and current information is not enough to judge problem.