apache / incubator-seata

:fire: Seata is an easy-to-use, high-performance, open source distributed transaction solution.
https://seata.apache.org/

With a 2.0.0 Raft-mode cluster, the server keeps reporting "Decode frame error" after the client's RM and TM connect successfully #6532

Closed · zacharias1989 closed this 2 months ago

zacharias1989 commented 4 months ago

Ⅰ. Issue Description

In a Kubernetes environment I created a Raft-mode cluster from the official 2.0.0-slim image. The business-system client connects normally, and the Seata server log shows that both RM and TM register successfully; client and server are both version 2.0.0. The server then keeps reporting `Decode frame error, cause: Adjusted frame length exceeds 8388608: 1411395437 - discarded`, and the client correspondingly reports read timed out.

Ⅱ. Describe what happened

If there is an exception, please attach the exception trace:

```
11:44:24.357 INFO --- [rverHandlerThread_1_1_500] [rocessor.server.RegRmProcessor] [onRegRmMessage] [] : RM register success,message:RegisterRMRequest{resourceIds='jdbc:mysql://192.168.0.162:3306/test', version='2.0.0', applicationId='test-service', transactionServiceGroup='default_tx_group', extraData='null'},channel:[id: 0x3fd908a9, L:/10.0.0.40:8091 - R:/10.0.0.41:55964],client version:2.0.0
11:44:51.374 ERROR --- [ettyServerNIOWorker_1_1_2] [rpc.netty.v1.ProtocolV1Decoder] [decode] [] : Decode frame error, cause: Adjusted frame length exceeds 8388608: 1411395437 - discarded
11:45:06.378 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [userEventTriggered] [] : channel:[id: 0xc8556689, L:/10.0.0.40:8091 - R:/10.0.0.41:56324] read idle.
11:45:06.378 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [handleDisconnect] [] : 10.0.0.41:56324 to server channel inactive.
11:45:06.378 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [handleDisconnect] [] : remove unused channel:[id: 0xc8556689, L:/10.0.0.40:8091 - R:/10.0.0.41:56324]
11:45:06.378 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [hannelHandlerContext] [] : closeChannelHandlerContext channel:[id: 0xc8556689, L:/10.0.0.40:8091 - R:/10.0.0.41:56324]
11:45:06.379 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [handleDisconnect] [] : 10.0.0.41:56324 to server channel inactive.
11:45:06.380 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [handleDisconnect] [] : remove unused channel:[id: 0xc8556689, L:/10.0.0.40:8091 ! R:/10.0.0.41:56324]
11:45:07.383 ERROR --- [ettyServerNIOWorker_1_2_2] [rpc.netty.v1.ProtocolV1Decoder] [decode] [] : Decode frame error, cause: Adjusted frame length exceeds 8388608: 539979109 - discarded
11:45:08.384 INFO --- [ettyServerNIOWorker_1_2_2] [ty.AbstractNettyRemotingServer] [handleDisconnect] [] : 10.0.0.41:56516 to server channel inactive.
11:45:08.384 INFO --- [ettyServerNIOWorker_1_2_2] [ty.AbstractNettyRemotingServer] [handleDisconnect] [] : remove unused channel:[id: 0xf3946fc7, L:0.0.0.0/0.0.0.0:8091 ! R:/10.0.0.41:56516]
11:45:24.398 ERROR --- [ettyServerNIOWorker_1_1_2] [rpc.netty.v1.ProtocolV1Decoder] [decode] [] : Decode frame error, cause: Adjusted frame length exceeds 8388608: 1411395437 - discarded
11:45:39.399 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [userEventTriggered] [] : channel:[id: 0x86a2d5a8, L:/10.0.0.40:8091 - R:/10.0.0.41:56708] read idle.
11:45:39.400 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [handleDisconnect] [] : 10.0.0.41:56708 to server channel inactive.
```
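For reference, a minimal sketch (my own illustration, not part of the original report): the "frame lengths" in the error decode to ASCII fragments of HTTP request lines, which is consistent with an HTTP request from the client (for example the registry watch) arriving at the Netty RPC port 8091, where the length-field decoder reads four bytes of request text as a frame length.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Interpret the reported "frame lengths" as 4 raw ASCII bytes.
public class FrameLengthAsAscii {
    public static void main(String[] args) {
        for (int len : new int[] {1411395437, 539979109}) {
            byte[] bytes = ByteBuffer.allocate(4).putInt(len).array();
            // 1411395437 -> "T /m" and 539979109 -> " /me": fragments of HTTP
            // request lines such as "POST /..." and "GET /me...", read at the
            // offset where the Seata protocol decoder expects a length field.
            System.out.println(len + " -> \"" + new String(bytes, StandardCharsets.US_ASCII) + "\"");
        }
    }
}
```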

Ⅲ. Describe what you expected to happen

Since the TM and RM registered successfully, the configuration should be fine. Nothing else was done, so these errors should not occur.

Ⅳ. How to reproduce it (as minimally and precisely as possible)

  1. xxx
  2. xxx
  3. xxx

Minimal yet complete reproducer code (or URL to code):

Ⅴ. Anything else we need to know?

Ⅵ. Environment:

slievrly commented 4 months ago

Is a health check configured for port 8091? Is data consistency normal?

zacharias1989 commented 4 months ago

> Is a health check configured for port 8091? Is data consistency normal?

What health check? According to the official docs, doesn't Seata only send an empty heartbeat packet? Why would there be a health check?

zacharias1989 commented 4 months ago

> Is a health check configured for port 8091? Is data consistency normal?

I did not configure any additional health check. From the client-side logs, this looks like a watch request initiated by the client; the packet was too large and was discarded by the server, so the client reports a timeout.

```
2024-05-10 19:19:25.817 ERROR 1 --- [eshMetadata_1_1] i.s.d.r.raft.RaftRegistryServiceImpl : watch cluster node: 10.0.0.146:8091, fail: 10.0.0.146:8091 failed to respond
2024-05-10 19:19:41.823 ERROR 1 --- [eshMetadata_1_1] i.s.d.r.raft.RaftRegistryServiceImpl : watch cluster node: 10.0.0.145:8091, fail: 10.0.0.145:8091 failed to respond
2024-05-10 19:19:43.827 ERROR 1 --- [eshMetadata_1_1] i.s.d.r.raft.RaftRegistryServiceImpl : Read timed out

java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method) ~[na:1.8.0_212]
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[na:1.8.0_212]
    at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[na:1.8.0_212]
    at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[na:1.8.0_212]
    at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[httpcore-4.4.15.jar!/:4.4.15]
    at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153) ~[httpcore-4.4.15.jar!/:4.4.15]
    at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280) ~[httpcore-4.4.15.jar!/:4.4.15]
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138) ~[httpclient-4.5.13.jar!/:4.5.13]
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[httpclient-4.5.13.jar!/:4.5.13]
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[httpcore-4.4.15.jar!/:4.4.15]
    at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[httpcore-4.4.15.jar!/:4.4.15]
    at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157) ~[httpclient-4.5.13.jar!/:4.5.13]
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[httpcore-4.4.15.jar!/:4.4.15]
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[httpcore-4.4.15.jar!/:4.4.15]
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[httpclient-4.5.13.jar!/:4.5.13]
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[httpclient-4.5.13.jar!/:4.5.13]
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.13.jar!/:4.5.13]
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.13.jar!/:4.5.13]
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.13.jar!/:4.5.13]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.13.jar!/:4.5.13]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[httpclient-4.5.13.jar!/:4.5.13]
    at io.seata.common.util.HttpClientUtil.doGet(HttpClientUtil.java:120) ~[seata-all-2.0.0.jar!/:2.0.0]
    at io.seata.discovery.registry.raft.RaftRegistryServiceImpl.acquireClusterMetaData(RaftRegistryServiceImpl.java:294) [seata-all-2.0.0.jar!/:2.0.0]
    at io.seata.discovery.registry.raft.RaftRegistryServiceImpl.lambda$null$0(RaftRegistryServiceImpl.java:153) [seata-all-2.0.0.jar!/:2.0.0]
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) ~[na:1.8.0_212]
    at java.util.concurrent.ConcurrentHashMap$KeySpliterator.forEachRemaining(ConcurrentHashMap.java:3527) ~[na:1.8.0_212]
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[na:1.8.0_212]
    at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291) ~[na:1.8.0_212]
    at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) ~[na:1.8.0_212]
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) ~[na:1.8.0_212]
    at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) ~[na:1.8.0_212]
    at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734) ~[na:1.8.0_212]
    at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160) ~[na:1.8.0_212]
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174) ~[na:1.8.0_212]
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) ~[na:1.8.0_212]
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[na:1.8.0_212]
    at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583) ~[na:1.8.0_212]
    at io.seata.discovery.registry.raft.RaftRegistryServiceImpl.lambda$startQueryMetadata$1(RaftRegistryServiceImpl.java:151) [seata-all-2.0.0.jar!/:2.0.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_212]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_212]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.77.Final.jar!/:4.1.77.Final]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_212]
```

zacharias1989 commented 4 months ago

> rpc.netty.v1.ProtocolV1Decoder

I also see a warning in the seata-server Raft cluster log: `WARN --- [main] [ay.sofa.jraft.RaftGroupService] [start] [] : RPC server is not started in RaftGroupService.` I'm not sure whether this has any impact.

zacharias1989 commented 4 months ago

The server side uses the following Kubernetes StatefulSet configuration:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: seata-raft
  namespace: seata
spec:
  podManagementPolicy: Parallel
  serviceName: seata-headless
  replicas: 3
  template:
    metadata:
      labels:
        app: seata-raft-svc
        env: prod
      annotations:
        pod.alpha.kubernetes.io/initialized: "true"
    spec:
      affinity:
        podAntiAffinity:
          # prefer not to schedule replicas onto the same node
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100 # highest priority
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: "app"
                      operator: In
                      values:
                        - seata-raft-svc
                topologyKey: "kubernetes.io/hostname"
      volumes:
        - name: host-time
          hostPath:
            path: /etc/localtime
            type: ''
        - name: seata-raft-cm
          configMap:
            name: seata-raft-cm
            items:
              - key: application.yml
                path: application.yml
            defaultMode: 420
      containers:
        - name: seata-raft
          imagePullPolicy: IfNotPresent
          image: docker.io/seataio/seata-server:2.0.0-slim
          resources: {}
          ports:
            - name: server
              containerPort: 7091
              protocol: TCP
            - name: cluster
              containerPort: 8091
              protocol: TCP
          env:
            # - name: seata.server.raft.server-addr
            - name: SEATA_SERVER_RAFT_SERVER_ADDR
              value: seata-raft-0.seata-headless.seata.svc.cluster.local:9091,seata-raft-1.seata-headless.seata.svc.cluster.local:9091,seata-raft-2.seata-headless.seata.svc.cluster.local:9091
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: SEATA_IP
              value: $(POD_NAME).seata-headless.seata.svc.cluster.local
          volumeMounts:
            - name: host-time
              mountPath: /etc/localtime
              readOnly: true
            - name: seata-raft-cm
              readOnly: true
              mountPath: /seata-server/resources/application.yml
              subPath: application.yml
            - name: data
              mountPath: /seata-server/sessionStore
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
  volumeClaimTemplates:
```

zacharias1989 commented 4 months ago

The client configuration file is:

```yaml
server:
  port: 8080

seata:
  enabled: true
  registry:
    type: raft
    raft:
      # actual value: seata-raft-0.seata-headless.seata.svc.cluster.local:7091,seata-raft-1.seata-headless.seata.svc.cluster.local:7091,seata-raft-2.seata-headless.seata.svc.cluster.local:7091
      server-addr: ${SEATA_SERVER_RAFT_SERVER_ADDR}
  tx-service-group: default_tx_group
  service:
    vgroup-mapping:
      default_tx_group: default
  application-id: ${spring.application.name}
```

funky-eyes commented 4 months ago

Please set a breakpoint in the queryHttpAddress method of https://github.com/apache/incubator-seata/blob/v2.0.0/discovery/seata-discovery-raft/src/main/java/io/seata/discovery/registry/raft/RaftRegistryServiceImpl.java and post a screenshot of the contents of addressList.

funky-eyes commented 4 months ago

`String host = inetSocketAddress.getAddress().getHostAddress();` — this code is used incorrectly: when the inetSocketAddress is created via `new InetSocketAddress(hostname, port)`, calling getHostAddress retrieves the resolved IP address, which causes the problem described in this issue.
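A minimal sketch of the JDK behavior being described (my own illustration, not Seata code; the hostname is just an example):

```java
import java.net.InetSocketAddress;

public class HostAddressDemo {
    public static void main(String[] args) {
        // Constructing from a hostname resolves it eagerly (when DNS succeeds).
        InetSocketAddress addr =
            new InetSocketAddress("seata-raft-0.seata-headless.seata.svc.cluster.local", 8091);
        // getHostString() returns the original hostname, with no DNS lookup.
        System.out.println(addr.getHostString());
        // getAddress() is null if resolution failed; otherwise getHostAddress()
        // returns the resolved IP (e.g. "10.0.0.40") - the mismatch described above.
        if (addr.getAddress() != null) {
            System.out.println(addr.getAddress().getHostAddress());
        }
    }
}
```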

funky-eyes commented 4 months ago

> `String host = inetSocketAddress.getAddress().getHostAddress();` — this code is used incorrectly: when the inetSocketAddress is created via `new InetSocketAddress(hostname, port)`, calling getHostAddress retrieves the resolved IP address, which causes the problem described in this issue.

Simply switching to getHostString is not enough. The Raft side hands the node out as InetSocketAddress(hostname, port); Netty then converts it to a string, which performs DNS resolution, so after the health check what comes back is InetSocketAddress(ip, port). Calling getHostString on that resolved address may therefore yield an IP that cannot be matched against the host stored in the metadata node, and when it fails to match the client falls back to port 8091. Quite a few places are involved here. A sketch of option 1 follows this list.

Option 1: in the Raft implementation's queryHttpAddress, change the comparison logic to construct InetSocketAddress objects on both sides and compare them via getHostAddress.

Option 2: change toStringAddress in NetUtil to use getHostString so that no DNS resolution happens; the Raft implementation can then compare getHostString directly with the host in the metadata node. A hostname stays a hostname, and if the Raft cluster is assembled from IPs it stays IPs. No DNS resolution is involved, which is more efficient and more consistent.
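A rough sketch of what option 1 could look like (a hypothetical helper, not the actual patch): normalize both sides to InetSocketAddress and compare resolved IPs, so a hostname from the metadata and a Netty-resolved address can still match.

```java
import java.net.InetSocketAddress;

public class AddressMatchSketch {
    // Hypothetical comparison helper for queryHttpAddress-style matching.
    static boolean sameEndpoint(String metadataHost, int metadataPort, InetSocketAddress actual) {
        InetSocketAddress expected = new InetSocketAddress(metadataHost, metadataPort);
        if (expected.isUnresolved() || actual.isUnresolved()) {
            // Without DNS, fall back to comparing the literal host strings.
            return expected.getHostString().equals(actual.getHostString())
                    && expected.getPort() == actual.getPort();
        }
        // With DNS, compare resolved IPs so a hostname and its IP still match.
        return expected.getAddress().getHostAddress().equals(actual.getAddress().getHostAddress())
                && expected.getPort() == actual.getPort();
    }
}
```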