Closed · zacharias1989 closed this issue 2 months ago
Whether the health check for port 8091 is configured. Is data consistency normal?
What health check? According to the official documentation, doesn't Seata only have an empty heartbeat packet? Why would there be a health check at all?
I haven't configured any additional health checks. From the client-side logs, this looks like a watch request initiated by the client, but the packet was too large and got discarded by the server, so the client reported a timeout.

2024-05-10 19:19:25.817 ERROR 1 --- [eshMetadata_1_1] i.s.d.r.raft.RaftRegistryServiceImpl : watch cluster node: 10.0.0.146:8091, fail: 10.0.0.146:8091 failed to respond
2024-05-10 19:19:41.823 ERROR 1 --- [eshMetadata_1_1] i.s.d.r.raft.RaftRegistryServiceImpl : watch cluster node: 10.0.0.145:8091, fail: 10.0.0.145:8091 failed to respond
2024-05-10 19:19:43.827 ERROR 1 --- [eshMetadata_1_1] i.s.d.r.raft.RaftRegistryServiceImpl : Read timed out
java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method) ~[na:1.8.0_212]
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[na:1.8.0_212]
	at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[na:1.8.0_212]
	at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[na:1.8.0_212]
	at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[httpcore-4.4.15.jar!/:4.4.15]
	at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153) ~[httpcore-4.4.15.jar!/:4.4.15]
	at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280) ~[httpcore-4.4.15.jar!/:4.4.15]
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138) ~[httpclient-4.5.13.jar!/:4.5.13]
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[httpclient-4.5.13.jar!/:4.5.13]
	at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[httpcore-4.4.15.jar!/:4.4.15]
	at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[httpcore-4.4.15.jar!/:4.4.15]
	at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157) ~[httpclient-4.5.13.jar!/:4.5.13]
	at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[httpcore-4.4.15.jar!/:4.4.15]
	at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[httpcore-4.4.15.jar!/:4.4.15]
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[httpclient-4.5.13.jar!/:4.5.13]
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[httpclient-4.5.13.jar!/:4.5.13]
	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.13.jar!/:4.5.13]
	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.13.jar!/:4.5.13]
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.13.jar!/:4.5.13]
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.13.jar!/:4.5.13]
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[httpclient-4.5.13.jar!/:4.5.13]
	at io.seata.common.util.HttpClientUtil.doGet(HttpClientUtil.java:120) ~[seata-all-2.0.0.jar!/:2.0.0]
	at io.seata.discovery.registry.raft.RaftRegistryServiceImpl.acquireClusterMetaData(RaftRegistryServiceImpl.java:294) [seata-all-2.0.0.jar!/:2.0.0]
	at io.seata.discovery.registry.raft.RaftRegistryServiceImpl.lambda$null$0(RaftRegistryServiceImpl.java:153) [seata-all-2.0.0.jar!/:2.0.0]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) ~[na:1.8.0_212]
	at java.util.concurrent.ConcurrentHashMap$KeySpliterator.forEachRemaining(ConcurrentHashMap.java:3527) ~[na:1.8.0_212]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[na:1.8.0_212]
	at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291) ~[na:1.8.0_212]
	at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) ~[na:1.8.0_212]
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) ~[na:1.8.0_212]
	at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) ~[na:1.8.0_212]
	at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734) ~[na:1.8.0_212]
	at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160) ~[na:1.8.0_212]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174) ~[na:1.8.0_212]
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) ~[na:1.8.0_212]
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[na:1.8.0_212]
	at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583) ~[na:1.8.0_212]
	at io.seata.discovery.registry.raft.RaftRegistryServiceImpl.lambda$startQueryMetadata$1(RaftRegistryServiceImpl.java:151) [seata-all-2.0.0.jar!/:2.0.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_212]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_212]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.77.Final.jar!/:4.1.77.Final]
	at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_212]
rpc.netty.v1.ProtocolV1Decoder
I also see a WARN in the seata-server raft cluster logs: WARN --- [main] [ay.sofa.jraft.RaftGroupService] [start] [] : RPC server is not started in RaftGroupService. Not sure whether that has any impact.
The server side runs as a k8s StatefulSet, configured as follows:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: seata-raft
  namespace: seata
spec:
  podManagementPolicy: Parallel
  serviceName: seata-headless
  replicas: 3
  template:
    metadata:
      labels:
        app: seata-raft-svc
        env: prod
      annotations:
        pod.alpha.kubernetes.io/initialized: "true"
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100 # highest priority
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: "app"
                      operator: In
                      values:
                        - seata-raft-svc
                topologyKey: "kubernetes.io/hostname"
      volumes:
        - name: host-time
          hostPath:
            path: /etc/localtime
            type: ''
        - name: seata-raft-cm
          configMap:
            name: seata-raft-cm
            items:
              - key: application.yml
                path: application.yml
            defaultMode: 420
      containers:
        - name: seata-raft
          imagePullPolicy: IfNotPresent
          image: docker.io/seataio/seata-server:2.0.0-slim
          resources: {}
          ports:
            - name: server
              containerPort: 7091
              protocol: TCP
            - name: cluster
              containerPort: 8091
              protocol: TCP
          env:
            - name: SEATA_SERVER_RAFT_SERVER_ADDR
              value: seata-raft-0.seata-headless.seata.svc.cluster.local:9091,seata-raft-1.seata-headless.seata.svc.cluster.local:9091,seata-raft-2.seata-headless.seata.svc.cluster.local:9091
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: SEATA_IP
              value: $(POD_NAME).seata-headless.seata.svc.cluster.local
          volumeMounts:
            - name: host-time
              mountPath: /etc/localtime
              readOnly: true
            - name: seata-raft-cm
              readOnly: true
              mountPath: /seata-server/resources/application.yml
              subPath: application.yml
            - name: data
              mountPath: /seata-server/sessionStore
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
  volumeClaimTemplates:
apiVersion: v1
kind: Service
metadata:
  name: seata-headless
  namespace: seata
spec:
  publishNotReadyAddresses: true
  ports:
    - port: 8091
      name: cluster
      targetPort: 8091
  clusterIP: None
  selector:
    app: seata-raft-svc
    env: prod

The configuration file contents are as follows:

apiVersion: v1
kind: ConfigMap
metadata:
  name: seata-raft-cm
  namespace: seata
data:
  application.yml: |
    server:
      port: 7091
    spring:
      application:
        name: seata-server
    logging:
      config: classpath:logback-spring.xml
      file:
        path: ${log.home:${user.home}/logs/seata}
      extend:
        logstash-appender:
          destination: 127.0.0.1:4560
        kafka-appender:
          bootstrap-servers: 127.0.0.1:9092
          topic: logback_to_logstash
    console:
      user:
        username: seata
        password: seata
    seata:
      server:
        raft:
          group: default # the raft cluster's group; the client's transaction branch must use a matching value
          server-addr: ${SEATA_SERVER_RAFT_SERVER_ADDR} # actual value: seata-raft-0.seata-headless.seata.svc.cluster.local:9091,seata-raft-1.seata-headless.seata.svc.cluster.local:9091,seata-raft-2.seata-headless.seata.svc.cluster.local:9091
          snapshot-interval: 600 # take a snapshot every 600s so the raft log can roll quickly; if there is a lot of transaction data in memory, each snapshot causes RT jitter every 600s, but it makes failure recovery friendlier and node restarts faster; can be raised to 30 minutes or 1 hour depending on the workload - load-test to find a balance between RT jitter and recovery time
          apply-batch: 32 # commit the raft log after at most 32 batched actions
          max-append-bufferSize: 262144 # maximum size of the log storage buffer, default 256K
          max-replicator-inflight-msgs: 256 # maximum number of in-flight requests when pipelining is enabled, default 256
          disruptor-buffer-size: 16384 # internal disruptor buffer size; increase it for high write throughput, default 16384
          election-timeout-ms: 1000 # how long without a leader heartbeat before a re-election starts
          reporter-enabled: false # whether raft's own monitoring is enabled
          reporter-initial-delay: 60 # monitoring interval
          serialization: jackson # serialization method, do not change
          compressor: none # raft log compression, e.g. gzip, zstd
          sync: true # raft log flush mode, synchronous by default
      config:
        type: file
      registry:
        type: file
      store:
        mode: raft
        file:
          dir: sessionStore
      security:
        secretKey: xxxx
        tokenValidityInMilliseconds: 1800000
        ignore:
          urls: /,/**/*.css,/**/*.js,/**/*.html,/**/*.map,/**/*.svg,/**/*.png,/**/*.jpeg,/**/*.ico,/api/v1/auth/login,/metadata/v1/**
The client configuration file is:

server:
  port: 8080

seata:
  enabled: true
  registry:
    type: raft
    raft:
      server-addr: ${SEATA_SERVER_RAFT_SERVER_ADDR} # actual value: seata-raft-0.seata-headless.seata.svc.cluster.local:7091,seata-raft-1.seata-headless.seata.svc.cluster.local:7091,seata-raft-2.seata-headless.seata.svc.cluster.local:7091
  tx-service-group: default_tx_group
  service:
    vgroup-mapping:
      default_tx_group: default
  application-id: ${spring.application.name}
Please set a breakpoint in the queryHttpAddress method of https://github.com/apache/incubator-seata/blob/v2.0.0/discovery/seata-discovery-raft/src/main/java/io/seata/discovery/registry/raft/RaftRegistryServiceImpl.java and post a screenshot of the contents of addressList.
String host = inetSocketAddress.getAddress().getHostAddress();
This code is used incorrectly: when the InetSocketAddress is created via new InetSocketAddress(hostname, port), getHostAddress() returns the DNS-resolved IP, which causes the problem described in this issue.
Simply switching to getHostString() is not enough. The raft side hands the node out as InetSocketAddress(hostname, port); netty then converts it to a string, performing DNS resolution and turning it into InetSocketAddress(ip, port); after the health check, the resolved InetSocketAddress is what gets returned. So getHostString() may well yield an IP that cannot be matched against the host in the node metadata, and when the match fails the client falls back to port 8091. There are quite a few moving parts here.

Option 1: in the raft implementation, change the comparison logic in queryHttpAddress to construct InetSocketAddress objects and compare both sides via getHostAddress().

Option 2: change toStringAddress in NetUtil to use getHostString() so that no DNS resolution happens; the raft implementation can then compare getHostString() directly against the host in the metadata node. A hostname stays a hostname, and if the raft cluster is assembled from IPs it stays IPs; no DNS resolution is involved, which is both more efficient and more consistent.
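A minimal sketch of the difference the two options hinge on ("localhost" stands in for a pod hostname like seata-raft-0.seata-headless.seata.svc.cluster.local; the class name is made up for illustration):

```java
import java.net.InetSocketAddress;

public class HostResolveDemo {
    public static void main(String[] args) {
        InetSocketAddress addr = new InetSocketAddress("localhost", 8091);

        // getHostString() keeps the literal host the address was created with,
        // without forcing a reverse/forward DNS lookup.
        System.out.println(addr.getHostString()); // localhost

        // getAddress().getHostAddress() returns the DNS-resolved IP, so it can
        // no longer be string-compared against the hostname stored in the
        // raft metadata node.
        System.out.println(addr.getAddress().getHostAddress()); // e.g. 127.0.0.1
    }
}
```

Because netty's stringification already resolves the address before it is handed back, the comparison only stays consistent if both sides are normalized the same way, which is exactly what the two options above do.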
Ⅰ. Issue Description
In a k8s environment I created a raft-mode cluster from the official 2.0.0-slim image. A business-system client connects normally, the seata server log shows that both RM and TM register success, and client and server are both version 2.0.0. The server then keeps reporting Decode frame error, cause: Adjusted frame length exceeds 8388608: 1411395437 - discarded, and the client correspondingly reports read timed out.
Ⅱ. Describe what happened
If there is an exception, please attach the exception trace:

11:44:24.357 INFO --- [rverHandlerThread_1_1_500] [rocessor.server.RegRmProcessor] [onRegRmMessage] [] : RM register success,message:RegisterRMRequest{resourceIds='jdbc:mysql://192.168.0.162:3306/test', version='2.0.0', applicationId='test-service', transactionServiceGroup='default_tx_group', extraData='null'},channel:[id: 0x3fd908a9, L:/10.0.0.40:8091 - R:/10.0.0.41:55964],client version:2.0.0
11:44:51.374 ERROR --- [ettyServerNIOWorker_1_1_2] [rpc.netty.v1.ProtocolV1Decoder] [decode] [] : Decode frame error, cause: Adjusted frame length exceeds 8388608: 1411395437 - discarded
11:45:06.378 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [userEventTriggered] [] : channel:[id: 0xc8556689, L:/10.0.0.40:8091 - R:/10.0.0.41:56324] read idle.
11:45:06.378 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [handleDisconnect] [] : 10.0.0.41:56324 to server channel inactive.
11:45:06.378 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [handleDisconnect] [] : remove unused channel:[id: 0xc8556689, L:/10.0.0.40:8091 - R:/10.0.0.41:56324]
11:45:06.378 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [hannelHandlerContext] [] : closeChannelHandlerContext channel:[id: 0xc8556689, L:/10.0.0.40:8091 - R:/10.0.0.41:56324]
11:45:06.379 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [handleDisconnect] [] : 10.0.0.41:56324 to server channel inactive.
11:45:06.380 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [handleDisconnect] [] : remove unused channel:[id: 0xc8556689, L:/10.0.0.40:8091 ! R:/10.0.0.41:56324]
11:45:07.383 ERROR --- [ettyServerNIOWorker_1_2_2] [rpc.netty.v1.ProtocolV1Decoder] [decode] [] : Decode frame error, cause: Adjusted frame length exceeds 8388608: 539979109 - discarded
11:45:08.384 INFO --- [ettyServerNIOWorker_1_2_2] [ty.AbstractNettyRemotingServer] [handleDisconnect] [] : 10.0.0.41:56516 to server channel inactive.
11:45:08.384 INFO --- [ettyServerNIOWorker_1_2_2] [ty.AbstractNettyRemotingServer] [handleDisconnect] [] : remove unused channel:[id: 0xf3946fc7, L:0.0.0.0/0.0.0.0:8091 ! R:/10.0.0.41:56516]
11:45:24.398 ERROR --- [ettyServerNIOWorker_1_1_2] [rpc.netty.v1.ProtocolV1Decoder] [decode] [] : Decode frame error, cause: Adjusted frame length exceeds 8388608: 1411395437 - discarded
11:45:39.399 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [userEventTriggered] [] : channel:[id: 0x86a2d5a8, L:/10.0.0.40:8091 - R:/10.0.0.41:56708] read idle.
11:45:39.400 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [handleDisconnect] [] : 10.0.0.41:56708 to server channel inactive.
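A cross-check worth verifying (my own decoding, not from the thread): interpreting the bogus "frame lengths" in the error as 4 ASCII bytes yields fragments of an HTTP request line, which is consistent with the client's HTTP metadata/watch requests being sent to the netty transaction port 8091 and misread as binary frames:

```java
import java.nio.charset.StandardCharsets;

public class FrameLengthDemo {
    // Reinterpret a 32-bit "frame length" as the 4 raw ASCII bytes it came from.
    static String ascii(int len) {
        return new String(new byte[] {
            (byte) (len >>> 24), (byte) (len >>> 16),
            (byte) (len >>> 8),  (byte) len
        }, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Both values from the log decode to overlapping windows of "GET /me...",
        // i.e. the start of an HTTP request line such as "GET /metadata/...".
        System.out.println(ascii(1411395437)); // "T /m"
        System.out.println(ascii(539979109));  // " /me"
    }
}
```

This supports the explanation elsewhere in the thread: an oversized-frame error on 8091 is really an HTTP request landing on the wrong port, not a genuinely huge seata frame.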
Ⅲ. Describe what you expected to happen
Since TM and RM registered successfully, the configuration should be fine; nothing else was done, so these errors should not occur.
Ⅳ. How to reproduce it (as minimally and precisely as possible)
Minimal yet complete reproducer code (or URL to code):
Ⅴ. Anything else we need to know?
Ⅵ. Environment:
JDK version (e.g. java -version):
OS (e.g. uname -a):
Others: