apache / incubator-seata

:fire: Seata is an easy-to-use, high-performance, open source distributed transaction solution.
https://seata.apache.org/
Apache License 2.0
25.19k stars 8.75k forks source link

BUG - TCC 二阶段 commit 超时必出异常,导致客户端无法重连服务端。 #2624

Closed WindyDovs closed 4 years ago

WindyDovs commented 4 years ago

Ⅰ. Issue Description

Seata 版本: V1.1.0 场景是模拟 commit 超时的情况,让 commit 跑 61 秒,期间 server 端会进行重发 commit 请求,重发的请求因为获取不到数据库锁,会报锁超时, commit 失败;等到第一个 commit 请求 61 秒等待结束后,提交数据,释放数据锁,再等到下次 commit 请求过来时,幂等返回结果,二阶段提交成功。

但在这个超时过程中,会发生多种异常,异常堆栈见下方,异常过后,客户端与服务端连接出现问题,所有请求均超时,无法再进行全局事务请求,需要重启客户端才可以恢复正常。

Ⅱ. Describe what happened

异常如下: io.seata.core.rpc.netty.AbstractRpcRemoting - wait response error:cost 30004ms,ip:xxxxxx:8091,request:xid=xxxxxx:8091:2041601441,extraData=null io.seata.tm.api.DefaultGlobalTransaction - Failed to report global commit [全局事务ID], Retry Countdown: 5, reason: RPC timeout

堆栈用了图片转文字,可能有点奇怪,但大概如下: 堆栈一: java. lang. IllegalArqumentExc2020-04-27 18:47:03.295 -ERROR [NettyClientSelector TMROLE 1] seata corelrboe netly 1 protocoivneclttluxception: not support typeCode,-9510io.seata.core.rpc.netty.v1.ProtocolV1Decoder - Decode frame error!

  at io.seata.serializer .seata .MessageCodecFactory getMergeResponseInstanceByCode (MessageCodecFactory java:342)

at io.seata.serializer.seata. Seataseriai0.seata. serializer. seata.MessageCodecFactory . getMessage (lizer.deserialize (SeataSerializer.java:78)(MessageCodecFactory-java:281)at io.seata.core.rpc.netty.v1.ProtocollDecoder .decodeFrame (ProtocolVlDecoder.java:141)1t io.seata.core. rpc.netty.v1. ProtocolV1Decoder . decode (ProtocolVlDecoder.java:86)

  t io.netty. handler . codec . LengthFieldBasedFrameDecoder. decode (LengthFieldBasedFrameDecoder -java :343)

  at i0-netty handler . codec.ByteToMessageDecoder .decodeRemovalReentryProtection (ByteToMssageDecoder. java:489)at io.netty.handler.codec.ByteToMessageDecoder . callDecode (ByteToMessageDecoder .java:428)

  io. netty handler.codec.ByteToMessageDecoder . channelRead (ByteToMessageDecoder.java:265)ioneteo

  channel . AbstractChannelHandlerContext . invo keChannelRead (AbstractChannelHandlerContext.java:362)1t io.netty.channel.AbstractchannelHandlerContext . invokeChannelRead (AbstractChannelHandlerContextijava:348)ati0.netty channel. AbstractChannelHandlerContext. fireChannelRead (AbstractChannelHandlerContext.java:340)t io.netty. handler. timeout.IdlestateHandler.channelRead (IdlestateHandler.java:286)

  at io.netty. channel. bstractChannelHandlerContext. invokeChannelRead (AbstractChannelHandlerContext . java:362)t io.netty-channel.AbstractchannelHandlerontext.invokehannelRead (AbstractChanne lHandlerContext .java:348)t io.netty channel.AbstractChannelHandlerContext. fireChannelRead (AbstractChannelHandlerContextjava:340)t io.netty. channel. DefaultChannelPipelinesHeadcontext . channelRead (DefaultChannelPipeline.java: 1414)

  t io.netty.channel.bstractChannelHandlerContext invokeChannelRead (AbstractChannelHandlerContext . java:362)io.netty channel. AbstractChannelHandlerContext. invokechannelRead (AbstractchannelHandlerContext. java: 348)

  o.netty. channel. DefaultChannelPipeline. fireChannelRead (De faultChannelPipeline.java:945)

  t io. netty. channel. nio. AbstractNioByteChannel$NioByteUnsafe read(AbstractNioByteChannel. java:146)atio.netty- channel.nio NioEventLoop . processSelectedKey (NioEventLoop.java:645)

  t io.netty. channel. nio. NioEventLoop. processSelectedKeysOptimi zed (NioEventLoop.java:580)tio.netty . channel.nio.NioEventLoop . processSelectedKeys (NioEventLoop.java:497)tio.netty . channel.nio.NioEventLoop . run (NioEventLoop. java:459)

  at io. netty.util.concurrent. SingleThreadEventExecutor$5. run (SingleThreadEventExecutor -java:886)atio. netty.ut il.concurrent. FastThreadLocalRunnable . run (FastThreadLocal Runnable.java:30)tjava.lang Thread. run(Thread.java:748)

堆栈二:

2020-04-27 18:47:03.302 -ERROR [NettyClientSelector TMROLE 1] io.seata. core. rpc. netty .AbstractRpcRemotingClient [][][] [] [] 0318 io netty. handler. codec . DecoderException: java. lang . IllegalArgumentException: not support typeCode,-9510

  at io.netty. handler. codec. ByteToMessageDecoder . callDecode (ByteToMessageDecoder.java:459)t io.netty. handler. codec. ByteToMessageDecoder . channelRead (ByteToMessageDecoder . java:265)

  at io.netty. channel. AbstractChannelHandlerContext. invokeChannelRead (AbstractChannelHandlerContext.java:362)at io. netty. channel.AbstractChannelHandlerContext . invokeChannelRead (AbstractChannelHandlerContext.java :348)at io.netty. channel.AbstractchannelHandlerContext. fireChannelRead (AbstractChannelHandlerContext.java:340)at io.netty.handler. timeout.IdleStateHandler . channelRead (IdleStateHandler. java:286)

  at io. netty. channel .AbstractChannelHandlerContext. invokeChannelRead (AbstractChannelHandlerContext.java:362)at io.netty. channel.AbstractChannelHandlerContext . invokeChannelRead (AbstractChannelHandlerContext . java:348)at io.netty. channel.AbstractChannelHandlerContext. fireChannelRead (AbstractChannelHandlerContext.java:340)at io.netty channel.DefaultChannelPipeline$HeadContext . channelRead (DefaultChannelPipeline.java:1414)

  at io.netty. channel.AbstractChannelHandlerContext. invokeChannelRead (AbstractChannelHandlerContext.java:362)at io.netty. channel. AbstractChannelHandlerContext. invokeChannelRead (AbstractChannelHandlerContext. java:348)at io.netty. channel. DefaultChannelPipeline. fireChannelRead (DefaultChannelPipeline. java:945)

  at io.netty.channel. nio. AbstractNioByteChannel$NioByteUnsafe. read (AbstractNioByteChannel.java:146)at io.netty. channel. nio NioEventLoop .processSelectedKey (NioEventLoop .java:645)

  t io.netty.channel.nio.NioEventLoop processSelectedKeysOptimized (NioEventLoop.java:580)t io. netty.channel. nio.NioEventLoop .processSelectedKeys (NioEventLoop.java: 497)at io.netty channel. nio. NioEventLoop. run (NioEventLoop. java:459)

  at io.netty.util. concurrent . SingleThreadEventExecutor$5. run (SingleThreadEventExecutor.java:886)at io.netty.util. concurrent. FastThreadLocalRunnable . run (FastThreadLocalRunnable.java:30)at java. lang .Thread. run (Thread.java:748)

Caused by: java.lang. IllegalArgumentException: not support typeCode,-9510

  at io.seata.serializer.seata.MessageCodecFactory.getMergeResponseInstanceByCode (MessageCodecFactory. java:342)at ioseata.serializer. seata. MessageCodecFactory . getMessage (MessageCodecFactory.java:281)at io.seata.serializer. seata.SeataSerializer .deserialize (SeataSerializer.java:78)

at e io.seata. core. rpc. netty. v1. ProtocolV1Decoder .decodeF rame (ProtocolV1Decoder.java:141)

  at io.seata. core. rpc. netty.v1. ProtocolV1Decoder .decode (ProtocolV1Decoder.java:86)

at io.netty .handler. codec.LengthFieldBased rameDecoder decode (LengthFieldBasedFrameDecoder java:343)

at io.netty handler. codec.ByteToMessageDecoder . decodeRemovalReentryProtection (ByteToMessageDecoder.java:489)at io.netty. handler . codec. ByteToMessageDecoder. callDecode (ByteToMessageDecoder.java:428)

  20commonframes omitted

堆栈三: ERROR[NettyClientSelector_TMROLE_1]io.seata.core.rpc.netty.AbstractRpcRemotingClient 0318 io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 8388608: 3671720192- discarded at io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:522) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:500) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.exceededFrameLength(LengthFieldBasedFrameDecoder.java:387) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:430) at io.seata.core.rpc.netty.v1.ProtocolV1Decoder.decode(ProtocolV1Decoder.java:82) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343) at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502) at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897) at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.lang.Thread.run(Thread.java:748)

Ⅲ. Describe what you expected to happen

Ⅳ. How to reproduce it (as minimally and precisely as possible)

  1. xxx
  2. xxx
  3. xxx

Ⅴ. Anything else we need to know?

Ⅵ. Environment:

cmonkey commented 4 years ago

typeCode 怎么都出错了呢?

hermitcai commented 4 years ago

这种是不是可以理解成 程序阻塞61秒 阻塞过程中心跳肯定就超时了 超时注册中心就会挂起你这个服务了

bzywbk commented 4 years ago
  1. The server-side encoding exception is the root cause of the problem E.g: Some required fields are not set, such as GlobalStatus, BranchStatus in class AbstractGlobalEndResponse and AbstractBranchEndResponse(Initial value should be given when new)
  2. The server can send messages normally, and the client can also receive messages (Use LoggingHandler to easily verify that the client has received the message) . But when the decoding exception occurs, the client cannot decode the message normally. This is why you need to close the channel, otherwise you will not be able to recover.
  3. So when an exception occurs during decoding, the connection should be closed E.g:
    try {
    return decodeFrame(frame);
    } catch (Exception e) {
    LOGGER.error("Decode frame error!", e);
    ctx.channel().close();
    throw e;
    } finally {
    frame.release();
    }
  4. To sum up, some fields necessary for encoding should be assigned initial values, and the connection should be closed when an exception occurs during decoding.
  5. Issues 2431,1859, 1939, 2193, and 2458 are also caused by the same reason
WindyDovs commented 4 years ago
  1. The server-side encoding exception is the root cause of the problem E.g: Some required fields are not set, such as GlobalStatus, BranchStatus in class AbstractGlobalEndResponse and AbstractBranchEndResponse(Initial value should be given when new)
  2. The server can send messages normally, and the client can also receive messages (Use LoggingHandler to easily verify that the client has received the message) . But when the decoding exception occurs, the client cannot decode the message normally. This is why you need to close the channel, otherwise you will not be able to recover.
  3. So when an exception occurs during decoding, the connection should be closed E.g:
try {
    return decodeFrame(frame);
} catch (Exception e) {
    LOGGER.error("Decode frame error!", e);
    ctx.channel().close();
    throw e;
} finally {
    frame.release();
}
  1. To sum up, some fields necessary for encoding should be assigned initial values, and the connection should be closed when an exception occurs during decoding.
  2. Issues 2431,1859, 1939, 2193, and 2458 are also caused by the same reason

谢谢,这样做的确可以解决。 这个问题产生的实际原因是,因为超时,再上去提交时,返回时候,拼装报文时候出现空指针,所以出现了写了半包后报错的情况,那么netty的报文因此乱套了,后面就无法恢复了。 解决方案1是出错后,客户端关闭channel重连 解决方案2是服务端先把报文整体拼装好后,再一次 writeandflush 出去,拼装中途报错的话就忽略本次回复,保证链路中的包不乱套。

WindyDovs commented 4 years ago

typeCode 怎么都出错了呢?

是由于超时后,服务端返回结果时,拼装报文空指针导致的,因为报文是一个个字段 write 进去的,当报文中其中一个字段空指针后,就导致 channel 中出现永远无法恢复的半包,从而使链路无法再用。

hermitcai commented 4 years ago

挺有意思的一个场景了 后续再读

WindyDovs commented 4 years ago

这个问题我的修复方案是上面所说的方案二,“服务端先把报文整体拼装好后,再一次 writeandflush 出去,拼装中途报错的话就忽略本次回复,保证链路中的包不乱套。” 现正式关闭issue。

tkfntkfn commented 1 year ago

你好, 可以请教下方案二具体怎么操作的吗