apache / incubator-seata

:fire: Seata is an easy-to-use, high-performance, open source distributed transaction solution.
https://seata.apache.org/
Apache License 2.0

file&raft doGlobalRollback and doGlobalCommit may have concurrency issues with retry tasks #7004

Open funky-eyes opened 2 weeks ago

funky-eyes commented 2 weeks ago

Ⅰ. Issue Description

At 21:34:16.908 the global transaction had already been completed on one thread, yet at 21:34:16.910 another thread was still rolling back one of its branches. This is clearly a parallel rollback, and the log shows branch 6882078649837270974 being rolled back twice. The transaction was long-running (about 4 minutes), so the scheduled task, which automatically retries transactions that have been in the rollbacking state for more than 2 minutes 10 seconds, picked it up. That retry happened to coincide with the actual rollback decision, so the two rollbacks ran concurrently. In raft mode this concurrency meant that the corresponding GlobalSession had already been deleted when a synchronization message for a BranchSession operation was sent afterwards, which caused a NullPointerException (NPE):

2024-11-12 21:34:16.911 ERROR --- [JRaft-FSMCaller-Disruptor-0] [org.apache.seata.server.cluster.raft.RaftStateMachine] 
[onExecuteRaft] []: Message synchronization failure: Cannot invoke "org.apache.seata.server.session.GlobalSession.getBranch(long)" because "globalSession" is null, msgType: RELEASE_BRANCH_SESSION_LOCK
==>
java.lang.NullPointerException: Cannot invoke "org.apache.seata.server.session.GlobalSession.getBranch(long)" because "globalSession" is null
    at org.apache.seata.server.cluster.raft.execute.lock.BranchReleaseLockExecute.execute(BranchReleaseLockExecute.java:35)
    at org.apache.seata.server.cluster.raft.execute.lock.BranchReleaseLockExecute.execute(BranchReleaseLockExecute.java:28)
    at org.apache.seata.server.cluster.raft.RaftStateMachine.onExecuteRaft(RaftStateMachine.java:333)
    at org.apache.seata.server.cluster.raft.RaftStateMachine.onApply(RaftStateMachine.java:174)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.doApplyTasks(FSMCallerImpl.java:597)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.doCommitted(FSMCallerImpl.java:561)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.runApplyTask(FSMCallerImpl.java:467)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.access$100(FSMCallerImpl.java:73)
    at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:150)
    at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:142)
    at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
    at java.base/java.lang.Thread.run(Thread.java:1583)
<==
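
To make the failure in the trace above concrete, here is a minimal sketch of a state-machine executor that treats a missing global session as a stale message instead of dereferencing it. All class and method names (`SketchSessionStore`, `SketchReleaseLockExecutor`, and so on) are hypothetical stand-ins, not Seata's real classes, and this is not necessarily how the actual fix is implemented.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a global session holding its branches.
final class SketchGlobalSession {
    private final Map<Long, String> branches = new ConcurrentHashMap<>();

    void addBranch(long branchId, String resourceId) {
        branches.put(branchId, resourceId);
    }

    String getBranch(long branchId) {
        return branches.get(branchId);
    }
}

// Hypothetical stand-in for the in-memory session store of one raft node.
final class SketchSessionStore {
    private final Map<String, SketchGlobalSession> sessions = new ConcurrentHashMap<>();

    void put(String xid, SketchGlobalSession session) {
        sessions.put(xid, session);
    }

    void remove(String xid) {
        sessions.remove(xid);
    }

    SketchGlobalSession find(String xid) {
        return sessions.get(xid);
    }
}

// Hypothetical executor for a "release branch lock" sync message.
final class SketchReleaseLockExecutor {
    private final SketchSessionStore store;

    SketchReleaseLockExecutor(SketchSessionStore store) {
        this.store = store;
    }

    /** Returns true if the branch was found and handled, false if the message was stale. */
    boolean execute(String xid, long branchId) {
        SketchGlobalSession globalSession = store.find(xid);
        if (globalSession == null) {
            // A concurrent rollback already removed the global session: skip, do not throw.
            return false;
        }
        String branch = globalSession.getBranch(branchId);
        if (branch == null) {
            // The branch was already removed by the other rollback.
            return false;
        }
        // Real code would release the branch's locks here.
        return true;
    }
}
```

A guard like this only hides the symptom on the apply path; the concurrent rollback itself still happens.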

2024-11-12 21:34:16.892 INFO --- [SyncProcessing_1_1] [org.apache.seata.server.coordinator.DefaultCore] [lambda$doGlobalRollback$3] [193.193.193.37:8097:6882078649837270973]: Rollback branch transaction successfully, xid = 193.193.193.37:8097:6882078649837270973 branchId = 6882078649837270974

2024-11-12 21:34:16.908 INFO --- [SyncProcessing_1_1] [org.apache.seata.server.coordinator.DefaultCore] [doGlobalRollback] [193.193.193.37:8097:6882078649837270973]: Rollback global transaction successfully, xid = 193.193.193.37:8097:6882078649837270973.

2024-11-12 21:34:16.910 INFO --- [ServerHandlerThread_1_19_500] [org.apache.seata.server.coordinator.DefaultCore] [lambda$doGlobalRollback$3] [193.193.193.37:8097:6882078649837270973]: Rollback branch transaction successfully, xid = 193.193.193.37:8097:6882078649837270973 branchId = 6882078649837270974

funky-eyes commented 1 week ago

#7005 fixed the raft NPE triggered by this concurrency, but the underlying problem, that a phase-two retry and the phase-two decision can run at the same time, has not been handled yet.

Proposed solutions:

Option 1 (the original plan): add a local lock. This solves the problem in the modes where storage and computation are integrated (raft and file). However, taking a pessimistic lock to guard against such a low-probability event costs performance unnecessarily, and the lock has no effect in db and redis modes.
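
As a rough illustration of option 1, the sketch below guards phase-two work with a node-local, try-lock style check keyed by xid, so that whichever of the retry task or the decision thread arrives second skips instead of running in parallel. `PhaseTwoGuard` and `tryRun` are hypothetical names, not Seata APIs. Because the guard lives in one JVM's memory, it illustrates why this approach cannot help when several servers share a db or redis store.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical node-local guard: at most one phase-two action per xid at a time.
final class PhaseTwoGuard {
    // xids whose commit/rollback is currently being executed on this node.
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    /**
     * Runs the action only if no other caller is currently processing the same xid.
     * Returns true if the action ran, false if it was skipped.
     */
    boolean tryRun(String xid, Runnable action) {
        if (!inFlight.add(xid)) {
            // Another thread (retry task or decision handler) owns this xid right now.
            return false;
        }
        try {
            action.run();
            return true;
        } finally {
            inFlight.remove(xid);
        }
    }
}
```

Every phase-two call pays for the set insert and removal, even though the collision is rare, which is the overhead the option's description objects to.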

Option 2: make the compensation time dynamic. The deadtime is the interval after which a transaction still in the rollbacking or committing state is assumed to have hit an exception and is handed to the asynchronous compensation task; it defaults to 2 minutes 10 seconds. Long transactions could specify their own deadtime per transaction through the `@GlobalTransactional` annotation to avoid the concurrency. (A global setting, server.retryDeadThreshold, already exists, but it is not fine-grained enough.) The drawback is that db storage mode needs a new table column, so users must apply the table schema change before upgrading the server.
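
The decision rule behind option 2 can be sketched as follows, assuming the per-transaction value is carried with the session and that a non-positive value means "not set". `RetryDeadlinePolicy`, `shouldCompensate`, and both parameter names are hypothetical; only the 2 minutes 10 seconds default comes from the description above.

```java
// Hypothetical policy: pick the per-transaction deadtime if present, else the global one.
final class RetryDeadlinePolicy {
    // Global default taken from the issue text: 2 minutes 10 seconds.
    static final long DEFAULT_DEAD_THRESHOLD_MS = 130_000L;

    /**
     * @param elapsedMs        how long the transaction has been running, in milliseconds
     * @param perTxThresholdMs deadtime requested for this transaction; <= 0 means unset
     * @return true if the compensation task should pick the transaction up
     */
    static boolean shouldCompensate(long elapsedMs, long perTxThresholdMs) {
        long threshold = perTxThresholdMs > 0 ? perTxThresholdMs : DEFAULT_DEAD_THRESHOLD_MS;
        return elapsedMs > threshold;
    }
}
```

With this rule, the roughly 4-minute transaction from the report would be retried under the default, but left alone if it had declared a deadtime of, say, 5 minutes.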

Option 3: use a consensus algorithm (raft combined with db, redis, or other storage modes) to detect whether the node that made the decision has gone offline. The compensation task would then only pick up transactions in the rollbacking state whose xid belongs to a server that is offline. While the server that owns the xid is online, the transaction should not be compensated: if an exception occurs during synchronization, the transaction changes status rather than staying in rollbacking. In other words, if the owning server is alive, a rollbacking transaction can only be one that is still running on that live node, so it needs no compensation.
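
A minimal sketch of the filter option 3 describes, assuming the xid has the `address:port:transactionId` shape seen in the logs above and that the set of live servers is supplied by some membership mechanism (not modeled here). `OwnerAwareCompensation` and its methods are hypothetical names.

```java
import java.util.Set;

// Hypothetical filter: compensate only transactions whose owning server is offline.
final class OwnerAwareCompensation {

    /** Extracts "address:port" from an xid of the form "address:port:transactionId". */
    static String ownerAddress(String xid) {
        int idx = xid.lastIndexOf(':');
        if (idx <= 0) {
            throw new IllegalArgumentException("unexpected xid format: " + xid);
        }
        return xid.substring(0, idx);
    }

    /**
     * @param xid          global transaction id in the rollbacking state
     * @param aliveServers "address:port" of every server currently known to be online
     * @return true only if the server that owns the xid is no longer online
     */
    static boolean shouldCompensate(String xid, Set<String> aliveServers) {
        return !aliveServers.contains(ownerAddress(xid));
    }
}
```

For the xid in the report, `shouldCompensate` returns false while 193.193.193.37:8097 is in the live set, so the retry task would have left the in-progress rollback alone.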

Additional proposals are welcome; let's discuss them together with the community.