TCC mode: after commit retries time out, the transaction does not enter the rollback flow

chenbihao commented 1 week ago

[x] I have searched the issues of this repository and believe that this is not a duplicate.

Ⅰ. Issue Description

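I hit a strange problem in TCC mode: after commit retries time out, the transaction does not enter the rollback flow. Which direction should I investigate?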

Ⅱ. Describe what happened

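I wrote a Spring Cloud 2023 demo that uses TCC mode. A Controller triggers it by calling a method on a local bean. The business operation marks a row as deleted, and I added a locked column to the table to act as the reserved resource. Code is attached below.

When the commit phase throws an exception, the flow does not proceed as I expected (I assumed an exception during commit would trigger a rollback).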

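How I triggered it and what I changed

At first I hit a NullPointerException thrown by context.getActionContext. After some investigation I found that putting @BusinessActionContextParameter on the method in the implementation class fixed it.

Then I tested the normal TCC flow and injected an exception into the commit phase. When the exception was thrown, the commit was retried indefinitely, so I adjusted the server configuration: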

seata:
  server:
    max-commit-retry-timeout: 5000
    max-rollback-retry-timeout: 8000
    distributedLockExpireTime: 10000

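I kept testing and found that the behavior differed oddly from one call to the next. After clearing the tables in the Seata server database and restarting the related services:

First request: at the debug breakpoint, the Cancel (rollback) method ran directly.
Second request: at the debug breakpoint, the commit method ran, and after a few retries the transaction ended up in the "commit retry timeout" failed state.
This pattern kept repeating.

I read some articles and found this statement: "In Seata 1.5.1, a transaction control table named tcc_fence_log was added to solve this problem."

So I created that table and added the annotation attribute useTCCFence = true, but it still did not work: after the retries time out, the rollback method is still not called.

Can you see which step I got wrong, or suggest which direction to investigate?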

funky-eyes commented 1 week ago

I suggest first reading up on the TCC transaction mode and the two-phase commit protocol.

The first request ran cancel, which means your debugging caused a timeout in phase one, or an RPC exception, so the transaction was decided as a rollback.
On the second request, the commit retry timeout exists precisely to prevent infinite retries. Once retries stop, the transaction should be handled by manual intervention.
In two-phase commit, the outcome cannot change after the decision because of what happens in phase two. Otherwise, earlier and later branch transactions would end up behaving inconsistently, and the distributed transaction guarantee would be lost.
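
The point about the decision being final can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: TwoPhaseDecisionSketch, Branch, Outcome, and ScriptedBranch are hypothetical names and not Seata APIs, and the retry limit is a simple attempt count rather than the time-based max-commit-retry-timeout used by the server.

```java
import java.util.List;

// Hypothetical coordinator simulation; none of these names are Seata APIs.
final class TwoPhaseDecisionSketch {

    enum Outcome { COMMITTED, ROLLED_BACK, COMMIT_RETRY_TIMEOUT, ROLLBACK_RETRY_TIMEOUT }

    /** One TCC branch; each method returns true on success. */
    interface Branch {
        boolean tryPhase();
        boolean confirm();
        boolean cancel();
    }

    static Outcome run(List<? extends Branch> branches, int maxAttempts) {
        // Phase one: every branch must reserve its resource.
        boolean allTried = true;
        for (Branch b : branches) {
            if (!b.tryPhase()) { allTried = false; break; }
        }
        // The decision is taken once, here, and is never revisited.
        final boolean commit = allTried;

        // Phase two: drive every branch toward the decided outcome.
        // On a rollback decision, cancel is sent to all branches, so cancel
        // must tolerate a branch whose try failed or never ran.
        for (Branch b : branches) {
            boolean done = false;
            for (int attempt = 0; attempt < maxAttempts && !done; attempt++) {
                done = commit ? b.confirm() : b.cancel();
            }
            if (!done) {
                // Retries exhausted: stop and leave the transaction for manual
                // handling. A failed confirm never flips the decision to cancel.
                return commit ? Outcome.COMMIT_RETRY_TIMEOUT : Outcome.ROLLBACK_RETRY_TIMEOUT;
            }
        }
        return commit ? Outcome.COMMITTED : Outcome.ROLLED_BACK;
    }

    /** Test double: try result is fixed, confirm fails a set number of times. */
    static final class ScriptedBranch implements Branch {
        private final boolean tryOk;
        private int confirmFailuresLeft;
        int confirmCalls;
        int cancelCalls;

        ScriptedBranch(boolean tryOk, int confirmFailures) {
            this.tryOk = tryOk;
            this.confirmFailuresLeft = confirmFailures;
        }

        @Override public boolean tryPhase() { return tryOk; }

        @Override public boolean confirm() {
            confirmCalls++;
            if (confirmFailuresLeft > 0) { confirmFailuresLeft--; return false; }
            return true;
        }

        @Override public boolean cancel() { cancelCalls++; return true; }
    }

    public static void main(String[] args) {
        ScriptedBranch alwaysFailing = new ScriptedBranch(true, Integer.MAX_VALUE);
        System.out.println(run(List.of(alwaysFailing), 3));
        System.out.println("cancel calls: " + alwaysFailing.cancelCalls);
    }
}
```

With a branch whose confirm always throws, the simulated outcome is a commit retry timeout and cancel is never invoked, which matches the behavior reported for the second request.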

chenbihao commented 1 week ago

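Understood. I was overthinking it and wrongly assumed that an exception in the commit phase would cause a rollback.

One article said: "If the Try phase cannot lock the resources, or an exception occurs in the Confirm phase, the whole global transaction is rolled back."

In fact, the resource locking done in the Try phase is there precisely to make sure the commit phase can succeed, so the statement "if an exception occurs in the Confirm phase, the whole global transaction is rolled back" is wrong. Is that the right way to understand it?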

funky-eyes commented 6 days ago

Yes, the second half of that sentence is wrong. TCC (Try-Confirm-Cancel) is built on resource reservation, resource release, and resource commit. Phase one reserves the resource, so it is taken and locked, and other transactions cannot touch it. Phase two is then decided based on whether all branch transactions completed phase one normally.

In other words, once phase one has been decided, phase two only cancels or commits the reserved resources, so in theory it must succeed. If an attempt fails, a retry should eventually succeed.
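
The reserve, commit, release pattern described above can be sketched for the "mark a row as deleted" operation from the demo. This is a minimal in-memory illustration, not the demo's actual code: MarkDeleteTccSketch and its methods are hypothetical, and the lockedBy field is an assumed variation on the locked column that records which transaction holds the reservation so that confirm and cancel can be retried safely.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of a TCC "mark as deleted" branch; not Seata code.
final class MarkDeleteTccSketch {

    private static final class Row {
        String lockedBy;   // xid holding the reservation, or null when free
        boolean deleted;
    }

    private final Map<Long, Row> table = new HashMap<>();

    void insert(long id) { table.put(id, new Row()); }

    boolean isLocked(long id) { return table.get(id).lockedBy != null; }

    boolean isDeleted(long id) { return table.get(id).deleted; }

    /** Try: reserve the row. Fails if it is missing, deleted, or already reserved. */
    boolean tryDelete(String xid, long id) {
        Row r = table.get(id);
        if (r == null || r.deleted || r.lockedBy != null) {
            return false;
        }
        r.lockedBy = xid;
        return true;
    }

    /** Confirm: apply the delete and release the reservation. Safe to retry. */
    boolean confirm(String xid, long id) {
        Row r = table.get(id);
        if (r == null || !xid.equals(r.lockedBy)) {
            return true; // no reservation held by this xid (e.g. already confirmed): no-op
        }
        r.deleted = true;
        r.lockedBy = null;
        return true;
    }

    /** Cancel: release the reservation without deleting. Safe to retry. */
    boolean cancel(String xid, long id) {
        Row r = table.get(id);
        if (r == null || !xid.equals(r.lockedBy)) {
            return true; // nothing reserved by this xid (empty or repeated cancel): no-op
        }
        r.lockedBy = null;
        return true;
    }

    public static void main(String[] args) {
        MarkDeleteTccSketch t = new MarkDeleteTccSketch();
        t.insert(1L);
        System.out.println("tx1 try: " + t.tryDelete("tx1", 1L));
        System.out.println("tx2 try: " + t.tryDelete("tx2", 1L));
        System.out.println("tx1 confirm: " + t.confirm("tx1", 1L));
        System.out.println("deleted: " + t.isDeleted(1L));
    }
}
```

Because try has already taken the row, confirm has no business condition left to check, which is why a confirm that fails for transient reasons is expected to succeed on retry rather than trigger a rollback.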

chenbihao commented 6 days ago

Question resolved, thanks for the answer.

apache / incubator-seata