apache / servicecomb-pack

Apache ServiceComb Pack is an eventually data consistency solution for micro-service applications. ServiceComb Pack currently provides TCC and Saga distributed transaction co-ordination solutions by using Alpha as a transaction coordinator and Omega as an transaction agent .
https://servicecomb.apache.org/
Apache License 2.0
1.93k stars 436 forks source link

0.6.0 alpha-server constantly throw NPE in "Global Tx Timeout Situation" #695

Open Richard-Yi opened 3 years ago

Richard-Yi commented 3 years ago

Testing scenarios

I was testing how ServiceComb behave when the global Tx is timed out.

There were 3 services:

The order service calls the account service and the storage service in turn.

  1. The order service creates an order locally
  2. The account service deducts the corresponding amount of money of the corresponding user
  3. The storage service deducts the inventory of the corresponding count of the corresponding product

The testing scenarios is as below:

  1. Order Service set the global Tx timout to 10s via @SagaStart(timeout = 10)

  2. storage service will simulate a timeout of 10 seconds via Thread.sleep()

code

order serivce

    @SagaStart(timeout = 10)
    @Transactional(rollbackFor = Exception.class)
    public boolean create(Order order) {

        // 1. 创建订单
        orderLocalService.createLocal(order);

        // 扣钱
        String result = accountClient.decrease(order.getUserId(), order.getMoney());
        if (Objects.equals(result, "fail")) {
            log.error("扣库存失败,回滚本地事务");
            throw new RuntimeException("扣钱失败");
        }

        // 扣库存
        result = storageClient.decrease(order.getProductId(), order.getCount());
        if (Objects.equals(result, "fail")) {
            log.error("扣库存失败,回滚本地事务");
            throw new RuntimeException("扣库存失败");
        }

        return true;
    }

account service

    @Override
    @Transactional(rollbackFor = Exception.class)
    @Compensable(compensationMethod = "compensate")
    public void decrease(Long userId, BigDecimal money) {
        accountMapper.decrease(userId, money);
    }

storage service

    @Transactional(rollbackFor = Exception.class)
    @Compensable(compensationMethod = "compensate")
    public void decrease(Long productId, Integer count) {

        log.info("执行 storage 正常流程,  全局事务ID: {}, 分支事务ID: {}", omegaContext.globalTxId(), omegaContext.localTxId());

        storageMapper.decrease(productId, count);
        // 11s time out
        try {
            Thread.sleep(11000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

Problem Description

In the above test case, when the global transaction times out, alpha-server will continuously throw NPE exceptions. The exception log is as follows:

2021-02-09 14:24:36.871  INFO ruichadeMacBook.local --- [t-dispatcher-14] o.a.s.p.a.f.SagaActor                    : recovery SagaStartedEvent{timeout=10} {"@type":"SagaStartedEvent","serviceName":"order-server","instanceId":"order-server-172.20.160.108","globalTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","parentTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","localTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","createTime":1612851846898,"timeout":10,"type":"SagaStartedEvent"}
2021-02-09 14:24:36.871  INFO ruichadeMacBook.local --- [t-dispatcher-14] o.a.s.p.a.f.SagaActor                    : recovery {"@type":"TxStartedEvent","serviceName":"account-server","instanceId":"account-server-172.20.160.108","globalTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","parentTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","localTxId":"dc5e50b7-260e-4f8a-9e42-ead972d6ab28","createTime":1612851846912,"compensationMethod":"public void com.ricstudio.account.service.AccountServiceImpl.compensate(java.lang.Long,java.math.BigDecimal)","payloads":"AQMJAgEAamF2YS5tYXRoLkJpZ0RlY2ltYewBAmQA","retryMethod":"","forwardRetries":0,"forwardTimeout":0,"reverseRetries":0,"reverseTimeout":0,"retryDelayInMilliseconds":5,"type":"TxStartedEvent"}
2021-02-09 14:24:36.871  INFO ruichadeMacBook.local --- [t-dispatcher-14] o.a.s.p.a.f.SagaActor                    : recovery {"@type":"TxEndedEvent","serviceName":"account-server","instanceId":"account-server-172.20.160.108","globalTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","parentTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","localTxId":"dc5e50b7-260e-4f8a-9e42-ead972d6ab28","createTime":1612851846944,"type":"TxEndedEvent"}
2021-02-09 14:24:36.871  INFO ruichadeMacBook.local --- [t-dispatcher-12] o.a.s.p.a.f.SagaActor                    : recovery {"@type":"TxStartedEvent","serviceName":"storage-server","instanceId":"storage-server-172.20.160.108","globalTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","parentTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","localTxId":"2d815537-efa5-43ca-a2e4-3078679c8903","createTime":1612851846950,"compensationMethod":"public void com.ricstudio.storage.service.StorageServiceImpl.compensate(java.lang.Long,java.lang.Integer)","payloads":"AQMJAgIU","retryMethod":"public void com.ricstudio.storage.service.StorageServiceImpl.decrease(java.lang.Long,java.lang.Integer)","forwardRetries":3,"forwardTimeout":0,"reverseRetries":0,"reverseTimeout":0,"retryDelayInMilliseconds":5,"type":"TxStartedEvent"}
2021-02-09 14:24:36.871  INFO ruichadeMacBook.local --- [t-dispatcher-12] o.a.s.p.a.f.SagaActor                    : recovery null
2021-02-09 14:24:36.871  INFO ruichadeMacBook.local --- [t-dispatcher-14] o.a.s.p.a.f.SagaActor                    : recovery completed [20fe99d8-1339-4bda-9a50-285b66eeab3f] state=SUSPENDED
2021-02-09 14:24:36.871 ERROR ruichadeMacBook.local --- [t-dispatcher-14] o.a.s.p.a.f.SagaActor                    : stop [20fe99d8-1339-4bda-9a50-285b66eeab3f] fail
2021-02-09 14:24:36.871 ERROR ruichadeMacBook.local --- [t-dispatcher-14] a.a.OneForOneStrategy                    : null
java.lang.NullPointerException
    at org.apache.servicecomb.pack.alpha.core.fsm.repository.model.SagaSubTransaction$Builder.build(SagaSubTransaction.java:103)
    at org.apache.servicecomb.pack.alpha.fsm.spring.integration.akka.SagaDataExtension$SagaDataExt.lambda$stopSagaData$0(SagaDataExtension.java:76)
    at java.util.LinkedHashMap.forEach(LinkedHashMap.java:684)
    at org.apache.servicecomb.pack.alpha.fsm.model.TxEntities.forEach(TxEntities.java:32)
    at org.apache.servicecomb.pack.alpha.fsm.spring.integration.akka.SagaDataExtension$SagaDataExt.stopSagaData(SagaDataExtension.java:69)
    at org.apache.servicecomb.pack.alpha.fsm.SagaActor.beforeStop(SagaActor.java:387)
    at org.apache.servicecomb.pack.alpha.fsm.SagaActor.lambda$new$31(SagaActor.java:325)
    at akka.persistence.fsm.japi.pf.FSMStateFunctionBuilder$2.apply(FSMStateFunctionBuilder.java:84)
    at akka.persistence.fsm.japi.pf.FSMStateFunctionBuilder$2.apply(FSMStateFunctionBuilder.java:81)
    at akka.japi.pf.CaseStatement.apply(CaseStatements.scala:18)
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
    at akka.japi.pf.CaseStatement.applyOrElse(CaseStatements.scala:13)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:168)
    at akka.persistence.fsm.PersistentFSMBase.processEvent(PersistentFSMBase.scala:459)
    at akka.persistence.fsm.PersistentFSMBase.processEvent$(PersistentFSMBase.scala:456)
    at akka.persistence.fsm.AbstractPersistentFSMBase.processEvent(PersistentFSMBase.scala:639)
    at akka.persistence.fsm.PersistentFSMBase.akka$persistence$fsm$PersistentFSMBase$$processMsg(PersistentFSMBase.scala:453)
    at akka.persistence.fsm.PersistentFSMBase$$anonfun$receive$1.applyOrElse(PersistentFSMBase.scala:448)
    at akka.actor.Actor.aroundReceive(Actor.scala:517)
    at akka.actor.Actor.aroundReceive$(Actor.scala:515)
    at akka.persistence.fsm.AbstractPersistentFSM.akka$persistence$Eventsourced$$super$aroundReceive(PersistentFSM.scala:441)
    at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:743)
    at akka.persistence.Eventsourced.aroundReceive(Eventsourced.scala:222)
    at akka.persistence.Eventsourced.aroundReceive$(Eventsourced.scala:221)
    at akka.persistence.fsm.AbstractPersistentFSM.aroundReceive(PersistentFSM.scala:441)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:588)
    at akka.actor.ActorCell.invoke(ActorCell.scala:557)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2021-02-09 14:24:36.871  INFO ruichadeMacBook.local --- [t-dispatcher-29] o.a.s.p.a.f.SagaActor                    : recovery SagaStartedEvent{timeout=10} {"@type":"SagaStartedEvent","serviceName":"order-server","instanceId":"order-server-172.20.160.108","globalTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","parentTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","localTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","createTime":1612851846898,"timeout":10,"type":"SagaStartedEvent"}
2021-02-09 14:24:36.871  INFO ruichadeMacBook.local --- [t-dispatcher-29] o.a.s.p.a.f.SagaActor                    : recovery {"@type":"TxStartedEvent","serviceName":"account-server","instanceId":"account-server-172.20.160.108","globalTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","parentTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","localTxId":"dc5e50b7-260e-4f8a-9e42-ead972d6ab28","createTime":1612851846912,"compensationMethod":"public void com.ricstudio.account.service.AccountServiceImpl.compensate(java.lang.Long,java.math.BigDecimal)","payloads":"AQMJAgEAamF2YS5tYXRoLkJpZ0RlY2ltYewBAmQA","retryMethod":"","forwardRetries":0,"forwardTimeout":0,"reverseRetries":0,"reverseTimeout":0,"retryDelayInMilliseconds":5,"type":"TxStartedEvent"}
2021-02-09 14:24:36.871  INFO ruichadeMacBook.local --- [t-dispatcher-29] o.a.s.p.a.f.SagaActor                    : recovery {"@type":"TxEndedEvent","serviceName":"account-server","instanceId":"account-server-172.20.160.108","globalTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","parentTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","localTxId":"dc5e50b7-260e-4f8a-9e42-ead972d6ab28","createTime":1612851846944,"type":"TxEndedEvent"}
2021-02-09 14:24:36.871  INFO ruichadeMacBook.local --- [t-dispatcher-29] o.a.s.p.a.f.SagaActor                    : recovery {"@type":"TxStartedEvent","serviceName":"storage-server","instanceId":"storage-server-172.20.160.108","globalTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","parentTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","localTxId":"2d815537-efa5-43ca-a2e4-3078679c8903","createTime":1612851846950,"compensationMethod":"public void com.ricstudio.storage.service.StorageServiceImpl.compensate(java.lang.Long,java.lang.Integer)","payloads":"AQMJAgIU","retryMethod":"public void com.ricstudio.storage.service.StorageServiceImpl.decrease(java.lang.Long,java.lang.Integer)","forwardRetries":3,"forwardTimeout":0,"reverseRetries":0,"reverseTimeout":0,"retryDelayInMilliseconds":5,"type":"TxStartedEvent"}
2021-02-09 14:24:36.871  INFO ruichadeMacBook.local --- [t-dispatcher-29] o.a.s.p.a.f.SagaActor                    : recovery null
2021-02-09 14:24:36.871  INFO ruichadeMacBook.local --- [t-dispatcher-29] o.a.s.p.a.f.SagaActor                    : recovery completed [20fe99d8-1339-4bda-9a50-285b66eeab3f] state=SUSPENDED
2021-02-09 14:24:36.872 ERROR ruichadeMacBook.local --- [t-dispatcher-29] o.a.s.p.a.f.SagaActor                    : stop [20fe99d8-1339-4bda-9a50-285b66eeab3f] fail
2021-02-09 14:24:36.872 ERROR ruichadeMacBook.local --- [t-dispatcher-29] a.a.OneForOneStrategy                    : null
java.lang.NullPointerException
    at org.apache.servicecomb.pack.alpha.core.fsm.repository.model.SagaSubTransaction$Builder.build(SagaSubTransaction.java:103)
    at org.apache.servicecomb.pack.alpha.fsm.spring.integration.akka.SagaDataExtension$SagaDataExt.lambda$stopSagaData$0(SagaDataExtension.java:76)
    at java.util.LinkedHashMap.forEach(LinkedHashMap.java:684)
    at org.apache.servicecomb.pack.alpha.fsm.model.TxEntities.forEach(TxEntities.java:32)
    at org.apache.servicecomb.pack.alpha.fsm.spring.integration.akka.SagaDataExtension$SagaDataExt.stopSagaData(SagaDataExtension.java:69)
    at org.apache.servicecomb.pack.alpha.fsm.SagaActor.beforeStop(SagaActor.java:387)
    at org.apache.servicecomb.pack.alpha.fsm.SagaActor.lambda$new$31(SagaActor.java:325)
    at akka.persistence.fsm.japi.pf.FSMStateFunctionBuilder$2.apply(FSMStateFunctionBuilder.java:84)
    at akka.persistence.fsm.japi.pf.FSMStateFunctionBuilder$2.apply(FSMStateFunctionBuilder.java:81)
    at akka.japi.pf.CaseStatement.apply(CaseStatements.scala:18)
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
    at akka.japi.pf.CaseStatement.applyOrElse(CaseStatements.scala:13)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:168)
    at akka.persistence.fsm.PersistentFSMBase.processEvent(PersistentFSMBase.scala:459)
    at akka.persistence.fsm.PersistentFSMBase.processEvent$(PersistentFSMBase.scala:456)
    at akka.persistence.fsm.AbstractPersistentFSMBase.processEvent(PersistentFSMBase.scala:639)
    at akka.persistence.fsm.PersistentFSMBase.akka$persistence$fsm$PersistentFSMBase$$processMsg(PersistentFSMBase.scala:453)
    at akka.persistence.fsm.PersistentFSMBase$$anonfun$receive$1.applyOrElse(PersistentFSMBase.scala:448)
    at akka.actor.Actor.aroundReceive(Actor.scala:517)
    at akka.actor.Actor.aroundReceive$(Actor.scala:515)
    at akka.persistence.fsm.AbstractPersistentFSM.akka$persistence$Eventsourced$$super$aroundReceive(PersistentFSM.scala:441)
    at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:743)
    at akka.persistence.Eventsourced.aroundReceive(Eventsourced.scala:222)
    at akka.persistence.Eventsourced.aroundReceive$(Eventsourced.scala:221)
    at akka.persistence.fsm.AbstractPersistentFSM.aroundReceive(PersistentFSM.scala:441)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:588)
    at akka.actor.ActorCell.invoke(ActorCell.scala:557)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

The alpha UI show the suspended tx reaches 45.46K+...

image

My exploration of the Problem

I debug the alpha in my local env and make the break point according to the error stacktrace. Then I found what cause the probleam.

when alpha found the global tx is timed out, the global tx state will be set to SUSPEND,and will stop the sagaActor.

image

before stopping, it will save the GlobalTransaction to the es. The NPE happens when building the subTransaction.

image

although the global tx is SUSPEND, the sub tx is still active and its endtime is null, and then cause NPE here.

coolbeevip commented 3 years ago

Thank you, I need some time to deal with this problem

coolbeevip commented 3 years ago

This bug same as the JIRA issue SCB-2004 and has been fixed in #667

Richard-Yi commented 3 years ago

This bug same as the JIRA issue SCB-2004 and has been fixed in #667

sincerely Thanks for the reply. @coolbeevip But sorry, I didn’t figure out that these two are the same problem. Could you please explain more about it. And is the problem fixed in 0.0.7-SNAPSHOT

coolbeevip commented 3 years ago

This bug same as the JIRA issue SCB-2004 and has been fixed in #667

sincerely Thanks for the reply. @coolbeevip But sorry, I didn’t figure out that these two are the same problem. Could you please explain more about it. And is the problem fixed in 0.0.7-SNAPSHOT

You can try build 0.7.0-SNAPSHOT with branch master

Richard-Yi commented 3 years ago

This bug same as the JIRA issue SCB-2004 and has been fixed in #667

sincerely Thanks for the reply. @coolbeevip But sorry, I didn’t figure out that these two are the same problem. Could you please explain more about it. And is the problem fixed in 0.0.7-SNAPSHOT

You can try build 0.7.0-SNAPSHOT with branch master

@coolbeevip well, I tried 0.7.0-SNAPSHOT with branch master, the problem still exists and the exception stack trace is the same.

coolbeevip commented 3 years ago

This bug same as the JIRA issue SCB-2004 and has been fixed in #667

sincerely Thanks for the reply. @coolbeevip But sorry, I didn’t figure out that these two are the same problem. Could you please explain more about it. And is the problem fixed in 0.0.7-SNAPSHOT

You can try build 0.7.0-SNAPSHOT with branch master

@coolbeevip well, I tried 0.7.0-SNAPSHOT with branch master, the problem still exists and the exception stack trace is the same.

Hmm. I recently released docker https://hub.docker.com/repository/docker/coolbeevip/servicecomb-pack/tags?page=1&ordering=last_updated, can you try this again?

docker run \
  -p 8080:8080 \
  -p 8090:8090 \
  -p 5432:5432 \
  coolbeevip/servicecomb-pack:0.7.0-all-in-one