apache / servicecomb-pack

Apache ServiceComb Pack is an eventually data consistency solution for micro-service applications. ServiceComb Pack currently provides TCC and Saga distributed transaction co-ordination solutions by using Alpha as a transaction coordinator and Omega as an transaction agent .
https://servicecomb.apache.org/
Apache License 2.0
1.93k stars 435 forks source link

compensation method invoke before compensable method #346

Open stashslash opened 5 years ago

stashslash commented 5 years ago

i test the saga-spring-demo

if i modify the HotelBookingService add some sleep code, like:

@Compensable(compensationMethod = "cancel", timeout = 5)
  void order(HotelBooking booking) {
    if (booking.getAmount() > 2) {
      throw new IllegalArgumentException("can not order the rooms large than two");
    }
    booking.confirm();

   // sleep 10 sec, simulate slow db
    try {
      Thread.sleep(10 * 1000);
    } catch (InterruptedException e) {
      e.printStackTrace();
    }

    bookings.put(booking.getId(), booking);
  }

then, i call the booking rest api, and got the booking result from hotel service and car service, the result looks like:

it's looks like car service compensated success, but hotel service compensated failure.

so, did i missing anything?

zhfeng commented 5 years ago

It looks like the timeout happens during the HotelBookingService is sleeping ? And the compensate method was invoked before the booking is finished. So it maybe need to add a lock to check if the transaction is cancelled.

void order(HotelBooking booking) {
...
    locker.lock();
    if (! isCancelled) {
         doBooking();
    } else {
         throw Exception("transaction is cancelled");
    }
    locker.unlock();
}

void cancel(HotelBooking booking) {
     locker.lock();
     isCancelled = true;
     doCancel();
     locker.unlock();
}
stashslash commented 5 years ago

@zhfeng thx for reply. like you said, and i know add a lock should works. but my confusion is if i should add a lock check on each compensate method ? or it has a better way to solve that, like servicecmob-saga help me do it.

zhfeng commented 5 years ago

OK, I see. The alpha server needs to check the TxEndedEvent is received before doing the compensate. would you mind to raise a JIRA for this user case ?

Thanks @stashslash

WillemJiang commented 5 years ago

As it is caused by the timeout, Alpha don't have chance to know if the compensable method is finished or not. Alpha have to abort the transaction by invoke the compensation method.

I don't think a lock could resolve this issue perfectly, because of the timeout could be caused by lot of reason. Such as the if the network connection is broken, Alpha cannot receive the TxEndedEvent, but the compensable method is finished.

I'm think about let the Omega take control of timeout checking, it could be much better for us to handle the timeout situation.

stashslash commented 5 years ago

it sounds great! and i raised SCB-1057

zhfeng commented 5 years ago

@WillemJiang I am not sure how the Omega can handle this timeout situation ? The Alpha need to guarantee that the compensate method should be invoked after the local transaction is finished, is it right ?

WillemJiang commented 5 years ago

@zhfeng Alpha cannot know that, how about LRA? Does the Coordinator know if the action is timeout.

zhfeng commented 5 years ago

@WillemJiang I think the current Narayana LRA implements the timeout at the coordinate side.

  1. set a schedule to cancel the transaction when enlist the participant in LRARecord.
  2. doEnd in the LRA Transaction when timeout happens.

So it maybe has the similar situation that the compensate method could be invoked before the business is done. And I think it is not clear described at the LRA spec timeout.

I think we can bring this case to the LRA issues.

zhfeng commented 5 years ago

@stashslash I proposal to update the booking demo at this timeout situation. Can you check this could resolve your problem ?

As we discussed, the coordinator cannot guarantee that the compensate method will be invoked after the business is done.

WillemJiang commented 5 years ago

@zhfeng My question is what does doEnd do when the timeout happens. If the Omega can cancel the invocation that could be great.

zhfeng commented 5 years ago

@WillemJiang it cancels the transaction when timeout happens and this is done at the coordinator side.

zhfeng commented 5 years ago

@stashslash @WillemJiang it looks like we need to indicate in the cancel method that the compensate is ongoing. Now I think it might throw the Exception which could cause the alpha server to re-invoke this compensate method later.

void cancel(HotelBooking booking) throws Exception {
    Integer id = booking.getId();
    if (bookings.containsKey(id)) {
      bookings.get(id).cancel();
    } else {
      throw new Exception("can not cancel " + booking.getId());
    }
}

Also we could consider to add the status interface which reports to the alpha server.

WillemJiang commented 5 years ago

+1 for adding the status interface. BTW, I think we can let the omega check the type of exception to decide if it need to retry it again.

stashslash commented 5 years ago

@zhfeng if Omega could handle status,we just need bookings.get(id).cancel(); in compensate method, right?

zhfeng commented 5 years ago

@stashslash I think the compensate still needs to return the result to indicate if it has finished the work.

CompensateStatus cancel(HotelBooking booking) {
    Integer id = booking.getId();
    try {
        if (bookings.containsKey(id)) {
            bookings.get(id).cancel();
            return COMPENSATE_OK;  // we finish the compensate work 
        } else {
            return COMPENSATING; // we can not find the booking record to be compensated and it's maybe ongoing, so it needs to be re-invoked later
        }
    } catch (Exception e) {
           return COMPENSATE_FAIL; // sorry, we can not do the compensate work because there is something throws the exception
    }
}

The alpha will check the status by asking the Omega to report it periodically until the compensate is OK or FAIL.

stashslash commented 5 years ago

@zhfeng im not sure compensate how to catch exception from compensable, because i think only compensable itself can know if happen exception and handle COMPENSATING status by itself much better.

so, like you snippets, if something wrong happens in compensable, it'll hanging COMPENSATING status, right?

and why not let Omega to report COMPENSATING status around invoke compensable method, like this i think i just need bookings.get(id).cancel(); in compensate method.

zhfeng commented 5 years ago

@stashslash I agree with you that the Omega reports the COMPENSATING STATUS during the compensate method invoked. So if we only have the bookings.get(id).cancel() in the compensate method, let's see what could happen

1) everything goes smooth and the booking record has been inserted, and this is a good situation that the cancel() can returns OK. The Omega will report COMPENSATE_OK.

2) the same as above but the cancel() returns FAIL or throws the exception to indicate this status. The Omega will report COMPENSATE_FAIL.

3) The booking record has not been inserted, and the bookings.get(id).cancel() will throw the NPE exception I assume. The Omega will catch this exception and report COMPENSATING STATUS as we expect the alpha will re-invoke this method later. And the booking record will be inserted at last, and the cancel() will return OK or FAIL finally. The Omega will report COMPENSATE_OK or FAIL.

4) The booking record inserts failed due to the database errors and the Omega will catch this exception and mark this saga transaction is FAIL.

So the Omega will be able to know the return values or the exception from the cancel() method when reporting the status. It does make sense if we can well define these contracts between the framework and the applications.

stashslash commented 5 years ago

yes, its my fault, i think some actions like bookings.get(id).cancel(); will throws exception but actually not.

you're right and its clear, @zhfeng thx.

so should i close this issue?

zhfeng commented 5 years ago

you can hold the issue until we resolve the SCB-1057