eclipse / microprofile-lra


Clarify Timeout handling #302

Closed fachat closed 4 years ago

fachat commented 4 years ago

I read in https://github.com/eclipse/microprofile-lra/blob/master/spec/src/main/asciidoc/microprofile-lra-spec.adoc#329-timing-out-lras that "the ability to compensate or complete may be transient capabilities of a service so participants can also be timed out."

It is unclear to me how that situation should be handled. When adding network delays to the various communication paths, one could easily construct a situation where a heuristic result may occur. E.g. the participant registers with a timeout, but the network delays its message (think async messages). In the meantime, the orchestrator asks the LRA to complete (close). The participant does not know about it yet, so it times out and cancels, and you end up with an inconsistent situation.

fachat commented 4 years ago

What could be a solution?

A time limit may be a hint to the LRA, but the participant should not assume anything about the state of the LRA, and must inform the LRA that it is about to cancel, which in turn cancels the whole LRA. In the above example (when the orchestrator has already closed the LRA), it may get a 409 Conflict, which tells the participant that it cannot cancel but must complete. It is a burden on the participant, but one that must be taken on to avoid this heuristic situation.

In my experience it is of great value to avoid heuristic situations from the start by making the specifications as clear as possible. I've had to fix too many systems that scaled up and produced thousands of such situations a day.
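To make the proposal concrete, here is a minimal Java sketch of the participant-side decision. It only illustrates the idea described above and is not part of the MicroProfile LRA API: the class `ParticipantTimeoutSketch`, the `Action` enum, the `onLocalTimeout` method, and the `cancelRequest` callback (standing in for an HTTP call that asks the LRA to cancel and returns the status code) are all hypothetical names. Treating 200 as "cancel accepted" is also an assumption; only the 409 case comes from the proposal.

```java
import java.util.function.ToIntFunction;

// Hypothetical sketch: none of these names come from the MicroProfile LRA API.
final class ParticipantTimeoutSketch {

    /** What the participant does locally after asking the LRA to cancel. */
    enum Action { COMPENSATE, COMPLETE, RETRY_LATER }

    /**
     * Called when the participant's local time limit expires.
     * The participant never decides alone; it asks the LRA first.
     * cancelRequest is an assumed stand-in for an HTTP call returning a status code.
     */
    static Action onLocalTimeout(String lraId, ToIntFunction<String> cancelRequest) {
        int status = cancelRequest.applyAsInt(lraId);
        if (status == 200) {
            return Action.COMPENSATE;  // assumed: the LRA accepted the cancel
        }
        if (status == 409) {
            return Action.COMPLETE;    // the LRA was already closed, so complete instead
        }
        return Action.RETRY_LATER;     // outcome unknown: do not guess, ask again later
    }

    public static void main(String[] args) {
        System.out.println(onLocalTimeout("lra-1", id -> 200));
        System.out.println(onLocalTimeout("lra-1", id -> 409));
        System.out.println(onLocalTimeout("lra-1", id -> 503));
    }
}
```

The point of the third branch is that any answer other than a clear accept or conflict leaves the participant without knowledge of the LRA state, so it should neither compensate nor complete on its own.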

xstefank commented 4 years ago

Hi @fachat. Thanks for the question. Since LRA is based on eventual consistency and, as you've mentioned, network delays must also be taken into account, we unfortunately can't fully avoid heuristic situations.

On the timeout point: in your example, the participant cannot cancel on its own (otherwise heuristics indeed cannot be prevented in such situations). In a coordination-based implementation, the coordinator controls the value of the timeout, so it knows that if the timeout defined by the participant (assuming it's the soonest timeout defined) elapses after a Close action has already been initiated, then it should be ignored. In other words, it should be the implementation (coordinator) that cancels the LRA on timeout, not the participant itself. Unfortunately, network delays must be taken into account when the timeout is defined.
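The "ignore the timeout once Close has been initiated" rule can be sketched as a single atomic state transition. This is only an illustration under stated assumptions, not how Narayana or any other implementation actually does it: the class `CoordinatorTimeoutSketch`, its `Status` enum (a reduced, hypothetical subset of LRA states), and the methods `close` and `onTimeout` are invented for the example.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of a coordinator-side LRA record; not taken from any real implementation.
final class CoordinatorTimeoutSketch {

    /** Reduced set of states, just enough to show the race. */
    enum Status { ACTIVE, CLOSING, CANCELLING }

    private final AtomicReference<Status> status = new AtomicReference<>(Status.ACTIVE);

    /** Close request from the application; succeeds only while the LRA is still active. */
    boolean close() {
        return status.compareAndSet(Status.ACTIVE, Status.CLOSING);
    }

    /** Timer callback; returns false (timeout ignored) if Close or Cancel already started. */
    boolean onTimeout() {
        return status.compareAndSet(Status.ACTIVE, Status.CANCELLING);
    }

    Status status() {
        return status.get();
    }

    public static void main(String[] args) {
        CoordinatorTimeoutSketch lra = new CoordinatorTimeoutSketch();
        System.out.println("close accepted: " + lra.close());
        System.out.println("timeout applied: " + lra.onTimeout());
        System.out.println("final status: " + lra.status());
    }
}
```

Because both transitions start from `ACTIVE` and use compare-and-set, whichever of Close or the timeout arrives first wins and the other becomes a no-op, so the decision is made in one place rather than by each participant.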

I believe that you also see the issue with Complete/Compensate being called before the LRA method is run (correct me if I'm wrong please). We actually discussed something similar not so long ago in https://github.com/eclipse/microprofile-lra/issues/301.

fachat commented 4 years ago

Ok, thanks for the clarification. It must be clear that the participant cannot cancel on its own or heuristics will occur. It has to go back to the coordinator.

As it seems it is not just me struggling with the interpretation, would it be a good idea to add a section to the spec that defines "requirements" or "best practices" for the participant, also in light of the discussion in #301? I could not find anything in that direction in the spec, but I've only skimmed it again just now.

xstefank commented 4 years ago

Just to clarify: it doesn't necessarily need to be a coordinator; rather, the participant must go back to the implementation (we are trying to avoid references to a coordinator in the specification because we want to encourage other implementation approaches, e.g. routing slip). I used the coordinator as an example because that is how the Narayana implementation handles timeouts.

I agree that such a section could be beneficial. However, we should wait for the outcome of #301 as it is still not clear whether we will place the requirement of timing handling between LRA and Complete/Compensate methods on users or we can move it to the specification.

Shall we continue on #301 and close this issue?

ochaloup commented 4 years ago

Hi @fachat. Do you think this issue could be closed as @xstefank suggested (the heuristic timeout issue was explained)? You could then create a new issue asking to add a new section to the spec ("requirements", "best practices", ...) and we can discuss the details there.

fachat commented 4 years ago

Yes, thanks, the issue has been explained.