dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans

Non-transient errors in reliable stream processing. #915

Open jason-bragg opened 8 years ago

jason-bragg commented 8 years ago

When performing high performance reliable stream processing using Orleans streams, persisting processed results for each event may lead to performance issues. In these cases, results tend to be kept in memory until a key event or time interval triggers persistence of a ‘checkpoint’. If an error occurs in the stream processing, the consumer may recover by loading the last checkpoint, rewinding the stream to the point where the checkpoint was taken, and then continuing normal processing from there.
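For concreteness, the pattern looks roughly like the sketch below. This assumes a `Grain<TState>`-based consumer using the grain and streaming APIs; the provider name, stream namespace, checkpoint cadence, and type names are all placeholders, not a prescribed design.

```csharp
using System;
using System.Threading.Tasks;
using Orleans;
using Orleans.Streams;

// Illustrative checkpoint: the in-memory results plus the stream position they cover.
public class SumCheckpoint
{
    public long Sum { get; set; }
    public StreamSequenceToken LastToken { get; set; }
}

public class SumConsumerGrain : Grain<SumCheckpoint>, IAsyncObserver<int>
{
    private StreamSequenceToken lastSeenToken;

    public override async Task OnActivateAsync()
    {
        // Recovery: resume ("rewind") the subscription from the last checkpointed position.
        // A null token (first activation) just subscribes from the current point.
        var stream = GetStreamProvider("StreamProvider")
            .GetStream<int>(this.GetPrimaryKey(), "numbers");
        await stream.SubscribeAsync(this, State.LastToken);

        // Periodically persist a checkpoint instead of writing state per event.
        RegisterTimer(_ => CheckpointAsync(), null,
            TimeSpan.FromSeconds(30), TimeSpan.FromSeconds(30));
    }

    public Task OnNextAsync(int item, StreamSequenceToken token = null)
    {
        State.Sum += item;      // results are kept in memory between checkpoints
        lastSeenToken = token;
        return Task.CompletedTask;
    }

    private Task CheckpointAsync()
    {
        if (lastSeenToken == null) return Task.CompletedTask;
        State.LastToken = lastSeenToken;
        return WriteStateAsync(); // one write covers everything processed up to lastSeenToken
    }

    public Task OnErrorAsync(Exception ex) => Task.CompletedTask;
    public Task OnCompletedAsync() => Task.CompletedTask;
}
```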

This approach works well for recovering from transient errors, but it requires that the consumer know with certainty if an error is transient. If this recovery logic is attempted on a non-transient error (an event processing bug for instance), this could leave the stream processing logic in an infinite recovery loop.

It is reasonable to place the responsibility of detecting such infinite recovery loops on the application logic. The application will have a better understanding of the nature of the stream processing and how to handle unexpected errors. However, I’m raising this issue to explore capabilities Orleans streams could provide to help developers address this issue.

jason-bragg commented 8 years ago

Rewind Threshold Warning

In the above scenario, since a consumer would repeatedly rewind the stream to the point where processing was when it wrote its last checkpoint, the stream infrastructure could detect an infinite recovery loop by noticing repeated rewind requests to the same point in a stream on the same subscription.

Upon detecting a consumer repeatedly rewinding to the same stream position on a subscription, Orleans streams could communicate this to the consumer by calling the consumer's OnErrorAsync with a specific exception (RepeatedRewindWarningException?).

Application developers could detect this exception and act accordingly to break out of the infinite retry loop.
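For example (hypothetical, since neither the exception type nor the threshold exists today; `giveUpOnRecovery` and `HandleDeliveryErrorAsync` are placeholder application members):

```csharp
// Sketch only: RepeatedRewindWarningException is the proposed (not existing) exception type.
public Task OnErrorAsync(Exception ex)
{
    if (ex is RepeatedRewindWarningException)
    {
        // The stream infrastructure is telling us we keep rewinding to the same position.
        // Break out of the recovery loop: skip ahead, park the stream, raise an alert, etc.
        giveUpOnRecovery = true;
        return Task.CompletedTask;
    }

    // Any other delivery error keeps the existing recovery behavior.
    return HandleDeliveryErrorAsync(ex);
}
```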

gabikliot commented 8 years ago

Is it just identical to the poison message scenario, or is it something special in the way Orleans streaming works? It sounds like a regular poison message problem. What is Event Hub's solution to that? Do they (maybe the event processor) provide an attempts counter and do something with it?

sergeybykov commented 8 years ago

This is more of a "poison sequence of messages" problem. Since the stream gets rewound to an early offset, we can't pinpoint the specific poison message.

gabikliot commented 8 years ago

So it's not the case of a specific message throwing upon its processing?

sergeybykov commented 8 years ago

Not really. It may throw here. But after the grain deactivates itself (because it potentially got into an inconsistent state), it will rewind upon reactivation to an early offset in the stream and repeat the whole thing again.

jason-bragg commented 8 years ago

Scenario to help explore problem

We want to reliably calculate the sum of a stream of integers and store the result.

A recoverable stream of integers, where 'iN' is the integer event and 'tN' is the stream sequence token for that event: (i1,t1), (i2,t2), … (i1000,t1000)

1. The consumer receives the first event in the stream (i1,t1) and checkpoints the sum and sequence token (Sum(i1), t1).
2. After the consumer receives the next 499 events, a 'checkpoint' timer fires, so the consumer checkpoints the sum and the sequence token of the last event processed (Sum(i1-i500), t500).
3. At event (i700,t700) the consumer encounters an error and deactivates.
4. The stream infrastructure, having received an error while delivering event (i700,t700), retries delivery of the event.
5. Redelivery reactivates the consumer grain, which reloads its last checkpoint (Sum(i1-i500), t500) and rewinds the stream to t500.
6. If the error was transient and no longer occurs, recovery will succeed and processing will continue. If it was not, the consumer will encounter the error again and deactivate, repeating the recovery logic.

If the error is non-recoverable, this may lead to an infinite recovery loop.
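To make the loop concrete, the consumer side of steps 3-5 might look roughly like this (a sketch; `Process`, `sum`, and the deactivate-on-error policy are placeholders for whatever the application does):

```csharp
// Sketch of the consumer's processing path in the scenario above; names are illustrative.
public Task OnNextAsync(int item, StreamSequenceToken token = null)
{
    try
    {
        sum = Process(sum, item); // fails at (i700, t700)
        return Task.CompletedTask;
    }
    catch (Exception)
    {
        // Step 3: the grain may be in an inconsistent state, so it deactivates and lets the error
        // propagate. The stream infrastructure retries delivery (step 4); reactivation reloads the
        // (sum(i1-i500), t500) checkpoint and rewinds to t500 (step 5). Note that any in-memory
        // retry counter is lost here, so the grain cannot tell on its own that it is looping.
        DeactivateOnIdle();
        throw;
    }
}
```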

Question: How does the application know it’s in this potentially infinite loop?

gabikliot commented 8 years ago

The example you provided is a clear example of a single poison msg, number 700.

I am not in general against new concepts, such as a poison subsequence, but I would encourage looking deeper into existing abstractions first, before inventing new ones.

If the example were that no single msg could be blamed, it would be different. But that is not your example. Of course, one can come up with such a theoretical example, but is it really your case?

gabikliot commented 8 years ago

So to your last question: it seems like you can deal with this case just like you would deal with a poison msg. Thus my question: does Event Hub provide any support for poison msgs?

jason-bragg commented 8 years ago

"The example you provided is a clear example of single poison msg, number 700." I don't disagree, but the usual poison message scenario, to my understanding, deals with errors processing a single message. Our event delivery retry policy is sufficient to handle the usual poison message scenario, but it is insufficient for the example scenario. Am I missing something?

"one can come up with such theoretical example, but is it really your case?" This is a real case from a real Orleans customer. One you're quite familiar with. :)

"Does event hub provide any support for poison msg?" To my knowledge, event hub has no concept of poison message, but if it did, I don't know how that would resolve this.

gabikliot commented 8 years ago

The customer is real, but the subsequence example is not, I think, and that is what I meant. Our delivery retry policy is of course NOT sufficient to handle a poison msg. First, we don't differentiate between delivery and processing errors; second, we don't count errors across cursors, across all attempts to process the msg. But we can extend it.

The question here, if I understand it correctly, is whether you are going to disallow rewind after X times, or consider an individual msg as bad and stop delivering it after it fails to process X times. There are scenarios where the former may be required, but I am not convinced that is the case here, and you didn't convince me it is. Your example was the latter. Your case can be handled, I think, by properly implementing poison msg. Can't it?

gabikliot commented 8 years ago

Perhaps I misunderstood the problem, since Sergey wrote that you can't pinpoint the specific poison msg. In such a case it's different and indeed requires a concept of a poison sequence. But Jason's example was different. Thus I wonder what the actual case here is.

jason-bragg commented 8 years ago

"The customer is real, but subsequence example is not i think and that is what I meant." This is not hypothetical. The infinite recovery loop is a real scenario being encountered and we've been asked to consider functionality Orleans streams can provide to help.

"The question here if I understand it correctly is if you are going to disallow rewind upon X time, or consider an individual msg as bad and stop delivering it upon failure to process X times." The goal of this thread is to explore solutions, the "Rewind Threshold Warning" is just one suggested solution. The "Rewind Threshold Warning" proposal did not disallow rewind or stop delivery, it simply warned the consumer of its aberrant behavior and left it to the consumer to determine the appropriate action to take.

"Your case can be handled I think by properly implementing poison msg." Can you elaborate on how Orleans streams could properly handle poison messages and how that would address the provided scenario?

"you can't pinpoint the specific poison msg" In the scenario I tried not to be too specific about the nature of the error; whether it was a bad message, a bug, an outage of a dependent service, ... In my view, what is relevant is whether the error is recoverable or not. Triggering recovery logic only makes sense if the error is considered recoverable. Since it's not always clear what errors are recoverable, there must be some mechanism for determining when to give up on recovery. In the example scenario, since the grain is being deactivated and losing all state, it has no memory of how many times its tried recovering. We could leave it to the application layer to solve this, but there may be things the stream infrastructure can do to help application developers.

gabikliot commented 8 years ago

Just too many unrelated things bundled together. 😞

Subscribing at point X is not directly related to rewind or recovery. We use it to do that, but the streaming API has absolutely nothing about recovery. So unless you add a new concept of recovery, you have to talk in the old concept: just subscribe at point X. In this case, you are not deciding when to stop recovering, but how many times to attempt to process the same msg.

Of course the scenario is real. I understood that from the very first sentence. I am asking: in order to solve it, should we solve it as a case of poison msg, or come up with a new poison sequence concept?

To solve poison msg you need to keep state, per msg, on how many times processing failed. The state can be kept in memory, in the cached msg wrapper. The only hard question is whether this is a count per consumer or a global one. We can start with global, which is much easier and keeps less state. Once the counter is hit, you can either stop ever trying to deliver that msg, or log a warning. Same as in your suggestion.
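Roughly something like this, just as a sketch (none of these types exist today; names are made up):

```csharp
using System;
using Orleans.Streams;

// Hypothetical: a per-message failure counter kept in the cached message wrapper.
public class CachedMessageWrapper
{
    public IBatchContainer Batch { get; set; }     // the queued message being delivered
    public int DeliveryFailureCount { get; set; }  // global count of failed processing attempts
}

public static class PoisonMsgPolicy
{
    private const int MaxFailures = 5; // threshold before the msg is treated as poison

    // Called by the delivery path when processing of this message throws.
    // Returns true once the msg should be dropped (or the consumer warned).
    public static bool RecordFailure(CachedMessageWrapper msg)
    {
        msg.DeliveryFailureCount++;
        return msg.DeliveryFailureCount >= MaxFailures;
    }
}
```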

If you did what you suggested (a warning on the number of subscribes), would you need to count attempts to subscribe at a certain offset, i.e. basically comparable state from a state-size perspective?

jason-bragg commented 8 years ago

"in order to solve it, should we solve it as a case of poison msg or come up with new poison sequence concept?" Ah. I think I understand. I don't think this fits the poison message problem. I agree it would be valuable to spend some time conceptually framing the problem prior to suggesting solutions. I don't know of poison sequence is the right concept either, because both cases seem to imply the error is related to the data. I'll give more thought to conceptual framing.

"To solve poison msg you need to..." That makes sense. We could wrap errors thrown by the OnNextAsync to determine the difference between delivery errors and processing errors.

"would you need to count attempts to subscribe at certain offset" Yeah. I'd initially envisioned this being tracked per subscription, not globally, but I've not really delved that deep into it. I was kinda hoping better ideas would surface. :)

gabikliot commented 8 years ago

So a potentially better/easier idea is to handle it as a poison msg. In your case, is it always the same msg that triggers the rewind? If yes, you can solve it as a poison msg case. If not, then you can't.