Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions

Message loss due to race conditions with ContinueAsNew #67

Closed · cgillum closed this issue 5 years ago

cgillum commented 6 years ago

Summary

Message loss or instance termination may be observed if an orchestrator calls ContinueAsNew and any other message arrives during the window between the current execution completing and the restarted execution beginning.

Details

When the orchestrator completes execution after ContinueAsNew is called, the status is internally marked as "completed" and a new message is enqueued to restart it. During this window between completion and restart, it's possible for other messages to arrive (for example, raised events or termination messages). Because the orchestration's internal state is "completed", those messages will be dropped. It's also possible for the DTFx runtime to terminate the instance, citing possible state corruption.
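For illustration, a stripped-down orchestrator that hits this window might look like the following. This is a minimal sketch against the Durable Functions 1.x API, not the exact counter sample; names are illustrative.

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class Counter
{
    [FunctionName("Counter")]
    public static async Task Run(
        [OrchestrationTrigger] DurableOrchestrationContext context)
    {
        int counterState = context.GetInput<int>();

        // Block until a client raises an "operation" event.
        string operation = await context.WaitForExternalEvent<string>("operation");

        if (operation == "incr")
        {
            counterState++;
        }

        // After this call the current execution completes and a restart message
        // is enqueued. Any "operation" event that arrives before the restart is
        // processed lands on a "completed" instance and can be dropped.
        context.ContinueAsNew(counterState);
    }
}
```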

Repro

  1. Start with the counter sample
  2. Create a new instance
  3. Call RaiseEventAsync("operation", "incr") multiple times without waiting

Expected: Calling ContinueAsNew multiple times in quick succession should never be an issue. Many workloads may require this, especially actor-style workloads.

Workaround: Have the client wait a few seconds before sending events that may cause the orchestrator to do a ContinueAsNew. This gives the instance time to get a new execution ID.
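A client-side sketch of the repro steps and the workaround above, assuming the orchestrator sketch earlier in this issue. The HTTP trigger, loop count, and five-second delay are illustrative choices, not part of the original repro.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class CounterClient
{
    [FunctionName("CounterRepro")]
    public static async Task Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestMessage req,
        [OrchestrationClient] DurableOrchestrationClient client)
    {
        string instanceId = await client.StartNewAsync("Counter", 0);

        for (int i = 0; i < 5; i++)
        {
            // Repro: raising events back-to-back means some of them can arrive
            // while the previous execution is "completed" but not yet restarted.
            await client.RaiseEventAsync(instanceId, "operation", "incr");

            // Workaround: uncomment to give the instance time to pick up a new
            // execution ID before the next event is raised.
            // await Task.Delay(TimeSpan.FromSeconds(5));
        }
    }
}
```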

ericleigh007 commented 6 years ago

I guess the workaround to make things a bit more reliable is to use a while or for loop in the orchestrator, thereby staving off the ContinueAsNew until N events have been processed. Otherwise, Durable Functions are really great stuff. When do you think this extreme fragility in one of the guiding principles of Durable Functions will be addressed? Does it need major changes to function execution, changes within Durable Functions, or something else?
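Roughly what I have in mind, as a sketch only; the batch size of 100 is arbitrary, and this only makes the vulnerable window less frequent rather than closing it:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class BatchedCounter
{
    [FunctionName("BatchedCounter")]
    public static async Task Run(
        [OrchestrationTrigger] DurableOrchestrationContext context)
    {
        int counterState = context.GetInput<int>();

        // Handle a batch of events inside a single execution so ContinueAsNew
        // (and the completion/restart window that comes with it) happens far
        // less often.
        for (int i = 0; i < 100; i++)
        {
            string operation = await context.WaitForExternalEvent<string>("operation");
            if (operation == "incr")
            {
                counterState++;
            }
        }

        context.ContinueAsNew(counterState);
    }
}
```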

Thanks again for the great work.

Hassmann commented 6 years ago

Yes, please give us a solution. I am using a singleton and it is not reliable. If this is not handled, I would have to run a separate web app service just for a very small task. Or please give an indication if you do not plan to solve this in the near future.

cgillum commented 6 years ago

This one definitely needs to be solved, and it's high on the list since it impacts reliability for an important scenario. The change needs to be made in the Azure Storage extension of the Durable Task Framework: https://github.com/Azure/durabletask/blob/azure-functions/src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs

christiansparre commented 6 years ago

Yes, this one needs to be fixed. Was just playing around and ran into this immediately when creating a singleton function :(

cgillum commented 6 years ago

I spent more time looking at this and experimenting with different fixes. Ultimately, I came to the conclusion that fixing this would be far more difficult than I originally anticipated. The short version is that the Durable Task Framework just wasn't designed for RPC/actor-style messaging workloads, and this issue is one example of why.

This is pretty painful (and a little embarrassing), but I think the actor-style pattern will need to be removed from the list of supported patterns. If it comes back, most likely it will need to come back in a different form and with a different backend implementation - i.e. something other than the Durable Task Framework in its current form. I've received the most critical feedback on this pattern from trusted sources (e.g. awkward programming model and confusing performance expectations), so I feel deprecating it is the right decision moving forward.

I want to be clear though. WaitForExternalEvent, ContinueAsNew, and "stateful singletons" will continue to be fully supported. This is really about clarifying which patterns can be implemented using these tools. For example, WaitForExternalEvent is still perfectly valid for the "human interaction" pattern. ContinueAsNew is also perfectly valid for implementing "eternal orchestrations". The pattern which is broken is the "actor" pattern where arbitrary clients send event messages at arbitrary times to a running instance.
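For example, an eternal orchestration in the supported style might look like the following sketch; the activity name and the one-hour interval are placeholders:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class PeriodicCleanup
{
    [FunctionName("PeriodicCleanup")]
    public static async Task Run(
        [OrchestrationTrigger] DurableOrchestrationContext context)
    {
        // Do one unit of work per iteration (the activity name is a placeholder).
        await context.CallActivityAsync("CleanupActivity", null);

        // Sleep until the next run using a durable timer.
        DateTime nextRun = context.CurrentUtcDateTime.AddHours(1);
        await context.CreateTimer(nextRun, CancellationToken.None);

        // Restart with a fresh history. No external client is racing to send
        // events here, so the completion/restart window is not a problem.
        context.ContinueAsNew(null);
    }
}
```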

I'll make a point of updating the samples and the documentation to reflect this change of direction. Let me know if there are any questions I can answer or clarifications I can provide.

christiansparre commented 6 years ago

I'm sorry to hear that @cgillum, this was really one of the missing pieces that would allow us to move a quite complex message processing application over to functions. We have a single part of the application where we need to manage concurrency pretty strictly, but the other 90% of the application is perfect for functions.

I really hope this comes back in one form or the other because it would be a pretty strong capability in the functions world and would simplify my life quite massively :)

Thank you for explaining your reasoning and keep up the good work!

SimonLuckenuik commented 6 years ago

Thanks for the update @cgillum, I hope the Actor concept will come back in one form or another, this was enabling very interesting and complex scenarios.

Hassmann commented 6 years ago

Thanks, Chris. It's very valuable to have clarity, no matter the news.

I have been able to work around the issue so far and will follow that path. Performance and scalability are still a concern, though: it takes many seconds until a scheduled function is actually executed. Could we get in contact so I can put my mind at ease with a few questions?

Andreas Hassmann.

hello@tezos.blue

jedjohan commented 6 years ago

Thanks for the info, Chris. A bit sad to see, as I thought the actor pattern was the most interesting one. I ran into the problem in a small testing/learning e-shop project: when I "spam" the "add product to cart" button, some really strange things happen. Hope you find a solution :)

cfe84 commented 6 years ago

Adding to the "that's really sad" pile, this would have been SUCH a differentiator against Lambdas.

cgillum commented 6 years ago

I've done a quick analysis based on some experiments I ran a few months back and have updated the description at the top of this thread with my thoughts on how we could potentially fix this issue.

FYI @gled4er

gled4er commented 6 years ago

Hello @cgillum,

Thank you very much!

SimonLuckenuik commented 6 years ago

@cgillum, just thinking out loud: maybe ContinueAsNew could fetch the remaining events (based on the previous internal instance ID) just after starting the new orchestrator, making sure no events are lost?

SimonLuckenuik commented 6 years ago

Or maybe add a new API to fetch the "waiting but not processed" events so that we can retrieve them and provide them as input to ContinueAsNew?

cgillum commented 5 years ago

Resolved in the v1.8.0 release.

cfe84 commented 5 years ago

YOU, SIR, ARE THE BEST! \o/