Closed cgillum closed 5 years ago
I guess the workaround to make things a bit more reliable is to use a `while` or `for` loop in the orchestrator, thereby staving off the `ContinueAsNew` until N events have been processed. Otherwise, Durable Functions is really great stuff. When do you think this extreme fragility in one of DF's guiding principles will be addressed? Does it need major changes to function execution, changes within DF itself, or something else?
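To make the idea concrete, here is a minimal Python sketch of the batching loop (this is not the real Durable Functions API; `orchestrator`, `events`, and `batch_size` are invented for illustration). The point is that the orchestrator drains up to N events per generation, so the risky completed-but-not-yet-restarted window occurs once per batch instead of once per event:

```python
# Hypothetical sketch (not the real Durable Functions API): loop inside the
# orchestrator and only "ContinueAsNew" after batch_size events, shrinking
# the completed-but-not-restarted window to once per N events.

def orchestrator(state, events, batch_size=10):
    """Process up to batch_size events, then signal that a restart
    (i.e. ContinueAsNew) is needed."""
    processed = 0
    while processed < batch_size and events:
        event = events.pop(0)
        state.append(event)           # apply the event to orchestration state
        processed += 1
    # Restart if we hit the batch limit or there is still work queued;
    # the real orchestrator would call ContinueAsNew(state) here.
    restart_needed = bool(events) or processed == batch_size
    return state, restart_needed

state, restart = orchestrator([], ["e1", "e2", "e3"], batch_size=2)
```

With `batch_size=2` and three pending events, the sketch processes two events and reports that a restart is still required for the remainder.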
Thanks again for the great work.
Yes, please, give us a solution. I am using a singleton and it is not reliable. If this is not handled, I would have to run a separate web app service just for a very small task. Or please give an indication if you do not plan to solve this in the near future.
This one definitely needs to be solved, and it's high on the list since it impacts reliability for an important scenario. The change needs to be made in the Azure Storage extension of the Durable Task Framework: https://github.com/Azure/durabletask/blob/azure-functions/src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs
Yes, this one needs to be fixed. Was just playing around and ran into this immediately when creating a singleton function :(
I spent more time looking at this and experimenting with different fixes. Ultimately, I came to the conclusion that fixing this would be far more difficult than I originally anticipated. The short version is that the Durable Task Framework just wasn't designed for RPC/actor-style messaging workloads, and this issue is one example of why.
This is pretty painful (and a little embarrassing), but I think the actor-style pattern will need to be removed from the list of supported patterns. If it comes back, most likely it will need to come back in a different form and with a different backend implementation - i.e. something other than the Durable Task Framework in its current form. I've received the most critical feedback on this pattern from trusted sources (e.g. awkward programming model and confusing performance expectations), so I feel deprecating it is the right decision moving forward.
I want to be clear though. `WaitForExternalEvent`, `ContinueAsNew`, and "stateful singletons" will continue to be fully supported. This is really about clarifying which patterns can be implemented using these tools. For example, `WaitForExternalEvent` is still perfectly valid for the "human interaction" pattern. `ContinueAsNew` is also perfectly valid for implementing "eternal orchestrations". The pattern which is broken is the "actor" pattern, where arbitrary clients send event messages at arbitrary times to a running instance.
I'll make a point of updating the samples and the documentation to reflect this change of direction. Let me know if there are any questions I can answer or clarifications I can provide.
I'm sorry to hear that @cgillum, this was really one of the missing pieces that would allow us to move a quite complex message processing application over to functions. We have a single part of the application where we need to manage concurrency pretty strictly, but the other 90% of the application is perfect for functions.
I really hope this comes back in one form or the other because it would be a pretty strong capability in the functions world and would simplify my life quite massively :)
Thank you for explaining your reasoning and keep up the good work!
Thanks for the update @cgillum, I hope the Actor concept will come back in one form or another, this was enabling very interesting and complex scenarios.
Thanks, Chris. It's very valuable to have clarity, no matter the news.
I could work around the issue so far and will follow this path then. Performance and scalability are still a concern, though. So many seconds until a scheduled function is actually executed... Could we get in contact to put my mind at ease with a few questions?
Andreas Hassmann.
hello@tezos.blue
Thanks for the info Chris. A bit sad to see as I thought the actor pattern was the most interesting one. I ran into the problem for a small testing/learning e-shop project. When I "spam" the "add product to cart"-button some really strange things happen. Hope you find a solution :)
Adding to the "that's really sad" pile, this would have been SUCH a differentiator against Lambdas.
I've done a quick analysis based on some experiments I ran a few months back and have updated the description at the top of this thread with my thoughts on how we could potentially fix this issue.
FYI @gled4er
Hello @cgillum,
Thank you very much!
@cgillum, just thinking out loud: maybe add some behavior to `ContinueAsNew` that fetches the remaining events (based on the previous internal instance ID) just after the new orchestrator starts, making sure no events are lost?
Or maybe add a new API to fetch the "waiting but not processed" events, so that we can retrieve them and provide them as input to `ContinueAsNew`?
YOU, SIR, ARE THE BEST! \o/
Summary
Message loss or instance termination may be observed if an orchestrator completes after calling `ContinueAsNew` and subsequently processes any other message before restarting.
Details
When the orchestrator completes execution after `ContinueAsNew` is called, the status is internally marked as "completed" and a new message is enqueued to restart it. During the window between completion and restart, it's possible for other messages to arrive (for example, raised events or termination messages). Because the internal state of the orchestration is "completed", those messages will be dropped. It's also possible for the DTFx runtime to terminate the instance, claiming possible state corruption.
Repro
Expected: Calling `ContinueAsNew` multiple times in quick succession should never be an issue. Many workloads may require this, especially actor-style workloads.
Workaround: Have the client wait a few seconds before sending events that may cause the orchestrator to do a `ContinueAsNew`. This gives the instance time to get a new execution ID.
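A slightly more robust variant of this workaround, sketched below, is to poll until the instance reports a new execution ID instead of sleeping for a fixed time. `get_execution_id` here is a hypothetical stand-in for whatever status query the client has available; this is an illustration of the idea, not a documented API:

```python
# Hedged sketch: rather than a fixed sleep, poll until the execution ID
# changes before raising the next event. get_execution_id is a hypothetical
# stand-in for a real orchestration status query.
import time

def wait_for_new_execution(get_execution_id, old_id, timeout=10.0, interval=0.5):
    """Return True once the execution ID differs from old_id, else False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_execution_id() != old_id:
            return True
        time.sleep(interval)
    return False

# Simulated status query: the ID changes on the third poll.
ids = iter(["exec-1", "exec-1", "exec-2"])
ok = wait_for_new_execution(lambda: next(ids), "exec-1", timeout=5, interval=0.01)
```

If the ID never changes within the timeout, the caller can fall back to retrying or surfacing an error rather than losing the event.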