Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions
MIT License

ContinueAsNew replay behavior #2606

Open jasonvangundy opened 1 year ago

jasonvangundy commented 1 year ago

Discussed in https://github.com/Azure/azure-functions-durable-extension/discussions/2600

Originally posted by **jasonvangundy** September 25, 2023

I'm loving durable functions, but am fairly new and may be misunderstanding the expected behavior in this context. I am receiving IoT data on a regular interval and create durable timers whenever I receive a new data message for a device. I am trying to use durable timers pitted against the incoming data to trigger an event when we do _not_ receive data within a certain time threshold. It feels fairly simple. My event hub client just raises an event whenever new data comes in:

```
client.raiseEvent(
  deviceStaleDataOrchestrationInstanceId(deviceId),
  ExternalEvents.DataReceived,
  iotData
)
```

I then have an orchestration that waits for either the timer or the raised event to complete. I'm using the eternal orchestration pattern to do this indefinitely.

```
const input = context.df.getInput();
const staleDataTimeoutTask = context.df.createTimer(
  input.messageTs.plus({ minutes: 30 }).toJSDate()
);
const newDataActivity = context.df.waitForExternalEvent(ExternalEvents.DataReceived);
const winner = yield context.df.Task.any([newDataActivity, staleDataTimeoutTask]);
if (winner === newDataActivity) {
  staleDataTimeoutTask.cancel();
  context.df.continueAsNew(newDataActivity.result);
} else {
  logger.warn(`device OFFLINE alert after 30 minutes!`);
}
```

The behavior that I'm seeing is:

- the first `yield context.df.Task.any([newDataActivity, staleDataTimeoutTask]);` receives the new data event as I would expect
- when I call `context.df.continueAsNew(newDataActivity.result);`, the new orchestration invocation has `isReplay = false` as I would expect
- when I reach `yield context.df.Task.any([newDataActivity, staleDataTimeoutTask]);` again, `isReplay` becomes `true` and the originally `yield`ed value is retrieved, rather than waiting for a new data event

This creates an infinite loop for me locally.

I tried augmenting the `continueAsNew` payload, changing one of the IoT data values to see if that changed the behavior, and it did not seem to. My understanding was that `continueAsNew` would wipe out the orchestration history, preventing any previous history or results from affecting the newly spawned orchestration. Am I misunderstanding this behavior, or is there possibly some other behavior at play that I do not understand?

jasonvangundy commented 1 year ago

An additional data point: omitting the timer code altogether seems to make `continueAsNew` behave as expected. Without the timer, the `const winner = yield context.df.Task.any([newDataActivity, staleDataTimeoutTask]);` line yields as expected and waits for a new data payload, rather than replaying the originally yielded value.

thegrekle commented 1 year ago

I've also experienced this and am unsure why it behaves as it does. It's not intuitive, if this is indeed the intended behavior.

nytian commented 1 year ago

Hi @jasonvangundy, I am sorry for the inconvenience; this behavior is not expected. The continue-as-new API should reset the history table. Which Durable Functions version and language are you using? Thanks!

jasonvangundy commented 1 year ago

Thanks for the follow-up @nytian. I am using

Not sure it's relevant, but I experienced this issue on both the default storage provider as well as the netherite provider.

nytian commented 1 year ago

Thanks for the information provided, @jasonvangundy. DF v2.1.3 is a relatively old version. The latest one is v2.10.0. Can you try with the latest version to see if the issue still occurs? That way we can confirm it is not a bug that has already been fixed.

jasonvangundy commented 1 year ago

I may be misunderstanding what you're asking for, but my answer was in reference to this dependency: https://www.npmjs.com/package/durable-functions?activeTab=versions. The latest 2.x.x version is 2.1.3. I have not yet upgraded to 3.x. What you referenced appears to be a .NET dependency and has different versions.

If you're asking what version of the durable task framework I'm using, then I'm not entirely sure. In Azure I'm on the latest, running 2.10.0, and anxiously awaiting the next extension bundle release. But this issue was all experienced locally with the following host.json settings:

```
"extensionBundle": {
  "id": "Microsoft.Azure.Functions.ExtensionBundle",
  "version": "[4.*, 5.0.0)"
},
```

nytian commented 1 year ago

Sorry for the confusion. I will try reproducing the issue on my end and will update here later. Thanks!

jasonvangundy commented 3 months ago

@nytian Was there any progress here? I am actually hitting what I believe is this same issue in a different context, again. In the end it's similar code from a DF perspective. Basically:

```
const timeoutTask = context.df.createTimer(inFiveMinutes);
const externallyRaisedEventTask = context.df.waitForExternalEvent('configurationUpdate');
const winner = yield context.df.Task.any([
  timeoutTask,
  externallyRaisedEventTask,
]);
if (winner === externallyRaisedEventTask) {
  const updatedConfiguration = externallyRaisedEventTask.result as Configuration;
  const updatedInput = {
    ...originalOrchestrationInput,
    configuration: updatedConfiguration,
  };
  logger.info(`configurationUpdatedEvent: new input: ${JSON.stringify(updatedInput)}`);
  timeoutTask.cancel();
  context.df.continueAsNew(updatedInput);
  return;
}
```

This results in an infinite loop. My logs show that after `continueAsNew`, when the two tasks are yielded, DF immediately responds with the `externallyRaisedEventTask` result from the prior execution.

The expected behavior is that, because continueAsNew has been called, and these tasks are recreated, DF should be waiting for a new event to be raised.
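
The difference between the expected and the observed behavior can be illustrated with a toy model. This is only a sketch and not the Durable Functions implementation: `raceAgainstHistory`, `continueAsNew`, the `leakPreviousEvents` flag, and the history entry shapes are all hypothetical names invented for illustration. It shows that a generation starting from an empty history has to wait, while a generation that still sees the previous event resolves immediately, which matches the symptom in the logs. Whether this is what actually happens inside the extension is not established in this thread.

```javascript
// Toy model only: this is NOT how the Durable Functions extension is implemented.
// A "history" is the list of events recorded for one generation of an instance.

// Resolve a timer-vs-event race against a history.
// Returns the name of the winning task, or null when the race has to wait.
function raceAgainstHistory(history, eventName) {
  for (const entry of history) {
    if (entry.type === "EventRaised" && entry.name === eventName) {
      return "externalEvent"; // replayed from history, no waiting
    }
    if (entry.type === "TimerFired") {
      return "timer";
    }
  }
  return null; // nothing recorded yet: the orchestrator must wait
}

// Hypothetical continue-as-new. A clean reset starts the next generation with
// an empty history; `leakPreviousEvents` models the faulty case.
function continueAsNew(previousHistory, leakPreviousEvents) {
  return leakPreviousEvents ? [...previousHistory] : [];
}

const generation1 = [{ type: "EventRaised", name: "configurationUpdate" }];

const cleanGeneration2 = continueAsNew(generation1, false);
const leakyGeneration2 = continueAsNew(generation1, true);

console.log(raceAgainstHistory(cleanGeneration2, "configurationUpdate")); // null (waits)
console.log(raceAgainstHistory(leakyGeneration2, "configurationUpdate")); // externalEvent
```

In the leaky case every generation sees the old event again and calls continue-as-new again, which is the infinite loop described above.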

Interestingly, I had a bug in my code originally. I had forgotten to call timeoutTask.cancel() when the external event won. In that case the timer halted execution for the remainder of its time, as expected. When the timer was up, the code executed as I would expect. This issue only cropped up when I added in the timeoutTask.cancel().

I humbly disagree that this issue should be categorized as "ease of use". To me it seems like I'm trying to use the framework exactly as it is documented and designed. Again, I am absolutely loving durable functions! This one has just reared its head on me multiple times now and is pretty painful to work around.

jasonvangundy commented 3 months ago

Another interesting addition. If I add a pause between the timer cancel and the continueAsNew, it does not cause an infinite loop. I.e.

```
timeoutTask.cancel();
yield context.df.createTimer(inFiveSeconds);
context.df.continueAsNew(updatedInput);
```
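
For reference, here is a minimal sketch of the whole orchestrator body with that pause in place, stepped by hand against a stub. The orchestrator uses only calls that appear in this thread (`getInput`, `createTimer`, `waitForExternalEvent`, `Task.any`, `cancel`, `continueAsNew`). The names `watchConfiguration`, `makeStubContext`, and `deadlines` are hypothetical, and the stub is an assumption standing in for the real context; in a real orchestrator the deadlines would be `Date` values as in the snippets above. The pause is the workaround as reported here, not a confirmed fix.

```javascript
// Orchestrator body: race a timeout against an external event, then restart.
function* watchConfiguration(context, deadlines) {
  const input = context.df.getInput();
  const timeoutTask = context.df.createTimer(deadlines.timeout);
  const eventTask = context.df.waitForExternalEvent("configurationUpdate");

  const winner = yield context.df.Task.any([timeoutTask, eventTask]);
  if (winner !== eventTask) {
    return "timedOut";
  }

  const updatedInput = { ...input, configuration: eventTask.result };
  timeoutTask.cancel();
  // Workaround from this thread: wait briefly between cancel and restart.
  yield context.df.createTimer(deadlines.pause);
  context.df.continueAsNew(updatedInput);
  return "restarted";
}

// Hypothetical stand-in for the real context: records calls, nothing more.
function makeStubContext(input, eventPayload) {
  const calls = [];
  const df = {
    getInput: () => input,
    createTimer: (fireAt) => {
      calls.push(`createTimer:${fireAt}`);
      return { cancel: () => calls.push("cancel") };
    },
    waitForExternalEvent: (name) => {
      calls.push(`waitForExternalEvent:${name}`);
      return { result: eventPayload };
    },
    Task: { any: (tasks) => ({ tasks }) },
    continueAsNew: (next) => calls.push(`continueAsNew:${JSON.stringify(next)}`),
  };
  return { context: { df }, calls };
}

// Drive the generator by hand, letting the external event win the race.
const { context, calls } = makeStubContext({ deviceId: "d1" }, { interval: 5 });
const run = watchConfiguration(context, { timeout: "T+5m", pause: "T+5s" });
const race = run.next().value;    // the yielded Task.any(...) placeholder
run.next(race.tasks[1]);          // event task wins; the pause timer is yielded
const outcome = run.next().value; // pause elapsed; continueAsNew is called
console.log(outcome, calls);
```

The recorded call order makes the sequencing explicit: the timeout is cancelled, the short pause timer is created, and only then is `continueAsNew` invoked.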

Is there potentially a race with the cancellation operation and the purging of history (as part of continueAsNew) such that some history remains behind and impacts subsequent executions?