Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions
MIT License

Timeout still occurring on rewind #2350

Open s-pilo opened 1 year ago

s-pilo commented 1 year ago

When I rewind an orchestration that waits on an external event, the timeout timer is not created relative to the current time; it reuses the timestamp from the original orchestration execution. This is a problem when a downstream issue has been resolved and we want to rewind in order to retry, rather than restart the entire orchestration.
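To make the reported behavior concrete, here is a minimal sketch in plain Python. It does not use the Durable Functions SDK, and it is not the extension's actual implementation: the helper names `timeout_deadline` and `fires_immediately`, the dates, and the assumption that the deadline is simply "orchestration time plus timeout" are all hypothetical simplifications.

```python
from datetime import datetime, timedelta, timezone


def timeout_deadline(orchestration_time: datetime, timeout: timedelta) -> datetime:
    """Deadline of the timeout timer, relative to the orchestration's clock (assumed model)."""
    return orchestration_time + timeout


def fires_immediately(deadline: datetime, wall_clock_now: datetime) -> bool:
    """A timer whose deadline is already in the past expires as soon as it is scheduled."""
    return deadline <= wall_clock_now


# Hypothetical timeline: the original run started waiting at 09:00 UTC with a 5-minute timeout.
original_time = datetime(2023, 1, 10, 9, 0, tzinfo=timezone.utc)
timeout = timedelta(minutes=5)

# The rewind is issued two hours later, after the downstream issue has been fixed.
rewind_time = original_time + timedelta(hours=2)

# Reported behavior: the deadline is derived from the original (replayed) timestamp.
replayed_deadline = timeout_deadline(original_time, timeout)
# Desired behavior: the deadline is derived from the time of the rewind.
fresh_deadline = timeout_deadline(rewind_time, timeout)

print(fires_immediately(replayed_deadline, rewind_time))  # True: times out again at once
print(fires_immediately(fresh_deadline, rewind_time))     # False: a full timeout window remains
```

Under this model, any rewind issued later than the original timeout window hits the timeout again, no matter how quickly the external event would now arrive.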

In my case, the orchestration waits on an external event, and a separate durable function listens for a response on Service Bus and raises the event when the response arrives. Is this behavior by design or an oversight? If it is by design, how should I handle this case?

https://github.com/Azure/azure-functions-durable-extension/blob/2f830a8dfd6629bc75f300b90a481e9df7d013da/src/WebJobs.Extensions.DurableTask/ContextImplementations/DurableOrchestrationContext.cs#L998

cgillum commented 1 year ago

This is a known issue with the current design of timers and rewind, unfortunately. It's something we want to revisit when we get around to improving the rewind API and making it "GA".

Adding @jviau, @sebastianburckhardt, and @lilyjma since we will want to discuss this when it comes time for us to revisit the rewind implementation and prepare it for GA.

jviau commented 1 year ago

@s-pilo can you clarify the state your orchestration was in when you ran retry? Had it already processed the external event, or was it still waiting on the external event?

s-pilo commented 1 year ago

@jviau it had timed out and thrown an exception, so was in a failed state. I was attempting to rewind.

jviau commented 1 year ago

@s-pilo, at what point did it time out, though? Did it time out before or after receiving the event?

s-pilo commented 1 year ago

> @s-pilo, at what point did it time out, though? Did it time out before or after receiving the event?

Before. My team and I are discussing either re-architecting to eliminate the reliance on the timeout or finding another workaround until rewind is revisited and, hopefully, this scenario is supported.