ranasch opened 9 months ago
Amazing bug report @ranasch, I've been facing the exact same issue and had just made a stripped-down example when I saw this as I came to post.
So it's not a complete waste of time, here is the basic example I came up with: Functions V4, all the latest relevant packages, .NET 8 Isolated, tested locally with Azurite.
```csharp
using System;
using System.Threading.Tasks;
using Microsoft.DurableTask;
using Microsoft.DurableTask.Client;
using Microsoft.DurableTask.Entities;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;

public class EntityLockExample
{
    [Function(nameof(EntityLockExample.EntityLockExampleTrigger))]
    public async Task EntityLockExampleTrigger(
        [HttpTrigger(AuthorizationLevel.Function, "get", Route = $"{nameof(EntityLockExample.EntityLockExampleTrigger)}/{{throwError}}")] HttpRequestData requestData,
        bool? throwError,
        [DurableClient] DurableTaskClient durableTaskClient)
    {
        await durableTaskClient.ScheduleNewOrchestrationInstanceAsync(
            new TaskName(nameof(this.EntityLockExampleOrchestration)), throwError.GetValueOrDefault(false));
    }

    [Function(nameof(EntityLockExample.EntityLockExampleOrchestration))]
    public async Task EntityLockExampleOrchestration(
        [OrchestrationTrigger] TaskOrchestrationContext orchestrationContext, bool throwError)
    {
        // Take the entity lock, run the activity inside the critical section,
        // and rely on the await-using dispose to release the lock.
        EntityInstanceId entityLock = new EntityInstanceId(nameof(EntityLockExample), "myLock");
        await using (await orchestrationContext.Entities.LockEntitiesAsync(entityLock))
        {
            await orchestrationContext.CallActivityAsync(
                new TaskName(nameof(this.EntityLockExampleActivity)), throwError);
        }
    }

    [Function(nameof(EntityLockExample.EntityLockExampleActivity))]
    public void EntityLockExampleActivity([ActivityTrigger] bool throwError)
    {
        if (throwError)
        {
            throw new Exception("Activity Failed");
        }
    }
}
```
With this code you can run it as many times as you want passing throwError = false, but as soon as you pass throwError = true once, the lock will never be released and all subsequent requests will be held forever. The entity does know that more and more tasks are in its queue though, so that state is getting updated.
I also tried to capture any exceptions inside the critical section and rethrow them outside the using block after the lock had been disposed, but this sadly did not help.
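What I tried looked roughly like this (a sketch reusing the names from the example above, not my exact code):

```csharp
// Sketch: swallow the exception inside the critical section so the
// await-using block exits normally, then rethrow after the lock handle
// has been disposed. (Reuses entityLock/throwError from the example above.)
Exception? captured = null;
await using (await orchestrationContext.Entities.LockEntitiesAsync(entityLock))
{
    try
    {
        await orchestrationContext.CallActivityAsync(
            new TaskName(nameof(this.EntityLockExampleActivity)), throwError);
    }
    catch (Exception ex)
    {
        captured = ex; // hold the failure until the lock is released
    }
}

// The lock handle has been disposed at this point, yet the release was
// still never persisted once the orchestration went on to fail here.
if (captured is not null)
{
    throw captured;
}
```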
Another "workaround" I have at the moment is to never allow an orchestration that uses critical sections to enter a failed state. To do this I capture any exception in the activities and store that in a custom state. This is less than ideal though as using the DFMonitor it looks like everything is running fine when really there may be many issues.
I have read some other tickets related to this issue where people were asking about the Netherite storage engine, which I hoped would solve this, but sadly I can't get it to work with .NET 8 Isolated. Might be uncouth, but here is a link to that issue in the Netherite repo.
Thanks @sebastianburckhardt for already assigning this to yourself; if you have any other ideas or workarounds that would be ace.
I have been able to confirm that the current implementation does not properly release the lock when an exception is thrown inside a critical section. I have not yet been able to determine why though.
The problem seems to be with how OutOfProcMiddleware.CallOrchestratorAsync is implemented. Specifically, when an orchestrator function fails, the OrchestratorExecutionResult returned by the remote worker is ignored, and instead an alternate OrchestratorExecutionResult.ForFailure is constructed from the exception returned by the function result. This is not a good idea because (a) the original failure details, which contain the actual stack trace of the exception in user code, are lost, and (b) any orchestrator actions that took place prior to the completion (e.g. sending lock-release messages) are ignored.
I think the solution is: if the function result is a failure, first try to use the original execution result (provided that execution result is also a failure), and only construct a new failure result if the original execution result is not a failure.
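To illustrate the proposed ordering (a hypothetical sketch; workerResult, functionResult, and WorkerResultIsFailure are illustrative names, not the actual middleware source):

```csharp
// Hypothetical sketch of the selection logic in CallOrchestratorAsync.
// 'workerResult' stands for the OrchestratorExecutionResult returned by the
// remote worker; 'functionResult' for the host-side invocation result.
OrchestratorExecutionResult result;
if (functionResult.Succeeded)
{
    result = workerResult;
}
else if (workerResult != null && WorkerResultIsFailure(workerResult))
{
    // The worker already reported the failure: keep its result so the
    // original failure details (user-code stack trace) and any actions
    // emitted before failing (e.g. lock-release messages) survive.
    result = workerResult;
}
else
{
    // Only synthesize a failure when the worker's own result is not one.
    result = OrchestratorExecutionResult.ForFailure(
        "Orchestrator function failed.", functionResult.Exception);
}
```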
PR #2748 contains a fix for this issue and (barring any complications) can go out with the next release.
Correction: sadly this issue is not resolved, I must have tested it incorrectly... My original "EntityLockExample" code can be used to demonstrate this with 1.1.2. Trying to test now with 1.1.3, my code can't even seem to take a lock out anymore, even after fully resetting Azurite just in case something was stuck in a queue.
I can confirm that with the latest release on NuGet this issue is resolved. Thanks @sebastianburckhardt for working on it.
@sebastianburckhardt @DEndersby91 Not sure how you confirmed it. When I upgraded Microsoft.Azure.Functions.Worker.Extensions.DurableTask from 1.1.1 to 1.1.3 in my sample above, I don't get a lock anymore. This call does not return:
```csharp
var entityHandle = await context.Entities.LockEntitiesAsync(entityId);
```
What am I missing?
You've made me worried now. I tested the above code I linked in my original comment and it no longer locks up for me after an error. Let me add some delays inside the activity, queue a few up, and see if the lock is taken out.
Are you testing with an existing hub that already has a locked up entity? I don't think this fix will correct things historically so some cleaning up might be required.
No - testing locally with Azurite, and tried with an existing hub as well as discarding/creating a new hub from scratch. Same result.
You're 100% right @ranasch, I don't know what I was testing before, but this issue is sadly not resolved with the current 1.1.2 release. Trying to test now with 1.1.3 and I can't seem to take a lock out even when using a fresh Azurite instance.
OK, so reading other issues, this might be caused by entities not getting picked up: https://github.com/Azure/azure-functions-durable-extension/issues/2830
1.1.4 seems to fix this for me
This is happening to me in 1.1.4. I'm occasionally getting errors like this:
Exception while executing function: Functions.KccConsumerOrchestrationFunction Did not find any initialized language workers
When this happens, the orchestration fails, and entity locks are not released without manual intervention to restart the orchestration.
Description
Running a .NET 8 isolated durable function which has a critical section guarded by a durable entity. When the orchestration fails, the entity stays locked and no further orchestrations can proceed.
Expected behavior
When the critical section (using) is left via an exception, the built-in dispose should release the orchestration's lock on the entity; otherwise it stays locked forever and no other orchestration can obtain a lock.
Actual behavior
When the critical section is left via an exception, the lock is not released on the entity, because the release is not persisted to Azure Table storage.
Relevant source code snippets
The created entity will stay locked by the critical section unless you also implement a custom Dispose which, by calling the entity a final time, ensures the unlocked state is persisted in table storage:
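A minimal sketch of that idea, assuming a hypothetical wrapper around the lock handle and a made-up no-op "ping" operation that the entity simply ignores:

```csharp
// Hypothetical wrapper: dispose the real lock handle first, then call the
// entity one final time so its state (including the released lock) gets
// persisted to table storage. Names are illustrative, not production code.
public sealed class PersistingEntityLock : IAsyncDisposable
{
    private readonly TaskOrchestrationContext context;
    private readonly EntityInstanceId entityId;
    private readonly IAsyncDisposable innerLock;

    public PersistingEntityLock(
        TaskOrchestrationContext context, EntityInstanceId entityId, IAsyncDisposable innerLock)
    {
        this.context = context;
        this.entityId = entityId;
        this.innerLock = innerLock;
    }

    public async ValueTask DisposeAsync()
    {
        await this.innerLock.DisposeAsync();
        // Final call-out: "ping" is a made-up operation the entity ignores;
        // the call just forces the entity to run and persist its state.
        await this.context.Entities.CallEntityAsync(this.entityId, "ping");
    }
}
```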
Unfortunately this approach has a catch: it throws an ambiguous workflow exception. So it's not the solution.
Known workarounds
Tried the custom dispose; however, that throws this exception:
App Details
Project:
settings:
Screenshots
Lock before disposing: [screenshot]
Entity unlocked after custom callout in dispose: [screenshot]
If deployed to Azure
This is a simplified excerpt of our production function. For production details reach out to me directly to get a repro.