Azure / durabletask

Durable Task Framework allows users to write long running persistent workflows in C# using the async/await capabilities.
Apache License 2.0
1.53k stars 296 forks source link

Process crashes and hangs when passing parameters that contain CancellationToken #1158

Open cgillum opened 2 months ago

cgillum commented 2 months ago

There have been several reports of CLR crashes and process hangs when using DTFx, specifically when a call to an activity function, continue-as-new, etc. passes in a parameter that contains a reference to CancellationToken. This also happens when OrchestrationContext.CreateClient<T> is used and methods in T interface with CancellationToken in their signatures.

The root issue is that the serialization code used by DTFx will serialize CancellationToken objects in a way that can corrupt CLR memory, resulting in unexpected behavior.

The mitigation is to remove CancellationToken parameters from interfaces that are wrapped using OrchestrationContext.CreateClient<T> or from argument values that are passed to ScheduleTask*, ContinueAsNew, etc.

Emeka-MSFT commented 2 months ago

Could another workaround be to change the default serialization class and/or parameters?

Emeka-MSFT commented 2 months ago

By default, all DTFx async methods must support cancellation token as the last parameter. From what I noticed, it is a current implementation gap, i.e. bug.

cgillum commented 2 months ago

Could another workaround be to change the default serialization class and/or parameters?

Yes, but this would be a breaking change under normal circumstances. It might be possible to add some customization to look for this case, specifically, and avoid serializing CancellationToken if we see it, though I'm not sure how exactly it would need to be implemented.

By default, all DTFx async methods must support cancellation token as the last parameter. From what I noticed, it is a current implementation gap, i.e. bug.

If this was in reference to my point about removing CancellationToken parameters from wrapped interfaces, I was referring to user-provided interfaces, not DTFx interfaces.

mkraghavan commented 2 months ago

By default, all DTFx async methods must support cancellation token as the last parameter. From what I noticed, it is a current implementation gap, i.e. bug.

Do we have an ETA for when this will be fixed? This is hitting us bad in our production workloads and we are forced to restart orchestrators intermittently to work around this issue.

cgillum commented 2 months ago

@mkraghavan what fix are you referring to? The process crash mentioned in this issue can be worked around by removing CancellationToken from your list of parameters. Also, the sentence you're quoting is not related and not something that is going to be changed anytime soon.