dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.96k stars 4.65k forks source link

[API Proposal]: Cancel scopes #81874

Open agocke opened 1 year ago

agocke commented 1 year ago

Cancel scopes

This is the first proposal in what I hope to be a few proposals around a paradigm called "structured concurrency." The fundamental idea is that, by encouraging concurrent code to follow strict control flow and scoping rules, the overall code structure can be better understood and managed.

If you're not familiar with the concept I recommend reading the following articles by Nathan J. Smith, whose Trio library pioneered many of these concepts.

This proposal focuses on the theme in the first article: cancellation.

Background -- C# cancellation

First, some background on cancellation in C#. Cancellation in C# is almost entirely a library-level concept with no language integration. Users are expected to manually create and pass around CancellationTokens, which are then passed to library methods which have been written to take CancellationToken parameters. Methods which support cancellation are meant to regularly check the token for cancellation and throw if it has been cancelled, and consumers are meant to wrap code in try-catch blocks to capture OperationCanceledExceptions.

This creates various problems, as described in the article, including:

Proposal

To address these problems, I propose we introduce a new CancelScope type. The type definition looks like following.

sealed class CancelScope
{
    public static CancelScope? CurrentScope { get; }

    public static Task OpenScope(Func<CancelScope, Task> func);

    public CancellationTokenSource Cts { get; }
    ...
}

This is an incomplete definition, meant to introduce the proposal, rather than explain all the details.

Using a CancelScope is similar to using a CancellationToken and CancellationTokenSource today, except that instead of passing the token to nested async calls, you open a new nested CancelScope with CancelScope.OpenScope.

async Task UserMethod()
{
    await CancelScope.OpenScope(newScope =>
    {
        // Do whatever you would normally do
        // newScope.Cts.CancelAfter(50);
        // await Task.Delay(20);
        // await client.GetAsync();
        // await UserMethod2();
    });
}

Now, instead of having to pass a cancellation token to each method, transitively, a user only has to pass the token once, to the WithScope method. In addition to holding and passing tokens and token sources, CancelScope also provides a mechanism for handling cancellation. When cancellation is triggered for a given CancelScope, that CancelScope catches OperationCanceledException for its token source.

This addresses how to store cancellation tokens -- the next question is how to access them.

In general, the current cancellation token is always accessible via CancelScope.CurrentScope?.Cts.CancellationToken. If a method wanted to query the current token for cancellation, it could access the above property manually. However, this is laborious and is one of the problems we'd like to solve.

Instead of directing every library author to manually check for cancellation, I propose that we check for cancellation on every await. As described in the above article on the Trio library, this provides an easily syntactically identifiable place for cancellation to happen, and it should be a place where users expect their method to potentially leave their control.

With automatic checking for cancellation, most cancellation tasks should now be fairly easy. Of course, manual use of CancellationTokenSource and CancellationToken will be required for advanced scenarios, but that should rightly be the minority of uses.

Implementation details

I do not know the async machinery in full detail, so this is only one option for implementation. Most importantly, CancelScope.CurrentScope would be backed by an AsyncLocal variable that is updated every time a new scope is introduced. In particular, the implementation looks like this:

class CancelScope
{
    [AsyncLocal]
    private static AsyncLocal<CancelScope?> s_currentScope = null;
    public static CancelScope? CurrentScope => s_currentScope.Value;

    public CancellationTokenSource Cts { get; }

    private CancelScope()
    {
        Cts = new CancellationTokenSource();
    }

    public static Task OpenScope(async Func<CancelScope, Task> func)
    {
        var oldScope = s_currentScope;
        var newScope = new CancelScope();
        try
        {
            s_currentScope.Value = newScope;
            await func(newScope);
        }
        catch (OperationCanceledException e) when (e.CancellationToken == newScope)
        {
            // Do nothing, we could add an overload with another func to run here
        }
        finally
        {
            s_currentScope = oldScope;
        }
    }
}

This implementation should allow scopes to be nested and for each task to retrieve the appropriate value.

As mentioned above, the other piece of this proposal is to check for cancellation on every await. This could be accomplished by calling CancelScope.CurrentScope?.Cts.CancellationToken.ThrowIfCancellationRequested() at the beginning of the GetAwaiter() method.

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-threading-tasks See info in area-owners.md if you want to be subscribed.

Issue Details
# Cancel scopes This is the first proposal in what I hope to be a few proposals around a paradigm called "structured concurrency." The fundamental idea is that, by encouraging concurrent code to follow strict control flow and scoping rules, the overall code structure can be better understood and managed. If you're not familiar with the concept I recommend reading the following articles by Nathan J. Smith, who's Trio library pioneered many of these concepts. - [Timeouts and cancellation for humans](https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/) - [Structured concurrency](https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/) This proposal focuses on the theme in the first article: cancellation. ## Background -- C# cancellation First, some background on cancellation in C#. Cancellation in C# is almost entirely a library-level concept with no language integration. Users are expected to manually create and pass around CancellationTokens, which are then passed to library methods which have been written to take CancellationToken parameters. Methods which support cancellation are meant to regularly check the token for cancellation and throw if it has been cancelled, and consumers are meant to wrap code in `try-catch` blocks to capture `OperationCanceledException`s. This creates various problems, as described in the article, including: * Passing cancellation tokens through code is laborious and easily overlooked * API writers can often forget to wire through cancellation even though they should * Libraries that wish to support cancellation must manually add checks to see if the cancellation token has been triggered * Callers must manually handle `OperationCanceledException` ## Proposal To address these problems, I propose we introduce a new CancelScope type. The type definition looks like following. ```C# sealed class CancelScope { public static CancelScope? CurrentScope { get; } public static Task OpenScope(Func func); public CancellationTokenSource Cts { get; } ... } ``` This is an incomplete definition, meant to introduce the proposal, rather than explain all the details. Using a CancelScope is similar to using a CancellationToken and CancellationTokenSource today, except that instead of passing the token to nested async calls, you open a new nested CancelScope with `CancelScope.OpenScope`. ```C# async Task UserMethod() { await CancelScope.OpenScope(newScope => { // Do whatever you would normally do // newScope.Cts.CancelAfter(50); // await Task.Delay(20); // await client.GetAsync(); // await UserMethod2(); }); } ``` Now, instead of having to pass a cancellation token to each method, transitively, a user only has to pass the token once, to the `WithScope` method. In addition to holding and passing tokens and token sources, CancelScope also provides a mechanism for handling cancellation. When cancellation is triggered for a given CancelScope, that CancelScope catches OperationCanceledException for its token source. This addresses how to store cancellation tokens -- the next question is how to access them. In general, the current cancellation token is always accessible via `CancelScope.CurrentScope?.Cts.CancellationToken`. If a method wanted to query the current token for cancellation, it could access the above property manually. However, this is laborious and is one of the problems we'd like to solve. Instead of directing every library author to manually check for cancellation, I propose that we check for cancellation on every `await`. As described in the above article on the Trio library, this provides an easily syntactically identifiable place for cancellation to happen, and it should be a place where users expect their method to potentially leave their control. With automatic checking for cancellation, most cancellation tasks should now be fairly easy. Of course, manual use of CancellationTokenSource and CancellationToken will be required for advanced scenarios, but that should rightly be the minority of uses. ## Implementation details I do not know the async machinery in full detail, so this is only one option for implementation. Most importantly, `CancelScope.CurrentScope` would be backed by an `AsyncLocal` variable that is updated every time a new scope is introduced. In particular, the implementation looks like this: ```C# class CancelScope { [AsyncLocal] private static AsyncLocal s_currentScope = null; public static CancelScope? CurrentScope => s_currentScope.Value; public CancellationTokenSource Cts { get; } private CancelScope() { Cts = new CancellationTokenSource(); } public static Task OpenScope(async Func func) { var oldScope = s_currentScope; var newScope = new CancelScope(); try { s_currentScope.Value = newScope; await func(newScope); } catch (OperationCanceledException e) when (e.CancellationToken == newScope) { // Do nothing, we could add an overload with another func to run here } finally { s_currentScope = oldScope; } } } ``` This implementation should allow scopes to be nested and for each task to retrieve the appropriate value. As mentioned above, the other piece of this proposal is to check for cancellation on every `await`. This could be accomplished by calling `CancelScope.CurrentScope?.Cts.CancellationToken.ThrowIfCancellationRequested()` at the beginning of the `GetAwaiter()` method.
Author: agocke
Assignees: -
Labels: `area-System.Threading.Tasks`
Milestone: -
Clockwork-Muse commented 1 year ago

When cancellation is triggered for a given CancelScope, that CancelScope catches OperationCanceledException for its token source.

Does it just swallow it? Or something else? Note that the point of the type is already to be deliberately caught. OCE/TaskCanceledException are "expected exceptions" (like most I/O exceptions), and one you're encouraged to have control flow for.

That said, this is actually a duplicate: https://github.com/dotnet/runtime/issues/29881

agocke commented 1 year ago

Does it just swallow it? Or something else

See the implementation details. Basic API just swallows and moves on. Another delegate could be provided for cancel-specific behavior.

That said, this is actually a duplicate: https://github.com/dotnet/runtime/issues/29881

Not the same. This proposal is much more comprehensive with stricter semantics.

stephentoub commented 1 year ago

We've tried this multiple times in the past, once with a CancellationRegion and once with a CancellationScope, the latter of which was almost identical to what's being proposed (though it accepted a CancellationToken to a scope constructor instead of creating a CancellationTokenSource implicitly, and it had alternate designs for whether to run all the work in a delegate, in which case it could choose to eat cancellation exceptions, or just a using block that didn't have the overhead of the extra async lambda but didn't eat exceptions). We didn't ship them because there were too many problems with the design.

Many of the problems are to do with composability, in a variety of ways. It's not uncommon that an operation needs to perform some work using something that is cancelable but that must not be canceled even if the overall operation is, e.g. queueing a work item to handle asynchronous completion, regardless of the reason for that completion (which could include cancellation), such that if the work item were canceled, the whole thing would deadlock. It's also not uncommon to have multiple sources of cancellation, e.g. some external trigger as well as an internal trigger, for example someone wanting to be able to explicitly cancel an HTTP request but the HTTP implementation also needing to respect a timeout. The ambient nature of a proposal like this makes it incredibly difficult for implementations to get all variations of this "right". Should they ignore the ambient token if one exists? Should they combine it with one explicitly provided? There's no one right answer for everything. And this is made all the more complicated by us now having over a decade of libraries built with assumptions around this that will be violated based on whatever decisions were to be made.

There's also performance ramifications. Flowing state via ExecutionContext / AsyncLocals is not free. Every mutation to an AsyncLocal's value results in an allocation, creating a new immutable object to store the new state. And with the current implementation, which has evolved over the years to fine-tune it for current uses, the more async locals you have, the more expensive they all become: the async local values are all stored together in a single immutable object, such that changing any of them requires allocating a new object large enough to store the values for all of them. The proposed design would make it more common to both use and mutate AsyncLocals, and would thereby put increased pressure on the rest of the system. There's also the read cost; practically every method that today accepts a CancellationToken would effectively need to query a [ThreadLocal] that stores the current ExecutionContext containing those AsyncLocals, and perform a dictionary lookup to find the right state. That's going to be an increase in the peanut butter across the system. It's also not something that can easily be made lazy; many methods that accept CancellationTokens do an upfront cancellation check, and for consistency you'd want them to be checking the ambient token, plus doing whatever combination logic was needed to check against both the passed in token and the ambient one. On top of that you have the costs of the extra async state machine for the lambda, plus delegate allocation / closure allocation, etc. each time one of those delegate-based scopes is created.

You also have all existing code to contend with. Existing code won't be paying attention to the ambient scope, and folks just switching from passing in a token to instead using a scope will be sorely disappointed when many of the things they were hoping to be canceled aren't. With regards to "I propose that we check for cancellation on every await", that hits all the same problems previously described. It also doesn't help with leaf implementations that aren't using async/await because async/await is only applicable to composition, not to implementing the leaf operations that represent the actual I/O.

I understand the benefits of the proposed approach, and I understand why it's attractive in some ways. It's why we've invested in such an approach multiple times in the past. I'm not particularly excited about trying again, and I'm not optimistic the outcome would be any better.

agocke commented 1 year ago

@stephentoub Thanks for the comment! It has tons of useful detail and background.

One big takeaway I have, though, is that you didn't seem to disagree with the top-level points about problems in our cancellation story. It costs us a lot as we need to add, plum through, and check tokens, it costs libraries authors a lot to do the same, and it costs users who have to do all that, plus navigate the clutter that it adds to their APIs.

Overall, it feels like the vast majority of user cases are not being satisfied by our current system. I would bet that the vast majority of customers only deal with one token that they pass down to all their callees.

I know the challenge seems impossible, but I don't see how we can avoid continuing to try to solve this. Other platforms and tools have seemed to demonstrate that a simpler model is possible and having ours remain super complicated feels like stagnation.

To that end, I'd really like to try to really pan out and start thinking of solutions that we might have ruled out before. For instance, you mentioned that AsyncLocal perf was an impediment. Is there a change that we could make in the language, the runtime, or both to make AsyncLocal a lot cheaper? Or, maybe it's expensive in the very general case, but is there a really common case (say, where every async call is immediately awaited) where we could make it substantially cheaper?

I'm open to a lot of solutions here, I'd just like to see us build something transformative for users.

agocke commented 1 year ago

Btw, one example

, or just a using block that didn't have the overhead of the extra async lambda but didn't eat exceptions)

This is a super, super simple thing to fix in C# -- we should just add the ability for IDisposable, or a new inheriting interface, to handle exceptions. It seems like a minor oversight that we could easily fix if we were really worried about delegate cost.

stephentoub commented 1 year ago

One big takeaway I have, though, is that you didn't seem to disagree with the top-level points about problems in our cancellation story.

I do disagree, actually. You cited the following:

Callers must manually handle OperationCanceledException

Regardless of the model, callers need to react to cancellation happening just as they react to other failures that caused the code to not run to completion; just eating cancellation on behalf of someone and making the difference between success and cancellation evaporate leads to reliability problems. Developers writing such code to react to cancellation is not a flaw, it's a necessity; moreover, it would continue to be required with the proposal, even if there's ultimately a back stop for such exceptions. Catching the exceptions automatically would just make it more likely programs have bugs by papering over the fact that something which was supposed to happen didn't actually happen. There are of course cases where nothing need be done to react to cancellation occurring, but in my experience that's the significant minority case, not the majority.

Libraries that wish to support cancellation must manually add checks to see if the cancellation token has been triggered

Not really. Manual polling for cancellation is optional. If you're just composing other operations, the 99% case is you just pass along the token to those other operations. As an optimization, APIs will often do an upfront check for cancellation, but that's just to avoid doing any work if it's likely cancellation will have already been requested. And sometimes devs will add additional polling locations strategically placed just to increase responsiveness to cancellation in otherwise uninterruptable blocks. But you only need to write special code to react to cancellation when you're transitioning into something that doesn't already represent cancellation via CancellationToken, at which point you Register with the token to write the code that will forward along the cancellation signal in some manner. That code is necessary to bridge the worlds, and it would continue to be necessary with this proposal.

API writers can often forget to wire through cancellation even though they should

Yes, if you have a cancellation token, you might neglect to pass it to another method that expects one. That's why rules like CA2016 now exist:

image

https://learn.microsoft.com/en-us/dotnet/fundamentals/code-analysis/quality-rules/ca2016 If there are tweaks to this rule we need to make to help developers more, or if there are others rules we should be putting in place to aid developers, I'm all for improvements in the area.

Passing cancellation tokens through code is laborious and easily overlooked

I can sympathize with not loving passing the cancellation token around. I personally wouldn't, however, call it "laborious". If we could make this easier without incurring other negatives, I'd be interested in understanding what are those options. But I do believe something that can cause arbitrary code to fail needs to be forefront in the code, and I worry that having ambient signals affecting API calls without that showing up at all sites will actually be a significant net negative on reliability. If every function someone writes might be interrupted with exceptions at effectively arbitrary points, we're dangerously approaching a world of thread interrupts and thread aborts.

agocke commented 1 year ago

Good feedback. I disagree that dealing with cancellation tokens isn't laborious. It's a mostly mechanical operation that slows down my development and requires extra thought.

That said, I appreciate the feedback that the cure might be worse than the disease. I think the next step is for me to try to come up with a compelling sample of problems where a scoped solution makes things better without making anything else worse.

My suspicion is that your bar is pretty high here, so I'll try to write an actual library to demo this experience. Getting things to be accurate may be difficult, since I have to find clever ways to hack into the existing system, so it might take a little time.

timonkrebs commented 10 months ago

What about going more into a declarative direction? I made this library that explores a more declarative way to concurrency: https://github.com/timonkrebs/MemoizR#declarative-structured-concurrency

This approach should take away a lot of the burden of handling complex concurrent cancellation/error-handling. It is still in development but I think it already can add value for some cases.

This approach should also be able to handle resources for concurrency scenarios with the same cancellation and error-handling capabilities. Something like: https://github.com/StephenCleary/StructuredConcurrency#resources

PS: The Link in the description should be: https://vorpus.org/blog/timeouts-and-cancellation-for-humans/