dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.17k stars 4.72k forks source link

[API Proposal]: Add time properties to CancellationToken #59312

Closed fernacolo closed 2 years ago

fernacolo commented 3 years ago

Background and motivation

Modern REST APIs allow the client to specify a server timeout for the operation. An example is the service timeout parameter of the Azure Blob API. However, the CancellationToken type cannot be queried for the time left or the cancellation time, preventing proper leverage of server timeouts. The consequence is that the cancellation is captured by HttpClient, NetworkStream or some other network component, which often causes an abrupt interruption such as closing a TCP connection or at least leaving it in non-idle state for longer than required, which may lead to excess of consumption of network resources such as SNAT ports.

Furthermore, many SDK APIs today have to declare two parameters for aborting an operation earlier: one is the timeout as a scalar value, and the other is the CancellationToken. By adding time properties in the CancellationToken, there is a potential of simplifying future APIs.

API Proposal

namespace System.Threading
{
     public struct CancellationToken
     {
          // The estimated time for the scheduled cancellation, if any.
          // If there is no scheduled time for cancellation or if the time is not known, this
          // property returns DateTimeOffset.MaxValue.
          // If cancellation already happened, this property returns the time of cancellation,
          // which may be in the past.
          public DateTimeOffset CancellationTime { get; }

          // The time left before the scheduled cancellation.
          // This is a convenience method that computes a value based on current time and
          // the value of CancellationTime. If there is no scheduled time for cancellation, or
          // if the time is not known, this property returns TimeSpan.MaxValue.
          // If cancellation already happened, this property returns TimeSpan.Zero.
          public TimeSpan TimeLeft { get; }

          // The time left before the scheduled cancellation, in milliseconds.
          // This is a convenience method that computes a value based on current time and
          // the value of CancellationTime. If there is no scheduled time for cancellation, or
          // if the time is not known, this property returns the value of Timeout.Infinite.
          // If cancellation already happened, this property returns zero.
          public int MillisecondsTimeLeft { get; }
     }
}

Additional notes:

API Usage

// Use a simplified hypothetical BlobClient.
public async Task<byte[]> SafeDownloadBlobBytes(BlobClient client, string blobUri, CancellationToken ct)
{
    int timeLeft = ct.TimeLeft;
    if (timeLeft > TimeSpan.Zero) {
        // Allow 200ms for the response to reach this client after server timeout,
        // and assuming the client tolerates negative values.
        client = client.WithServerTimeout(timeLeft - TimeSpan.FromMilliseconds(200));
    }
    return await client.DownloadBlobBytes(blobUri, ct);
}

Risks

I think the main risks are misuse of the new properties in the following ways:

These risks can be mitigated with good documentation, samples and a blog entry in the announcement.

I don't see risk of performance regression because all those properties can be derived from a new primitive field added to CancellationTokenSource or the underlying Timer. It is possible that the value already exists and only needs to be retrieved. The system ought to have this value already, but I suppose it might be entrenched in some unmanaged code or OS API, making it difficult to retrieve on demand. That's why a new field may have to be added in the CancellationTokenSource.

dotnet-issue-labeler[bot] commented 3 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 3 years ago

Tagging subscribers to this area: @dotnet/area-system-threading-tasks See info in area-owners.md if you want to be subscribed.

Issue Details
### Background and motivation Modern REST APIs allow the client to specify a server timeout for the operation. An example is the [service timeout](https://docs.microsoft.com/en-us/rest/api/storageservices/setting-timeouts-for-blob-service-operations) parameter of the Azure Blob API. However, the `CancellationToken` type cannot be queried for the time left or the cancellation time, preventing proper leverage of server timeouts. The consequence is that the cancellation is captured by `HttpClient`, `NetworkStream` or some other network component, which often causes an abrupt interruption such as closing a TCP connection or at least leaving it a non-idle state for longer than required time, which may lead excess of consumption of network resources such as SNAT ports. Furhtermore, many SDK APIs today have to declare two parameters for aborting an operation earlier: one is the timeout as a scalar value, and the other is the `CancellationToken`. By adding time properties in the `CancellationToken`, there is a potential of simplifying future APIs. ### API Proposal ```C# namespace System.Threading { public struct CancellationToken { // The estimated time for the scheduled cancellation, if any. // If there is no scheduled time for cancellation or if the time is not known, this // property returns DateTimeOffset.MaxValue. // If cancellation already happened, this property returns the time of cancellation, // which may be in the past. public DateTimeOffset CancellationTime { get; } // The time left before the scheduled cancellation. // This is a convenience method that computes a value based on current time and // the value of CancellationTime. If there is no scheduled time for cancellation, or // if the time is not known, this property returns TimeSpan.MaxValue. // If cancellation already happened, this property returns TimeSpan.Zero. public TimeSpan TimeLeft { get; } // The time left before the scheduled cancellation, in milliseconds. // This is a convenience method that computes a value based on current time and // the value of CancellationTime. If there is no scheduled time for cancellation, or // if the time is not known, this property returns the value of Timeout.Infinite. // If cancellation already happened, this property returns zero. public int MillisecondsTimeLeft { get; } } } ``` Additional notes: - The source of the information should probably be stored in the `CancellationTokenSource`. However because the primary scenario is for code that sees the `CancellationToken`, the properties should be specified in the `CancellationToken`. - The value returned by the `CancellationTime` property may change if the method `CancellationTokenSource.CancelAfter` is called. This must be well-documented. - The time properties should be provided as best-effort or estimates of the real values, as it is common for timeout processing. For example, in case of high CPU usage that prevents cancellation processing, the actual cancellation may happen after the reported time. This must be well-documented. - The time properties do not prevent cancellation to be triggered before the scheduled time. If some code calls `CancellationTokenSource.Cancel`, the cancellation would happen before the scheduled time. This must be well-documented. ### API Usage ```C# // Use a simplified hypothetical BlobClient. public async Task SafeDownloadBlobBytes(BlobClient client, string blobUri, CancellationToken ct) { int timeLeft = ct.TimeLeft; if (timeLeft > TimeSpan.Zero) { // Allow 200ms for the response to reach this client after server timeout, // and assuming the client tolerates negative values. client = client.WithServerTimeout(timeLeft - TimeSpan.FromMilliseconds(200)); } return await client.DownloadBlobBytes(blobUri, ct); } ``` ### Risks I think the main risks are misuse of the new properties in the following ways: - The developer presumes that cancellation never happens when `TimeLeft` is `TimeSpan.MaxValue`. In fact, cancellation may be triggered by `CancellationTokenSource.Cancel`. - The developer presumes that `TimeLeft` represents real time left and decides to implement timeout without actually listening for cancellation. In fact, cancellation may be triggered earlier by `CancellationTokenSource.Cancel`, and `TimeLeft` can be further reduced by a new call to `CancellationTokenSource.CancelAfter`. - Bad implementation of time arithmetics. These risks can be mitigated with good documentation, samples and a blog entry in the announcement. I don't see risk of performance regression because all those properties can be derived from a new primitive field added to `CancellationTokenSource` or the underlying `Timer`. It is possible that the value already exists and only needs to be retrieved. The system ought to have this value already, but I suppose it might be entrenched in some unmanaged code or OS API, making it difficult to retrieve on demand. That's why a new field may have to be added in the `CancellationTokenSource`.
Author: fernacolo
Assignees: -
Labels: `api-suggestion`, `area-System.Threading.Tasks`, `untriaged`
Milestone: -
Clockwork-Muse commented 3 years ago

Generally speaking, most higher-level APIs should be passing tokens, and those few cases requiring timeouts will merge token sources.

The value returned by the CancellationTime property may change if the method CancellationTokenSource.CancelAfter is called. This must be well-documented.

... this is the only way for it to be set in the first place... from C#. The OS gets to dictate this, too, as well as other client/server configs, including hardware drivers.

// The estimated time for the scheduled cancellation, if any.
// If there is no scheduled time for cancellation or if the time is not known, this
// property returns DateTimeOffset.MaxValue.
// If cancellation already happened, this property returns the time of cancellation,
// which may be in the past.
public DateTimeOffset CancellationTime { get; }

Aside from the fact that it may be more appropriate for the property to be public DateTimeOffset? CancellationTime { get; }, the value returned is an estimate for another reason - generally speaking a monotonic clock (ie, a strict count of seconds) should be used for things like timeouts, because the computer clock may change it's understanding of "current" time independently.

// The time left before the scheduled cancellation.
// This is a convenience method that computes a value based on current time and
// the value of CancellationTime. If there is no scheduled time for cancellation, or
// if the time is not known, this property returns TimeSpan.MaxValue.
// If cancellation already happened, this property returns TimeSpan.Zero.
public TimeSpan TimeLeft { get; }

Should probably return Timeout.Infinite as well.

The developer presumes that TimeLeft represents real time left and decides to implement timeout without actually listening for cancellation. In fact, cancellation may be triggered earlier by CancellationTokenSource.Cancel, and TimeLeft can be further reduced by a new call to CancellationTokenSource.CancelAfter.

CancelAfter also allows the time left to be increased.

Note that tokens can be reset in the next release.

fernacolo commented 3 years ago

@Clockwork-Muse, thanks for your comments. Can you elaborate more on "merge token sources" below?

Generally speaking, most higher-level APIs should be passing tokens, and those few cases requiring timeouts will merge token sources.

Clockwork-Muse commented 3 years ago

Most APIs don't have (or need) an override with a timeout parameter, because in many cases the cancellation will either be user driven (click a cancel button, in which case no timeout), or by a lower-level config. In the config cases the API used won't take a timeout but only a token, after which point the client would create its own token source by calling CreateLinkedTokenSource().

In your example use case, you're using TimeLeft to drive an "independent" cancellation time, which feels strange - my naïve gut reaction is that you're just going to eat up extra time by sectioning things off that way (and probably cancelling a few things that would otherwise have succeeded), rather than help drive faster cancels. In other words, it feels like things should be adding time, not subtracting it.

Clockwork-Muse commented 3 years ago

The consequence is that the cancellation is captured by HttpClient, NetworkStream or some other network component, which often causes an abrupt interruption such as closing a TCP connection or at least leaving it in non-idle state for longer than required, which may lead to excess of consumption of network resources such as SNAT ports.

Also, I don't know enough about the Hardware/OS/Networking stack, but this feels wrong somehow. In most cases your connections are expected to succeed, which will end just as abruptly.

fernacolo commented 3 years ago

@Clockwork-Muse

Server applications and daemon applications running on personal devices often declare timeouts at their own settings and create an explicit cancellation token with the configured timeout, either directly or joining an existing token with CreateLinkedTokenSource.

In the config cases the API used won't take a timeout but only a token

That's exactly the problem! The low-level code that takes such token has no hint about the timeout. This precludes efficient logic such as forwarding the timeout to the backend service, as explained in the blob endpoint, and further below.

Consider code that calls a remote service running on HTTP/1.1 which accepts a server timeout parameter. Without the server timeout, the remote service would typically use a large value and keep running for long after the cancellation was triggered in the client. The client is then forced to either drop the HTTP/1.1 connection or wait for it to finish and drain the response, which might be huge. Both choices present risk of SNAT port exhaustion, which can be catastrophic depending on E2E retry implementations. Not to mention that a huge response to drain would be a waste of server resources and network bandwidth.

By knowing the time left for the request (from CancellationToken properties), client code can submit the server timeout, which would cause the service to return HTTP 412 Not Enough Time Given (a variant of Precondition Failed). The request finishes early at the remote service, and the connection stays available for other requests. Furthermore, because the remote service returns an error condition and not a real response, server resources and bandwidth are not wasted.

Also, I don't know enough about the Hardware/OS/Networking stack, but this feels wrong somehow. In most cases your connections are expected to succeed, which will end just as abruptly.

The properties are not intended to help the happy cases where the cancellation is never triggered. They are intended to work as a hint for handling the exceptional cases where the cancellation would be triggered. Low-level IO code often benefits from knowing the timeout and can execute more efficiently.

scalablecory commented 3 years ago

Are you able to use HTTP/2 by chance? This has much more performant options for request cancellation.

theodorzoulias commented 3 years ago
// The time left before the scheduled cancellation.
// This is a convenience method that computes a value based on current time and
// the value of CancellationTime. If there is no scheduled time for cancellation, or
// if the time is not known, this property returns TimeSpan.MaxValue.
// If cancellation already happened, this property returns TimeSpan.Zero.
public TimeSpan TimeLeft { get; }

Should probably return Timeout.Infinite as well.

@Clockwork-Muse do you mean Timeout.InfiniteTimeSpan?

Clockwork-Muse commented 3 years ago

@theodorzoulias - Yes

fernacolo commented 3 years ago

@scalablecory

Are you able to use HTTP/2 by chance? This has much more performant options for request cancellation.

The HTTP/2 allows the client to send RST_STREAM frame to the server when cancellation is requested, but (1) it consumes more bandwidth and client resources, and (2) if the timeout is caused by network congestion, there is no guarantee that the server will get the frame.

In fact, to defend against network congestion, it's often the case that we need to avoid using the network itself just for control messages.

There is value in fire-and-forget semantics where the client can have fair assumption that the server is not working in the request anymore after the timeout is triggered. There are many protocols and distributed algorithms that rely on this feature.

scalablecory commented 3 years ago

@scalablecory

Are you able to use HTTP/2 by chance? This has much more performant options for request cancellation.

The HTTP/2 allows the client to send RST_STREAM frame to the server when cancellation is requested, but (1) it consumes more bandwidth and client resources, and (2) if the timeout is caused by network congestion, there is no guarantee that the server will get the frame.

In fact, to defend against network congestion, it's often the case that we need to avoid using the network itself just for control messages.

There is value in fire-and-forget semantics where the client can have fair assumption that the server is not working in the request anymore after the timeout is triggered. There are many protocols and distributed algorithms that rely on this feature.

Thanks. As I understand it, your desire is that you can do something like:

async Task CallMyCoolApi(CancellationToken cancellationToken)
{
    using var request = new HttpRequestMessage();

    if(cancellationToken.CancelDateTimeOffset is DateTimeOffset date)
    {
        request.Headers.Add("Timeout-After", date.ToString("R"));
    }

    using HttpResponseMessage response = await client.SendAsync(msg, cancellationToken);
    // ...
}

Where the server would be able to cancel the request itself rather than wait for the client to send a RST_STREAM. So this wouldn't be a direct benefit to e.g. HttpClient but it would allow you a standard way to funnel timeouts to places that accepted them.

The benefit here would really be that the server might stop processing sooner, before the RST_STREAM will arrive, and save some CPU. Note the pattern in my sample has both server-side and client-side cancellation because it's possible that network/server issues means the client could hang otherwise, so a RST_STREAM will get sent and an OperationCancelledException thrown regardless unless clocks are out of sync enough relative to network latency. And similarly, HTTP/1 connections would be killed and all the overhead there would still happen.

I'm not sure if this alone is enough benefit to make me want this change, but I do see the benefit. I'd love to see some more people comment with scenarios they'd use this in; the idea of further making cancellation token proper a stand-in for a timeout is interesting.

fernacolo commented 3 years ago

@scalablecory

And similarly, HTTP/1 connections would be killed and all the overhead there would still happen.

Why HttpClient would close a healthy HTTP/2 connection? It's just the stream that got closed, not the entire connection.

commonsensesoftware commented 2 years ago

I stumbled across by happenstance and decided I'd throw my 2¢ in.

While it's true that you can have time-based cancellations, every scenario I've ever encountered can be modeled through the combination of CancellationTokenSource and CancellationToken. I've seen attempts to mix and match these in the past and it just leads to unnecessarily complexity, confusion, and bugs. I witnessed this with older APIs that didn't accept CancellationToken and it turned into a real mess. I've also seen a parent's CancellationTokenSource passed down too, which is also wrong IMO.

Timeouts should be policy-based IMHO. There are arguably scenarios where the duration should be configurable, but I'd argue they are rare. I'm sure there are a lot of people who disagree. Who knows what a sensible timeout value should be better than the devs that wrote it? Allowing a timeout to be configurable introduces a new class of issues related to misconfiguration. In a shipped product, yeah - maybe it's needed. Maybe it's just me, but I can't think of a single time I ever needed to change a timeout value that wasn't better suited by an assumed default base on an established policy.

Configuration or not, the timeout policy should be well-defined. The upstream initiator (e.g. parent) should always be responsible for defining the downstream policy. At any point in time, a unit of work can impose their stated policy without having to worry about some external scope.

As an example of elaborating on some of the previous examples.

public class CoolApi
{
   readonly IHttpClientFactory factory;
   readonly IOptions<HttpClientOptions> options;
   public CoolApi(IHttpClientFactory factory, IOptions<HttpClientOptions> options)
   {
      this.factory = factory;
      this.options = options;
   }
   public async Task Invoke(CancellationToken cancellationToken)
   {
        // configuration or no, I dictate the policy; for example, 2 seconds
        var timeout = options.Value.Timeout;

        using client = factory.CreateClient();

        // RFC 7240: I can state policy as my 'preference' to the server, but it's not obligated to honor it
        client.DefaultRequestHeaders.TryAddWithoutValidation("Prefer", "wait=" + (int) timeout.TotalSeconds);

        // we don't know the upstream policy and it's irrelevant to us. cancellation may happen because
        // we exceeded our policy (say 2 seconds) or the parent's policy.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(timeout);

        using var response = await client.PostAsync("/invoke-coolness", cts.Token);
        response.EnsureSuccessStatusCode();
   }
}

RFC 7240 §4.3 provides a nice example of how a client can express its preference using the Prefer header. I would argue that 412 is not appropriate because that is a precondition. The only way that would happen is if the server receives 0 or a negative value. The client should already know that and not even make the request (or receive 400). 503 is more commonly used if a request times out. Regardless, the server may have its own policies that the client is not aware of, the server can ignore the client's preference (the server is in charge), the client could give up, or the client could get disconnected.

Let's use slightly bigger numbers. Let's assume the overall timeout established by a parent set things to 30 seconds (a la CancellationTokenSource.CancelAfter(TimeSpan.FromSeconds(30))). Our Cool API sets a timeout of 5 seconds (a la a refined policy of CancellationTokenSource.CancelAfter(TimeSpan.FromSeconds(5))). At this point, the API request cannot take more than 5 seconds - period; it may be less. Knowing how much time is left is not particularly interesting nor buys a lot of value. Giving the server a hint based on our preference is reasonable, but it already has to contend with disconnects, which includes the client giving up. There are all kinds of other issues related to monotonic clocks, accuracy, and so on. That also makes it very difficult to test.

It's also worth noting the cancellation via CancellationToken is not a thread abort. Cancellation is mutual. Cancellation may simply abandon the request and put the underlying connection back in the pool for use. This will partly depend on how your app is designed.

stephentoub commented 2 years ago

I appreciate the scenarios and the thought that went into the proposal, but this isn't something we're going to add. Timeouts can and are layered on top, with most CancellationToken{Source}s having no knowledge that they might eventually have cancellation requested due to a timeout. Adding this infrastructure into CancellationToken would be useful in only few niche scenarios, would complicate general consumption, and most importantly would not be pay-for-play (every CancellationTokenSource would become more expensive because of it).