dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.37k stars 4.75k forks source link

[API Proposal]: When stopping an IHost instance, IHostedService instances StopAsync method is called sequentially instead of asynchronously #68036

Closed jerryk414 closed 1 year ago

jerryk414 commented 2 years ago

Background and motivation

As it is right now, when an IHost is asked to StopAsync for graceful shutdown, each IHostedService is requested to StopAsync in sequential order. There's no reason why each IHostedService.StopAsync task can't be executed asynchronously and awaited altogether.

Why is this a problem? Well technically everything is functioning to spec, but in practice, it's problematic.

So normally the idea of being able to set the ShutdownTimeout is great.. we can set it to say, 30 seconds, and say we expect all IHostedService instances to take no longer than 30 seconds to shutdown gracefully. The problem is that, this is an additive timeout. That is to say, if you have 10 IHostedService instances, and each one of them takes 20 seconds to shutdown, your graceful shutdown now takes 200 seconds. Why do that when you can just have everything shutdown asynchronously in 20 seconds?

One example of the problem with this is that, when deploying apps to ECS in AWS, when ECS triggers an instance of an app to shutdown (which occurs under normal circumstances such as when scaling down), the maximum amount of time it gives for the app to gracefully shutdown is 120 seconds (see stopTimeout), after that it just kills the process.

There's no reason that i can see that the IHostedService instances should be shutdown sequentially, they are async for a reason, and if shutdown order is important, than it should be managed by the consumer themselves, or at the very least, there should be an option for asynchronous IHostedService shutdown.

Regression?

This is not a regression

Analysis

Link to code causing the issue: https://github.com/dotnet/runtime/blob/main/src/libraries/Microsoft.Extensions.Hosting/src/Internal/Host.cs#L126

This code could easily be modified to something similar to:

HashSet<Task> tasks = new HashSet<Task>();

foreach (IHostedService hostedService in _hostedServices.Reverse()) /* Why does this need to be reversed anyways? */
{
   tasks.Add(hostedService.StopAsync(token));
}

// WhenAll would be great here, but it will swallow up exceptions.. so this is the next best thing
foreach (Task task in tasks)
{
   try
   {
      await task.ConfigureAwait(false);
   }
   catch (Exception ex)
   {
      exceptions.Add(ex);
   }
}

API Proposal

    public class HostOptions
    {
+        public HostedServiceStopBehavior HostedServiceStopBehavior { get; set; } =
+            HostedServiceStopBehavior.Asynchronous;

        public BackgroundServiceExceptionBehavior BackgroundServiceExceptionBehavior { get; set; } =
            BackgroundServiceExceptionBehavior.StopHost;
    }

+    /// <summary>
+    /// Specifies a behavior that the <see cref="IHost"/> will honor when stopping <see cref="IHostedService"/> instances.
+    /// </summary>
+    public enum HostedServiceStopBehavior
+    {
+        /// <summary>
+        /// Indicates that the host will stop the <see cref="IHostedService"/> instances asynchronously.
+        /// </summary>
+        Asynchronous = 0,
+
+        /// <summary>
+        /// Indicates that the host will wait for the previous <see cref="IHostedService"/> instance to stop before stopping the next instance.
+        /// </summary>
+        Sequential = 1
+    }

API Usage

   public static IHostBuilder CreateWebHostBuilder(string[] args)
   {
      return Host.CreateDefaultBuilder(args)
         .ConfigureServices((context, services) =>
         {
            services.Configure<HostOptions>(options => 
            {
               options.BackgroundServiceStopBehavior = BackgroundServiceStopBehavior.Sequential; // Defaults to asynchronous
            }
         });
   }

Risks

The default behavior will technically be changing, which would be a breaking change. However, it may be a "safe" breaking change in the sense that the likelihood of someone building an app which depends on this previously internal behavior is highly unlikely and IF they did happen to, there is a way to revert back to the old behavior.

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-extensions-hosting See info in area-owners.md if you want to be subscribed.

Issue Details
### Description As it is right now, when an `IHost` is asked to `StopAsync` for graceful shutdown, each `IHostedService` is requested to `StopAsync` in sequential order. There's no reason why each `IHostedService.StopAsync` task can't be executed asynchronously and awaited altogether. Why is this a problem? Well technically everything is functioning to spec, but in practice, it's problematic. So normally the idea of being able to set the `ShutdownTimeout` is great.. we can set it to say, 30 seconds, and say we expect all `IHostedService` instances to take no longer than 30 seconds to shutdown gracefully. The problem is that, this is an additive timeout. That is to say, if you have 10 `IHostedService` instances, and each one of them takes 20 seconds to shutdown, your graceful shutdown now takes 200 seconds. Why do that when you can just have everything shutdown asynchronously in 20 seconds? The real problem with this is that, when deploying apps to ECS in AWS, for example, when ECS triggers an instance of an app to shutdown, the maximum amount of time it gives for the app to gracefully shutdown is 120 seconds (see [`stopTimeout`](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html))e, after that it just kills the process. There's no reason that i can see that the IHostedService instances should be shutdown sequentially, they are async for a reason, and if shutdown order is important, than it should be managed by the consumer themselves, or at the very least, there should be an option for asynchronous IHostedService shutdown. ### Configuration ### Regression? I don't think it is ### Data ### Analysis Link to code causing the issue: https://github.com/dotnet/runtime/blob/main/src/libraries/Microsoft.Extensions.Hosting/src/Internal/Host.cs#L126 ```csharp HashSet tasks = new HashSet(); foreach (IHostedService hostedService in _hostedServices.Reverse()) /* Why does this need to be reversed anyways? */ { tasks.Add(hostedService.StopAsync(token)); } // WhenAll would be great here, but it will swallow up exceptions.. so this is the next best thing foreach (Task task in tasks) { try { await task.ConfigureAwait(false); } catch (Exception ex) { exceptions.Add(ex); } } ```
Author: jerryk414
Assignees: -
Labels: `tenet-performance`, `untriaged`, `area-Extensions-Hosting`
Milestone: -
eerhardt commented 2 years ago

cc @davidfowl - we were just discussing this a couple weeks ago.

davidfowl commented 2 years ago

Agree we should do this and list it as a breaking change (as stop async will run concurrently now). I don't know if we need a flag to opt out of this behavior though.

davidfowl commented 2 years ago

@jerryk414 would you be willing to send a PR for this change?

quixoticaxis commented 2 years ago

Agree we should do this and list it as a breaking change (as stop async will run concurrently now). I don't know if we need a flag to opt out of this behavior though.

Any knob to keep the old "started first — stopped last" would be appreciated. We use it to stop stateful in-process services which have cross dependencies.

jerryk414 commented 2 years ago

I can do that. I'll get an initial implementation done this week to see what thoughts are once the changes are visible.

jerryk414 commented 2 years ago

@davidfowl @quixoticaxis I've added an initial PR for this to make the default behavior asynchronous, with an option for specifying the behavior to be sequential. It would still be a breaking change, but would add the ability to change the behavior back to the way it was.

There are no tests yet - I want to get input if this is the route you would prefer prior to spending time on that.

quixoticaxis commented 2 years ago

@davidfowl @quixoticaxis I've added an initial PR for this to make the default behavior asynchronous, with an option for specifying the behavior to be sequential. It would still be a breaking change, but would add the ability to change the behavior back to the way it was.

There are no tests yet - I want to get input if this is the route you would prefer prior to spending time on that.

For our product it would be easy to switch back to, looks great.

jerryk414 commented 2 years ago

@davidfowl @quixoticaxis I've added an initial PR for this to make the default behavior asynchronous, with an option for specifying the behavior to be sequential. It would still be a breaking change, but would add the ability to change the behavior back to the way it was.

There are no tests yet - I want to get input if this is the route you would prefer prior to spending time on that.

@davidfowl @quixoticaxis The PR for this is ready for review.

eerhardt commented 2 years ago

I've updated the API proposal above and marked this proposal as ready for review. We will need to get the API proposal approved before we can make this change.

jerryk414 commented 2 years ago

I've updated the API proposal above and marked this proposal as ready for review. We will need to get the API proposal approved before we can make this change.

Thank you @eerhardt!

terrajobst commented 2 years ago

Video

namespace Microsoft.Extensions.Hosting;

public partial class HostOptions
{
    public bool ServicesStartConcurrently { get; set; } // = false; Maybe change in 8?
    public bool ServicesStopConcurrently { get; set; } // = false; Maybe change in 8?
}
quixoticaxis commented 2 years ago
- Does this imply that this can breaking because people have interdependencies between services?

It would probably not only break dependencies, but also mess up the exception handling around StartAsync.

jerryk414 commented 2 years ago

@terrajobst I'm presuming that based on the comment, you would want the default to be false for the time being.

I like the idea of just making it a bool since there really should only ever be two behaviors. I can work on getting this change together and updating the initial PR changes to account for this.

Just to answer this question more specifically... when hosting apps in AWS ECS, you have an absolute maximum shutdown time of 120 seconds - even if you set that higher in the host itself via ShutdownTimeout. We were running into it in a real-life production scenario where shutdown was taking longer than that. We fixed it by basically just announcing to all services that a shutdown is incoming and have them all start prepping to shutdown concurrently (if one starts shutting down, we can presume they all will be soon).

From the description in the issue: One example of the problem with this is that, when deploying apps to ECS in AWS, when ECS triggers an instance of an app to shutdown (which occurs under normal circumstances such as when scaling down), the maximum amount of time it gives for the app to gracefully shutdown is 120 seconds (see [stopTimeout](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html)), after that it just kills the process.

@quixoticaxis I do see what you're saying - as it is right now StartAsync doesn't do anything to manage multiple exceptions - it will just stop on the first failure and return that.

I can update it to behave the same as StopAsync though where it will aggregate all exceptions and just throw an AggregateException when there is more than one failure with some generic message One or more hosted services failed to start.

Basically copy what is being done here: https://github.com/dotnet/runtime/blob/main/src/libraries/Microsoft.Extensions.Hosting/src/Internal/Host.cs#L156

quixoticaxis commented 2 years ago

@jerryk414 I personally am not certain AggregateException would be a desired scenario, because in case some services failing and others starting you'd have to wait for all of the StartAsync either completing or faulting. Maybe throwing on the first exception and cancelling all other services (either currently starting or already started) is a better way.

Starting seems to be fundamentally different from stopping, IMHO.

jerryk414 commented 2 years ago

@quixoticaxis

I would disagree with that for two reasons.

1) In order to guarantee stopping on the first exception, you need to run things synchronously, which defeats the purpose. In addition, trying to cancel tasks or ignoring them and immediately returning on an exception sounds very misleading. 2) The aggregate exception would only occur on new behavior - you'd have to opt into it. I would also feel secure saying throwing an aggregate exception in a process that spawns many other tasks is not necessarily unexpected behavior.

Another fun point, assuming StartAsync is awaited using await, a bit of fancy magic happens that will cause the first exception in the AggregateException to be what is actually thrown when the task returns. So you as a consumer, you will effectively get the first exception... Unless you do something like a ContinueWith and inspect the actual exception.

quixoticaxis commented 2 years ago

@jerryk414

1) In order to guarantee stopping on the first exception, you need to run things synchronously, which defeats the purpose. In addition, trying to cancel tasks or ignoring them and immediately returning on an exception sounds very misleading.

No, I believe I don't, but, sorry, I'll post the code snippet tomorrow.

Your seconds point seems valid to me though.

jerryk414 commented 2 years ago

@jerryk414

  1. In order to guarantee stopping on the first exception, you need to run things synchronously, which defeats the purpose. In addition, trying to cancel tasks or ignoring them and immediately returning on an exception sounds very misleading.

No, I believe I don't, but, sorry, I'll post the code snippet tomorrow.

Your seconds point seems valid to me though.

Yeah maybe I'm just unaware, but I'm not aware of any way to guarantee stopping on a first error without running each startup one by one. I don't have anything in front of me right now, but is there even a cancelation token that's passed into each start async? Even if there is, I doubt many people check it.

And assuming they don't check it, you could end up in a situation where either you're killing threads with these tasks, which could lead to corrupted states, or you could end up just swallowing other real exceptions, which is also bad.

quixoticaxis commented 2 years ago

@jerryk414 , if someone starts the services concurrently, I suppose it's only fair to assume that their starting logic may be running for prolonged period of time, so there's a fair chance that they respect cancellation requests (at the very least, we do in our stateful services that pre-load data in StartAsync). If starting phase is fast, than what's the point of doing it in parallel?

Considering your point about AggregateException being an expected behavior I suppose the start may look like something along the lines (throws AggregateException and handles cancellation in case of long-running starting logic):

using var stopper = CancellationTokenSource.CreateLinkedTokenSource(originalToken);

await Task.WhenAll(
    services
        .Select(
            service => service.StartAsync(stopper.Token))
        .Select(
            task => task.ContinueWith(original =>
            {
                if (original.IsFaulted)
                {
                    try
                    {
                        stopper.Cancel();
                    }
                    catch (AggregateException)
                    {
                        // probably report cancellation callback exceptions
                    }

                    var dispatcher = ExceptionDispatchInfo.Capture(
                        original.Exception!);
                    dispatcher.Throw();
                }
            })));

Anyway, the longer I think about it, the more I'm inclined to assume that concurrent start is much more tricky than concurrent stop.

jerryk414 commented 2 years ago

@jerryk414 , if someone starts the services concurrently, I suppose it's only fair to assume that their starting logic may be running for prolonged period of time, so there's a fair chance that they respect cancellation requests (at the very least, we do in our stateful services that pre-load data in StartAsync). If starting phase is fast, than what's the point of doing it in parallel?

Considering your point about AggregateException being an expected behavior I suppose the start may look like something along the lines (throws AggregateException and handles cancellation in case of long-running starting logic):

using var stopper = CancellationTokenSource.CreateLinkedTokenSource(originalToken);

await Task.WhenAll(
    services
        .Select(
            service => service.StartAsync(stopper.Token))
        .Select(
            task => task.ContinueWith(original =>
            {
                if (original.IsFaulted)
                {
                    try
                    {
                        stopper.Cancel();
                    }
                    catch (AggregateException)
                    {
                        // probably report cancellation callback exceptions
                    }

                    var dispatcher = ExceptionDispatchInfo.Capture(
                        original.Exception!);
                    dispatcher.Throw();
                }
            })));

Anyway, the longer I think about it, the more I'm inclined to assume that concurrent start is much more tricky than concurrent stop.

Yeah, I'm just thinking it doesn't have to be that complicated. With it being a new feature, it could just be accepted as being the way it is. If you have a problem with it and have interdependencies that could cause you to need to stop a start async method if one fails first or require order, you could revert back to the old way, or add your own inter-service communication.

I personally would argue the ROI isn't there for the extra complexity of ensuring an immediate return on first exceptions.

quixoticaxis commented 2 years ago

@jerryk414 my point was that I doubt that concurrent start should be implemented in the first place: too many different usage scenarios.

buyaa-n commented 1 year ago

@jerryk414 are you still planning to work on it?

jerryk414 commented 1 year ago

@jerryk414 are you still planning to work on it?

I am - I had a PR up but honestly, it's funny. I'm doing this on my own time - i've got kids, family, and a fulltime job, and it's the holidays, so i just didn't get around to addressing all comments in the PR, and my personal computer took a crap on me, so i've got to get this all setup on another machine. It certainly would be nice if someone else who is paid to do this took ownership and addressed the very minor issues with the PR below.

https://github.com/dotnet/runtime/pull/75894

If nobody does, I do have plans to get back around to it, but not until the new year.

buyaa-n commented 1 year ago

I am - I had a PR up but honestly, it's funny. I'm doing this on my own time - i've got kids, family, and a fulltime job, and it's the holidays, so i just didn't get around to addressing all comments in the PR, and my personal computer took a crap on me, so i've got to get this all setup on another machine. It certainly would be nice if someone else who is paid to do this took ownership and addressed the very minor issues with the PR below.

Sure understand, no rush or any pressure, this is open source and it's up to you if you work on it or when you work. Though we would like to assign the issue if somebody already started working on it, so that other contributes would not start working on it simultaneously, if nobody working on it, we would add help wanted label so that other contributors could find it easily. I just saw you put up a PR and checking if you still planning to finish that PR.

If nobody does, I do have plans to get back around to it, but not until the new year.

Thank you, for now, nobody has planned to work on it, sounds like we can add help wanted label

pedrobsaila commented 1 year ago

Hey @buyaa-n I want to give it a try, you can assign it to me