Closed jerryk414 closed 1 year ago
Tagging subscribers to this area: @dotnet/area-extensions-hosting See info in area-owners.md if you want to be subscribed.
Author: | jerryk414 |
---|---|
Assignees: | - |
Labels: | `tenet-performance`, `untriaged`, `area-Extensions-Hosting` |
Milestone: | - |
cc @davidfowl - we were just discussing this a couple weeks ago.
Agree we should do this and list it as a breaking change (as stop async will run concurrently now). I don't know if we need a flag to opt out of this behavior though.
@jerryk414 would you be willing to send a PR for this change?
Agree we should do this and list it as a breaking change (as stop async will run concurrently now). I don't know if we need a flag to opt out of this behavior though.
Any knob to keep the old "started first — stopped last" would be appreciated. We use it to stop stateful in-process services which have cross dependencies.
I can do that. I'll get an initial implementation done this week to see what thoughts are once the changes are visible.
@davidfowl @quixoticaxis I've added an initial PR for this to make the default behavior asynchronous, with an option for specifying the behavior to be sequential. It would still be a breaking change, but would add the ability to change the behavior back to the way it was.
There are no tests yet - I want to get input if this is the route you would prefer prior to spending time on that.
@davidfowl @quixoticaxis I've added an initial PR for this to make the default behavior asynchronous, with an option for specifying the behavior to be sequential. It would still be a breaking change, but would add the ability to change the behavior back to the way it was.
There are no tests yet - I want to get input if this is the route you would prefer prior to spending time on that.
For our product it would be easy to switch back to, looks great.
@davidfowl @quixoticaxis I've added an initial PR for this to make the default behavior asynchronous, with an option for specifying the behavior to be sequential. It would still be a breaking change, but would add the ability to change the behavior back to the way it was.
There are no tests yet - I want to get input if this is the route you would prefer prior to spending time on that.
@davidfowl @quixoticaxis The PR for this is ready for review.
I've updated the API proposal above and marked this proposal as ready for review. We will need to get the API proposal approved before we can make this change.
I've updated the API proposal above and marked this proposal as ready for review. We will need to get the API proposal approved before we can make this change.
Thank you @eerhardt!
Concurrent
over Asynchronous
.namespace Microsoft.Extensions.Hosting;
public partial class HostOptions
{
public bool ServicesStartConcurrently { get; set; } // = false; Maybe change in 8?
public bool ServicesStopConcurrently { get; set; } // = false; Maybe change in 8?
}
- Does this imply that this can breaking because people have interdependencies between services?
It would probably not only break dependencies, but also mess up the exception handling around StartAsync
.
@terrajobst I'm presuming that based on the comment, you would want the default to be false for the time being.
I like the idea of just making it a bool since there really should only ever be two behaviors. I can work on getting this change together and updating the initial PR changes to account for this.
Just to answer this question more specifically... when hosting apps in AWS ECS, you have an absolute maximum shutdown time of 120 seconds - even if you set that higher in the host itself via ShutdownTimeout
. We were running into it in a real-life production scenario where shutdown was taking longer than that. We fixed it by basically just announcing to all services that a shutdown is incoming and have them all start prepping to shutdown concurrently (if one starts shutting down, we can presume they all will be soon).
From the description in the issue:
One example of the problem with this is that, when deploying apps to ECS in AWS, when ECS triggers an instance of an app to shutdown (which occurs under normal circumstances such as when scaling down), the maximum amount of time it gives for the app to gracefully shutdown is 120 seconds (see [stopTimeout](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html)), after that it just kills the process.
@quixoticaxis
I do see what you're saying - as it is right now StartAsync
doesn't do anything to manage multiple exceptions - it will just stop on the first failure and return that.
I can update it to behave the same as StopAsync
though where it will aggregate all exceptions and just throw an AggregateException
when there is more than one failure with some generic message One or more hosted services failed to start
.
Basically copy what is being done here: https://github.com/dotnet/runtime/blob/main/src/libraries/Microsoft.Extensions.Hosting/src/Internal/Host.cs#L156
@jerryk414 I personally am not certain AggregateException
would be a desired scenario, because in case some services failing and others starting you'd have to wait for all of the StartAsync
either completing or faulting. Maybe throwing on the first exception and cancelling all other services (either currently starting or already started) is a better way.
Starting seems to be fundamentally different from stopping, IMHO.
@quixoticaxis
I would disagree with that for two reasons.
1) In order to guarantee stopping on the first exception, you need to run things synchronously, which defeats the purpose. In addition, trying to cancel tasks or ignoring them and immediately returning on an exception sounds very misleading. 2) The aggregate exception would only occur on new behavior - you'd have to opt into it. I would also feel secure saying throwing an aggregate exception in a process that spawns many other tasks is not necessarily unexpected behavior.
Another fun point, assuming StartAsync is awaited using await
, a bit of fancy magic happens that will cause the first exception in the AggregateException
to be what is actually thrown when the task returns. So you as a consumer, you will effectively get the first exception... Unless you do something like a ContinueWith
and inspect the actual exception.
@jerryk414
1) In order to guarantee stopping on the first exception, you need to run things synchronously, which defeats the purpose. In addition, trying to cancel tasks or ignoring them and immediately returning on an exception sounds very misleading.
No, I believe I don't, but, sorry, I'll post the code snippet tomorrow.
Your seconds point seems valid to me though.
@jerryk414
- In order to guarantee stopping on the first exception, you need to run things synchronously, which defeats the purpose. In addition, trying to cancel tasks or ignoring them and immediately returning on an exception sounds very misleading.
No, I believe I don't, but, sorry, I'll post the code snippet tomorrow.
Your seconds point seems valid to me though.
Yeah maybe I'm just unaware, but I'm not aware of any way to guarantee stopping on a first error without running each startup one by one. I don't have anything in front of me right now, but is there even a cancelation token that's passed into each start async? Even if there is, I doubt many people check it.
And assuming they don't check it, you could end up in a situation where either you're killing threads with these tasks, which could lead to corrupted states, or you could end up just swallowing other real exceptions, which is also bad.
@jerryk414 , if someone starts the services concurrently, I suppose it's only fair to assume that their starting logic may be running for prolonged period of time, so there's a fair chance that they respect cancellation requests (at the very least, we do in our stateful services that pre-load data in StartAsync
). If starting phase is fast, than what's the point of doing it in parallel?
Considering your point about AggregateException
being an expected behavior I suppose the start may look like something along the lines (throws AggregateException
and handles cancellation in case of long-running starting logic):
using var stopper = CancellationTokenSource.CreateLinkedTokenSource(originalToken);
await Task.WhenAll(
services
.Select(
service => service.StartAsync(stopper.Token))
.Select(
task => task.ContinueWith(original =>
{
if (original.IsFaulted)
{
try
{
stopper.Cancel();
}
catch (AggregateException)
{
// probably report cancellation callback exceptions
}
var dispatcher = ExceptionDispatchInfo.Capture(
original.Exception!);
dispatcher.Throw();
}
})));
Anyway, the longer I think about it, the more I'm inclined to assume that concurrent start is much more tricky than concurrent stop.
@jerryk414 , if someone starts the services concurrently, I suppose it's only fair to assume that their starting logic may be running for prolonged period of time, so there's a fair chance that they respect cancellation requests (at the very least, we do in our stateful services that pre-load data in
StartAsync
). If starting phase is fast, than what's the point of doing it in parallel?Considering your point about
AggregateException
being an expected behavior I suppose the start may look like something along the lines (throwsAggregateException
and handles cancellation in case of long-running starting logic):using var stopper = CancellationTokenSource.CreateLinkedTokenSource(originalToken); await Task.WhenAll( services .Select( service => service.StartAsync(stopper.Token)) .Select( task => task.ContinueWith(original => { if (original.IsFaulted) { try { stopper.Cancel(); } catch (AggregateException) { // probably report cancellation callback exceptions } var dispatcher = ExceptionDispatchInfo.Capture( original.Exception!); dispatcher.Throw(); } })));
Anyway, the longer I think about it, the more I'm inclined to assume that concurrent start is much more tricky than concurrent stop.
Yeah, I'm just thinking it doesn't have to be that complicated. With it being a new feature, it could just be accepted as being the way it is. If you have a problem with it and have interdependencies that could cause you to need to stop a start async method if one fails first or require order, you could revert back to the old way, or add your own inter-service communication.
I personally would argue the ROI isn't there for the extra complexity of ensuring an immediate return on first exceptions.
@jerryk414 my point was that I doubt that concurrent start should be implemented in the first place: too many different usage scenarios.
@jerryk414 are you still planning to work on it?
@jerryk414 are you still planning to work on it?
I am - I had a PR up but honestly, it's funny. I'm doing this on my own time - i've got kids, family, and a fulltime job, and it's the holidays, so i just didn't get around to addressing all comments in the PR, and my personal computer took a crap on me, so i've got to get this all setup on another machine. It certainly would be nice if someone else who is paid to do this took ownership and addressed the very minor issues with the PR below.
https://github.com/dotnet/runtime/pull/75894
If nobody does, I do have plans to get back around to it, but not until the new year.
I am - I had a PR up but honestly, it's funny. I'm doing this on my own time - i've got kids, family, and a fulltime job, and it's the holidays, so i just didn't get around to addressing all comments in the PR, and my personal computer took a crap on me, so i've got to get this all setup on another machine. It certainly would be nice if someone else who is paid to do this took ownership and addressed the very minor issues with the PR below.
Sure understand, no rush or any pressure, this is open source and it's up to you if you work on it or when you work. Though we would like to assign the issue if somebody already started working on it, so that other contributes would not start working on it simultaneously, if nobody working on it, we would add help wanted
label so that other contributors could find it easily. I just saw you put up a PR and checking if you still planning to finish that PR.
If nobody does, I do have plans to get back around to it, but not until the new year.
Thank you, for now, nobody has planned to work on it, sounds like we can add help wanted
label
Hey @buyaa-n I want to give it a try, you can assign it to me
Background and motivation
As it is right now, when an
IHost
is asked toStopAsync
for graceful shutdown, eachIHostedService
is requested toStopAsync
in sequential order. There's no reason why eachIHostedService.StopAsync
task can't be executed asynchronously and awaited altogether.Why is this a problem? Well technically everything is functioning to spec, but in practice, it's problematic.
So normally the idea of being able to set the
ShutdownTimeout
is great.. we can set it to say, 30 seconds, and say we expect allIHostedService
instances to take no longer than 30 seconds to shutdown gracefully. The problem is that, this is an additive timeout. That is to say, if you have 10IHostedService
instances, and each one of them takes 20 seconds to shutdown, your graceful shutdown now takes 200 seconds. Why do that when you can just have everything shutdown asynchronously in 20 seconds?One example of the problem with this is that, when deploying apps to ECS in AWS, when ECS triggers an instance of an app to shutdown (which occurs under normal circumstances such as when scaling down), the maximum amount of time it gives for the app to gracefully shutdown is 120 seconds (see
stopTimeout
), after that it just kills the process.There's no reason that i can see that the IHostedService instances should be shutdown sequentially, they are async for a reason, and if shutdown order is important, than it should be managed by the consumer themselves, or at the very least, there should be an option for asynchronous IHostedService shutdown.
Regression?
This is not a regression
Analysis
Link to code causing the issue: https://github.com/dotnet/runtime/blob/main/src/libraries/Microsoft.Extensions.Hosting/src/Internal/Host.cs#L126
This code could easily be modified to something similar to:
API Proposal
API Usage
Risks
The default behavior will technically be changing, which would be a breaking change. However, it may be a "safe" breaking change in the sense that the likelihood of someone building an app which depends on this previously internal behavior is highly unlikely and IF they did happen to, there is a way to revert back to the old behavior.