Could you please share with us how you use Polly?
From the stack trace we can only see that you are using it with HttpClient.
@peter-csala thanks - see below:
public static void AddResilientVimeoClient(this IServiceCollection serviceCollection) =>
serviceCollection.AddHttpClient<IVimeoClient, VimeoClient>(client =>
{
client.BaseAddress = new Uri("https://api.vimeo.com");
client.DefaultRequestHeaders.Add("Accept", "application/vnd.vimeo.*+json;version=3.4");
})
.AddResilienceHandler(nameof(VimeoClient), (pipelineBuilder, context) =>
{
var logger = context.ServiceProvider.GetRequiredService<ILogger<IVimeoClient>>();
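// Each AddRetry call below adds a separate, chained retry strategy to this pipeline (the first one added is the outermost).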
pipelineBuilder
.AddRetry(new RetryStrategyOptions<HttpResponseMessage>
{
MaxRetryAttempts = 5,
Delay = TimeSpan.FromSeconds(1),
BackoffType = DelayBackoffType.Exponential,
UseJitter = true,
ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
.Handle<HttpRequestException>(exception =>
exception.Message.StartsWith("Resource temporarily unavailable")),
OnRetry = arguments =>
{
logger.LogWarning(
arguments.Outcome.Exception,
"An HttpRequestException was encountered, delaying for {Delay}ms before making retry attempt {RetryCount}",
arguments.RetryDelay.TotalMilliseconds,
arguments.AttemptNumber);
return ValueTask.CompletedTask;
}
})
.AddRetry(new HttpRetryStrategyOptions
{
MaxRetryAttempts = 5,
Delay = TimeSpan.FromSeconds(1),
BackoffType = DelayBackoffType.Exponential,
UseJitter = true,
ShouldHandle = async args =>
{
if (args.Outcome.Result is null)
{
return false;
}
var httpResponseMessage = args.Outcome.Result;
var responseBody = await httpResponseMessage.Content.ReadAsStringAsync();
return httpResponseMessage.StatusCode == HttpStatusCode.InternalServerError &&
(responseBody.Contains("Whoops!", StringComparison.OrdinalIgnoreCase) ||
responseBody.Contains("Please try again", StringComparison.OrdinalIgnoreCase));
},
OnRetry = arguments =>
{
logger.LogWarning(
arguments.Outcome.Exception,
"A potentially retryable 500 Internal Server Error was encountered, delaying for {Delay}ms before making retry attempt {RetryCount}",
arguments.RetryDelay.TotalMilliseconds,
arguments.AttemptNumber);
return ValueTask.CompletedTask;
}
})
.AddRetry(new HttpRetryStrategyOptions
{
MaxRetryAttempts = 5,
Delay = TimeSpan.FromSeconds(1),
BackoffType = DelayBackoffType.Exponential,
UseJitter = true,
ShouldHandle = async args =>
{
if (args.Outcome.Result is null)
{
return false;
}
var httpResponseMessage = args.Outcome.Result;
var responseBody = await httpResponseMessage.Content.ReadAsStringAsync();
return httpResponseMessage.StatusCode == HttpStatusCode.ServiceUnavailable &&
(responseBody.Contains("fastly", StringComparison.OrdinalIgnoreCase) ||
responseBody.Contains("Please try again soon", StringComparison.OrdinalIgnoreCase));
},
OnRetry = arguments =>
{
logger.LogWarning(
arguments.Outcome.Exception,
"A 503 Service Unavailable was encountered, delaying for {Delay}ms before making retry attempt {RetryCount}",
arguments.RetryDelay.TotalMilliseconds,
arguments.AttemptNumber);
return ValueTask.CompletedTask;
}
})
.AddRetry(new HttpRetryStrategyOptions
{
MaxRetryAttempts = 5,
Delay = TimeSpan.FromSeconds(1),
BackoffType = DelayBackoffType.Exponential,
UseJitter = true,
ShouldHandle = args =>
{
if (args.Outcome.Result is null)
{
return ValueTask.FromResult(false);
}
var httpResponseMessage = args.Outcome.Result;
var isTextTrackEndpoint = httpResponseMessage
.RequestMessage?
.RequestUri?
.AbsoluteUri
.Contains("texttracks", StringComparison.OrdinalIgnoreCase) ?? false;
return ValueTask.FromResult(httpResponseMessage.StatusCode == HttpStatusCode.NotFound &&
isTextTrackEndpoint);
},
OnRetry = arguments =>
{
logger.LogWarning(
arguments.Outcome.Exception,
"A 404 Not Found on texttracks endpoint was encountered, delaying for {Delay}ms before making retry attempt {RetryCount}",
arguments.RetryDelay.TotalMilliseconds,
arguments.AttemptNumber);
return ValueTask.CompletedTask;
}
})
.AddRetry(new HttpRetryStrategyOptions
{
MaxRetryAttempts = 5,
Delay = TimeSpan.FromSeconds(1),
BackoffType = DelayBackoffType.Exponential,
UseJitter = true,
ShouldHandle = async args =>
{
if (args.Outcome.Result is null)
{
return false;
}
var httpResponseMessage = args.Outcome.Result;
var responseBody = await httpResponseMessage.Content.ReadAsStringAsync();
return httpResponseMessage.StatusCode == HttpStatusCode.BadGateway &&
responseBody.Contains("server encountered a temporary error",
StringComparison.OrdinalIgnoreCase);
},
OnRetry = arguments =>
{
logger.LogWarning(
arguments.Outcome.Exception,
"A 502 Bad Gateway was encountered, delaying for {Delay}ms before making retry attempt {RetryCount}",
arguments.RetryDelay.TotalMilliseconds,
arguments.AttemptNumber);
return ValueTask.CompletedTask;
}
})
;
});
@alexrosenfeld10 Thanks for sharing the code fragment. Just to confirm: is my understanding correct that these strategies can retry the request almost indefinitely because they are chained together?
For the sake of simplicity, let's assume that you have only two retry strategies.
Let me show you what can happen via a sequence diagram:
sequenceDiagram
autonumber
actor C as Caller
participant P as Pipeline
participant RO as Retry Outer
participant RI as Retry Inner
participant HC as HttpClient
C->>P: Calls ExecuteAsync
P->>RO: Calls ExecuteCore
Note over RO: Counter is 0
RO->>RI: Calls ExecuteCore
Note over RI: Counter is 0
RI->>HC: Calls SendAsync
HC->>RI: Returns Failure X
RI->>RI: Delays
Note over RI: Counter is 1
RI->>HC: Calls SendAsync
HC->>RI: Returns Failure X
RI->>RI: Delays
Note over RI: Counter is 2
RI->>HC: Calls SendAsync
HC->>RI: Returns Failure Y
RI->>RO: Returns Failure Y
RO->>RO: Delays
Note over RO: Counter is 1
RO->>RI: Calls ExecuteCore
Note over RI: Counter is 0
RI->>HC: Calls SendAsync
HC->>RI: Returns Failure X
RI->>RI: Delays
Note over RI: Counter is 1
Note over RI: ...
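To put a rough number on it (back-of-the-envelope math, assuming the worst case where every failure keeps matching some strategy): because the strategies are nested, the attempt counts multiply rather than add. Even with just the two strategies in the diagram, 5 retries each means up to 6 × 6 = 36 SendAsync calls for one logical request; with your five chained strategies it is up to 6 × 6 × 6 × 6 × 6 = 7,776.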
A better solution would be to combine all these strategies into a single one. Do you need my assistance there?
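For illustration, a single combined strategy could look roughly like this (an untested sketch that folds together the conditions from your snippet and reuses the same logger variable from your AddResilienceHandler callback):

pipelineBuilder.AddRetry(new HttpRetryStrategyOptions
{
    MaxRetryAttempts = 5,
    Delay = TimeSpan.FromSeconds(1),
    BackoffType = DelayBackoffType.Exponential,
    UseJitter = true,
    ShouldHandle = async args =>
    {
        // Socket-level failures surfaced as HttpRequestException
        if (args.Outcome.Exception is HttpRequestException httpRequestException &&
            httpRequestException.Message.StartsWith("Resource temporarily unavailable"))
        {
            return true;
        }

        if (args.Outcome.Result is null)
        {
            return false;
        }

        var response = args.Outcome.Result;

        // 404 on the texttracks endpoint is retried without inspecting the body
        if (response.StatusCode == HttpStatusCode.NotFound)
        {
            return response.RequestMessage?.RequestUri?.AbsoluteUri
                .Contains("texttracks", StringComparison.OrdinalIgnoreCase) ?? false;
        }

        // The remaining cases all inspect the response body
        var responseBody = await response.Content.ReadAsStringAsync();
        return response.StatusCode switch
        {
            HttpStatusCode.InternalServerError =>
                responseBody.Contains("Whoops!", StringComparison.OrdinalIgnoreCase) ||
                responseBody.Contains("Please try again", StringComparison.OrdinalIgnoreCase),
            HttpStatusCode.ServiceUnavailable =>
                responseBody.Contains("fastly", StringComparison.OrdinalIgnoreCase) ||
                responseBody.Contains("Please try again soon", StringComparison.OrdinalIgnoreCase),
            HttpStatusCode.BadGateway =>
                responseBody.Contains("server encountered a temporary error", StringComparison.OrdinalIgnoreCase),
            _ => false
        };
    },
    OnRetry = arguments =>
    {
        logger.LogWarning(
            arguments.Outcome.Exception,
            "Retryable failure encountered, delaying for {Delay}ms before making retry attempt {RetryCount}",
            arguments.RetryDelay.TotalMilliseconds,
            arguments.AttemptNumber);
        return ValueTask.CompletedTask;
    }
});

With a single strategy the retry counter is shared, so MaxRetryAttempts = 5 really means at most 5 retries per request in total.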
@peter-csala, thanks for your thoughtful reply. Your understanding is correct.
I understand what you're saying, but it's very unlikely that failure X would happen 0..n times and then, right after, failure Y would happen another 0..n times. I'm not sure I'm really worried about this case.
Would this be related to the pipeline being disposed and not usable anymore? I'm not sure how / why it would be.
Well, without sufficient context it is pretty hard to tell what is causing the problem.
So, let me walk through the stack trace (in reverse order) and see what we have:
System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
An HTTP request is about to be sent
Microsoft.Extensions.Http.Logging.LoggingScopeHttpMessageHandler.<SendCoreAsync>g__Core|5_0(HttpRequestMessage request, Boolean useAsync, CancellationToken cancellationToken)
A DelegatingHandler is invoked, which is responsible for request logging
Microsoft.Extensions.Http.Resilience.Internal.ResilienceHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
Yet another DelegatingHandler is invoked, which is responsible for invoking the ResiliencePipeline
Polly.ResiliencePipeline.ExecuteOutcomeAsync[TResult,TState](Func`3 callback, ResilienceContext context, TState state)
The pipeline is about to execute in a safe manner (no exception is thrown in case of failure; instead, it is captured in an Outcome object)
Polly.ResiliencePipeline.InitializeAsyncContext[TResult](ResilienceContext context)
It ensures that the pipeline is not disposed before saving some metadata about the execution context
Polly.Utils.Pipeline.ComponentDisposeHelper.EnsureNotDisposed()
It checks whether the pipeline is already disposed or not
The message handlers' disposal process is quite complex; I've recently tried to detail it in this SO topic. Long story short, you can change the default HandlerLifetime (which is 2 minutes) via SetHandlerLifetime.
It might make sense to increase that value, but as you said, you have waited only 10 seconds...
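For example (a sketch against the registration you shared; the 10-minute value is only a placeholder):

serviceCollection.AddHttpClient<IVimeoClient, VimeoClient>(client =>
{
    client.BaseAddress = new Uri("https://api.vimeo.com");
    client.DefaultRequestHeaders.Add("Accept", "application/vnd.vimeo.*+json;version=3.4");
})
// Keep each handler chain alive longer than the default 2 minutes
.SetHandlerLifetime(TimeSpan.FromMinutes(10))
.AddResilienceHandler(nameof(VimeoClient), (pipelineBuilder, context) =>
{
    // ... strategies as above ...
});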
@peter-csala thanks, not sure that'd be the issue; we shut down in 10 seconds. This really feels more like a bug in Polly / the Microsoft HTTP resilience extensions. I'm not sure what other context is needed here; that's my full resilience pipeline config, and it happens at shutdown. What else can I provide?
@alexrosenfeld10 Just to name a few things that we don't know:
How is the VimeoClient implemented
How do you use the VimeoClient
Which endpoint do you hit when you shut down the application
Are you hitting the test or the prod API of Vimeo
Is there any explicit or implicit Dispose call in your code
Is it happening all the time (deterministically) or just occasionally (non-deterministically)
Did it get stuck in a retry loop when the ObjectDisposedException is thrown
What is the success/failure rate of your requests
If you have a single simpler retry logic, do you experience the same
So, to perform triaging it would be great to have a minimal reproducible example or a unit test. Otherwise it is just educated guesses with assumptions about how things can go wrong.
We should take a look at the ComponentDisposeHelper implementation, as it might not correctly implement the DisposeAsync pattern.
According to the description of IAsyncDisposable.DisposeAsync:
If an object's DisposeAsync method is called more than once, the object must ignore all calls after the first one and synchronously return a successfully completed ValueTask. The object must not throw an exception if its DisposeAsync method is called multiple times. Instance methods other than DisposeAsync can throw an ObjectDisposedException when resources are already disposed.
@martintmk could you please comment on the current implementation of ComponentDisposeHelper: why is it necessary to throw ObjectDisposedException if the object is disposed?
Update: I'm wrong. EnsureNotDisposed is a supplementary method, not part of the dispose pattern, and it is called on callback execution.
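For reference, the quoted contract roughly translates to this shape (an illustrative class of my own, not Polly's actual implementation):

public sealed class ExampleResource : IAsyncDisposable
{
    private bool _disposed;

    public ValueTask DoWorkAsync()
    {
        // Instance methods other than DisposeAsync may throw once the object is disposed.
        ObjectDisposedException.ThrowIf(_disposed, this);
        return ValueTask.CompletedTask;
    }

    public ValueTask DisposeAsync()
    {
        // Repeated calls must be ignored and must not throw.
        if (_disposed)
        {
            return ValueTask.CompletedTask;
        }

        _disposed = true;
        // ... release resources here ...
        return ValueTask.CompletedTask;
    }
}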
@peter-csala I mean no disrespect, but how are most of those questions relevant? They have almost nothing to do with the resilience pipeline, polly, or how it's being used and disposed. Perhaps I'm misunderstanding something and they actually do. Regardless, answers below:
How is the VimeoClient implemented
Standard stuff, all methods end up doing what you'd expect (var response = await _httpClient.SendAsync(request);)
Which endpoint do you hit when you shut down the application
I don't understand the question. The application is shut down via Kubernetes sending a TERM signal to the pod, not by an endpoint. If you mean which Vimeo endpoint we call, we don't call a specific one on shutdown; we'd be calling endpoints related to whatever the application is currently doing.
Are you hitting the test or the prod API of Vimeo
There is no reason this would matter or be related to the problem at hand. Regardless, you can find the answer in my original post.
Is there any explicit or implicit Dispose call in your code
No - none related to http client usage, polly, or the resilience pipeline configuration.
Is it happening all the time (deterministically) or just occasionally (non-deterministically)
Occasionally, on shutdown. Sometimes it doesn't happen (I am guessing these cases are when the app shuts down and there happen to be no ongoing http requests from the VimeoClient during shutdown, but I don't know, it's a guess)
Did it get stuck in a retry loop when the ObjectDisposedException is thrown
I don't know how to answer that, all I see is the stacktrace, it's all I am given for debugging.
What is the success/failure rate of your requests
Shouldn't matter? Mostly success with some sporadic failures, which is why I'm using resilience here anyway.
If you have a single simpler retry logic then do you experience the same
Haven't tested this but again, shouldn't matter? If it does, that's clearly a design flaw in polly / resilience.
Yes, of course to some extent there is guesswork here; I don't know the inner workings of the library, which is why I'm seeking out help from maintainers. I can try to create a minimal reproducible example, but it's difficult; these kinds of problems are hard to reproduce and debug, so hopefully some knowledgeable maintainers / authors can spot the issue just from reading and thinking over the code.
@gintsk thanks for taking a look, even if you didn't find the bug, glad to have eyes on this
How the VimeoClient is implemented
From the ObjectDisposedException perspective it can be relevant whether you fetch strategies from the ResiliencePipelineProvider/ResiliencePipelineRegistry inside the typed client. It is called out on the documentation page that the disposal of the registry also disposes the strategies.
How do you use the VimeoClient
From the ObjectDisposedException perspective it can be relevant whether the DI subsystem injects a VimeoClient instance into the consuming class or you create one (or many) via ITypedHttpClientFactory. In the latter case, the lifecycle management of the client is in your hands.
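Roughly what I mean (the class names below are made up, and the registered client name is an assumption):

// (a) DI-injected typed client: the HttpClientFactory infrastructure manages its lifecycle.
public class VideoIngestionService(IVimeoClient vimeoClient)
{
    private readonly IVimeoClient _vimeoClient = vimeoClient;
}

// (b) Manually created typed client: the lifecycle of each instance is in your hands.
public class VimeoClientFactory(
    IHttpClientFactory httpClientFactory,
    ITypedHttpClientFactory<VimeoClient> typedClientFactory)
{
    public VimeoClient Create()
    {
        // The logical name to pass here depends on how the typed client was registered.
        var httpClient = httpClientFactory.CreateClient("IVimeoClient");
        return typedClientFactory.CreateClient(httpClient);
    }
}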
Which endpoint do you hit when you shut down the application
It is important to understand whether the exception is thrown all the time regardless of the Vimeo responses, or whether only specific communication patterns cause the issue. Because you have many different retry strategies, narrowing down the possible root causes is essential during triaging.
Are you hitting the test or the prod API of Vimeo
Integration testing environments usually run on less powerful resources than the production system. They might host an older version or even custom code (for example, via feature flags).
This could result in less reliable responses. You have defined many retry strategies, and they might be exercised more (even during shutdown) in an integration testing environment than in the production one.
What is the success/failure rate of your requests
If your requests are mostly successful then no retry attempts are required. Basically the requests are just passing through the ResiliencePipeline.
If the failure rate is high then there could be many outstanding/pending requests while you try to shut down the application.
Did it get stuck in a retry loop when the ObjectDisposedException is thrown
I don't know how to answer that, all I see is the stacktrace, it's all I am given for debugging.
Because all your retry strategies define an OnRetry delegate where you perform logging, the logs should indicate whether there was a retry loop before the exception was thrown.
Occasionally, on shutdown. Sometimes it doesn't happen (I am guessing these cases are when the app shuts down and there happen to be no ongoing http requests from the VimeoClient during shutdown, but I don't know, it's a guess)
It would be great to know under what circumstances it happens.
Yes, of course to some extent there is guesswork here; I don't know the inner workings of the library, which is why I'm seeking out help from maintainers. I can try to create a minimal reproducible example, but it's difficult; these kinds of problems are hard to reproduce and debug, so hopefully some knowledgeable maintainers / authors can spot the issue just from reading and thinking over the code.
If you did not find my triaging attempt useful then please let me know. I'll step back and let other contributors/maintainers help you.
How do you wait for the 10 seconds? Is it by using IHostApplicationLifetime.ApplicationStopping to do a Task.Delay() or similar?
I would have thought that, if you're doing that, the framework would not dispose of the DI container until it had stopped the app. If the host has disposed of the DI container, then anything currently in flight would have issues resolving services from DI if they are resolved lazily during the processing of an HTTP request.
The AddResilienceHandler() extension isn't part of Polly, but part of the dotnet/extensions repo that in turn builds on top of Polly.
Without something that can readily reproduce the issue to debug, I don't think it's clear whether this is "by design" because the app has shut down and the request is happening "too late", or whether this is a bug in one or more of the components involved.
@martincostello Thanks, and yes, here's what it looks like:
public static IApplicationBuilder ConfigureAppShutdown(this IApplicationBuilder app,
int millisecondsTimeout = 10_000)
{
var lifetime = app.ApplicationServices.GetRequiredService<IHostApplicationLifetime>();
var logger = app.ApplicationServices.GetService<ILogger>();
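// Block the ApplicationStopping callback so shutdown is delayed while in-flight work drains.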
lifetime.ApplicationStopping.Register(() =>
{
logger?.Warning("Stopping event SIGTERM received, waiting for 10 seconds");
Thread.Sleep(millisecondsTimeout);
logger?.Warning("Termination delay complete, continuing stopping process");
});
return app;
}
I realize Thread.Sleep is generally the wrong answer; however, I don't think it matters here (it will block the main thread, which isn't what hosted services or HTTP requests use in .NET).
I agree - this disposal should be the responsibility of the framework.
I gather you think I should open an issue on dotnet/extensions?
You could, but @martintmk works on both, so he can chime in on where the issue might be. If you did create a second issue, there's a good chance you'd still get asked for a repro that can be used to definitively answer whose problem it is to resolve.
I'm happy to invest in creating a repro (haven't tried yet, happy to, just been slammed with work and sick at the same time). Let's see what Martin (you're both Martin it appears, I meant @martintmk as you mentioned) has to say here and then go forward from there.
Hey @alexrosenfeld10
You can only get an ObjectDisposedException from the pipeline when its owner (the ResiliencePipelineRegistry) is disposed. The disposal of the registry can happen only when the DI container is disposed, unless you are doing something odd such as disposing it manually before that.
My theory on why this can be happening is that some service registered in the DI container is using the pipeline in its Dispose method. If the container disposes the registry before the service is disposed, the service's disposal will fail.
I tried to confirm my theory, but couldn't replicate it:
var builder = Host.CreateApplicationBuilder(args);
builder.Services.AddResiliencePipeline(
"my-pipeline",
pipelineBuilder => pipelineBuilder.AddTimeout(TimeSpan.FromSeconds(1)));
builder.Services.AddSingleton<MyService>();
var host = builder.Build();
await host.StartAsync();
var myService = host.Services.GetRequiredService<MyService>();
host.Dispose(); // This passes
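// A service that resolves a pipeline from the registry at construction and uses it while being disposed.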
class MyService(ResiliencePipelineProvider<string> provider) : IDisposable
{
private readonly ResiliencePipeline pipeline = provider.GetPipeline("my-pipeline");
public void Dispose()
=> pipeline.Execute(() => Console.WriteLine("Using the pipeline when disposing"));
}
At this point we need repro code to proceed. I suspect you might be doing something unusual that I am not able to track down. But maybe there is indeed a bug somewhere; if so, my bet is that it is outside of Polly.
@martintmk Thanks for the detailed reply. I'm using Lamar as my container, perhaps it's some interaction between the two.
btw, this was only happening when my application was restarting heavily due to some other performance issues. Since resolving those, I haven't seen it again. I'm guessing there was some "shut down immediately after starting up" or "shut down while gazillions of http requests are going" type issue that was causing this. Now that it's not happening at all anymore, I'm not planning on spending time digging into it further.
Again, thanks for the assistance here - I appreciate it, even though we didn't find the root cause, it was helpful to brainstorm a bit.
Describe the bug
When my application shuts down I wait for 10 seconds for ongoing requests, background jobs, etc. to finish up. During this time, I get an ObjectDisposedException from the resilience pipeline; the stack trace is the one walked through earlier in this thread.
Expected behavior
The pipeline shouldn't be disposed if the application is still using it
Actual behavior
The pipeline is disposed while the application is shutting down, and ongoing requests fail
Steps to reproduce
No response
Exception(s) (if any)
No response
Polly version
8.3.0, Microsoft.Extensions.Http.Resilience 8.1.0
.NET Version
mcr.microsoft.com/dotnet/aspnet:8.0
Anything else?
No response