dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Investigate whether we can speed up handler cleanup when app is not receiving traffic #42143

Open rynowak opened 5 years ago

rynowak commented 5 years ago

See discussion here: https://github.com/aspnet/AspNetCore/issues/3470#issuecomment-455001937

@acds reported that there was an ongoing message handler cleanup cycle with only a few handlers. This is somewhat normal, since we can only collect a handler once it's eligible to be collected (because the HttpClient instances have been GC'ed).

However, in this case the app didn't have any traffic going to it, and this was causing issues because:

The issue is that it's in a triggered Azure WebJob that fails with a timeout, as it has not completed that cleanup within the requisite 120 seconds. This can be extended, but it seems that when switched to a continuous job it keeps cleaning up for the hour until the next scheduled trigger, burning up resources for nothing rather than letting the job sleep.

We should investigate what we can do for cases like this where HttpClient factory is being used in a short-running scenario.

rynowak commented 5 years ago

@fabiocav - does WebJobs dispose the service provider for a case like this? What's causing the timeout?

acds commented 5 years ago

Think of the WebJob as a console app with a host builder configuration.

The host does not complete (and the console app does not exit) until the HttpClientFactory completes its cleanup.

The Azure App Service container running the WebJob expects, by default, that a triggered job completes within 120 seconds once it goes idle, so it just terminates the WebJob process with a Failure status rather than the Success status it should report.
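
For reference, a rough sketch of that console-app-with-host-builder shape (simplified, with illustrative names only, not the actual job code). Main does not return until the host has finished shutting down, so anything that delays host shutdown also delays process exit and runs into the triggered-job timeout:

using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

internal static class Program
{
    private static async Task Main(string[] args)
    {
        var host = new HostBuilder()
            .ConfigureServices((context, services) =>
            {
                // AddHttpClient(...), WebJobs wiring, etc.
            })
            .Build();

        await host.RunAsync(); // returns only once the host has fully shut down
    }
}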

fabiocav commented 5 years ago

@acds are you referring to a timeout waiting for the shutdown/cleanup as a result of a function execution timeout? If so, it does look like WebJobs currently disposes the host, but there's a work item to address that.

acds commented 5 years ago

Well, not an Azure Function but a WebJob triggered function, but yes, any insights on the workaround?

Given that the WebJobs v3 API is thoroughly undocumented just now... and is radically different from v2.

rynowak commented 5 years ago

One workaround would be triggering a full GC a few times when your function is shutting down.

GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();

Another workaround would be setting the handler expiry to Timeout.InfiniteTimeSpan as documented here: https://github.com/aspnet/Extensions/blob/557995ec322f1175d6d8a72a41713eec2d194871/src/HttpClientFactory/Http/src/DependencyInjection/HttpClientBuilderExtensions.cs#L502 If your job/service is short-lived there should be no drawback to that. The reason we rotate handlers is that DNS updates don't work reliably on all existing platforms.
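
For example, a minimal sketch of that registration (the client name "myclient" is just a placeholder): with an infinite lifetime the factory never expires the handler, so there is no background cleanup cycle left to wait on.

// Inside ConfigureServices; Timeout is System.Threading.Timeout.
services.AddHttpClient("myclient")
    .SetHandlerLifetime(Timeout.InfiniteTimeSpan);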

acds commented 5 years ago

Hi @rynowak, thanks... and apologies, I was away and am just getting back to this.

The GC.Collect(); does not seem to make any difference.

I have already been experimenting with the handler expiry: setting it to infinite seems to have the effect that the job still does not exit/sleep, but it no longer raises the HttpClientFactory cleanup diagnostics.

I currently have it set up as a continuous job, but want to revert to "triggered". The job is short-lived and will run every few hours. Having it deployed as "continuous" kind of defeats the object of having a triggered job.

Any other suggestions?

Thanks!

acds commented 5 years ago

@fabiocav & @rynowak & @Eilon - any updates?

Not sure if this is any use: in Rider I can pull up a memory dump and see that the HttpClients are not disposed, even though they are out of scope:

(screenshot, 2019-02-05: memory dump showing live HttpClient instances)

My default code pattern is as follows:

private async Task<System.Net.HttpStatusCode> PatchCredentials(MyType myType)
{
    HttpClient client = _httpClientFactory.CreateClient("graph"); //using a named client
    HttpContent content = new StringContent(SerializeCredentials.ToJson(myType), Encoding.UTF8, "application/json");

    using (var httpResponse = await client.PatchAsync("", content))
    {
        httpResponse.EnsureSuccessStatusCode();
        return httpResponse.StatusCode;
    }
}

the client factory is configured as follows:

.ConfigureServices((context, services) =>
{
...

    services.AddTransient<GraphAuthenticationHandler>();
    services.AddHttpClient("graph", c =>
        {
            c.BaseAddress = new Uri($"{myURI}");
        }).AddHttpMessageHandler<GraphAuthenticationHandler>().SetHandlerLifetime(Timeout.InfiniteTimeSpan);

...
}

what else should I be looking at...?

Edit:

Additional info:

On reviewing memory before I garbage collect, this is the number of instances:

(screenshot, 2019-02-06: instance counts before GC)

After the first GC.Collect();

(screenshot, 2019-02-06: instance counts after the first GC.Collect())

Subsequent calls to GC.Collect(); and GC.WaitForPendingFinalizers(); make no difference...

What is holding on to the HttpClients?

Thanks for any and all help!

rynowak commented 5 years ago

@acds - sorry for the delay - I've been on vacation.

The memory dump should be able to show you why those objects are still live. https://www.jetbrains.com/help/dotmemory/Shortest_Paths_to_Roots.html You want to look for the "path to root".

@fabiocav - can you provide some background about what's keeping his function live?

The host does not complete (and the console app exit) until the HttpClientFactory completes its cleanup.

What is the mechanism that does this? We don't have a feature for forcing client factory cleanup so how does this work?

acds commented 5 years ago

Hi @rynowak & @fabiocav

Here is some output from dotMemory:

Here is an Expansion of each:

  1. QuickPulseTelemetryModule (2 instances) (screenshot, 2019-02-12)
  2. Configuration Root (Key Vault connection - 1 instance) (screenshot, 2019-02-12)
  3. Lazy (1 instance) (screenshot, 2019-02-12)
  4. HttpClient - I'm not holding any references... (screenshot, 2019-02-12)

Any thoughts? Looks like possible memory leaks (or, as you point out, pending "cleanup") in the runtime around Application Insights, the Key Vault configuration integration, and HttpClientFactory(?)

It's a chicken-and-egg situation: the WebJob's await host.RunAsync(); won't shut down until the references are cleared, but some of these references are being held by the host... (via the Configuration/Key Vault integration, the App Insights integration, and, it seems, the Lazy<> references in the HttpClientFactory)!

I was suspecting that AzureServiceTokenProvider, as part of Microsoft.Azure.Services.AppAuthentication, was involved, but it seems the .NET Core implementation does not leverage the HttpClientFactory internally.

rynowak commented 5 years ago

So that last entry points out that there are three HttpClient instances held in statics, including one in a Lazy<HttpClient>.
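
For illustration only (hypothetical code, not taken from the dump), a static like this stays rooted for the lifetime of the process, so the HttpClient it creates, and the handler chain behind it, can never become eligible for collection no matter how often the factory's cleanup runs:

using System;
using System.Net.Http;

internal static class GraphApi
{
    // The static field roots the Lazy<HttpClient> (and thus its handler) until process exit.
    private static readonly Lazy<HttpClient> _client =
        new Lazy<HttpClient>(() => new HttpClient { BaseAddress = new Uri("https://example.invalid/") });

    public static HttpClient Client => _client.Value;
}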

It's a chicken-and-egg situation: the WebJob's await host.RunAsync(); won't shut down until the references are cleared, but some of these references are being held by the host... (via the Configuration/Key Vault integration, the App Insights integration, and, it seems, the Lazy<> references in the HttpClientFactory)!

@mathewc @fabiocav - is there any more information you can provide? I don't understand what WebJobs specific mechanism is causing problems here.

acds commented 5 years ago

TBH: on reflection there seems to be a conflict between using a TimerTrigger job and the expectation that it can be deployed as "triggered" rather than "continuous", as the TimerTrigger job does not end... it just idles until the next trigger time.

While idle between "continuous" timer-triggered runs, the HttpClientFactory keeps trying to clean up any HttpClient/handler. It can't, as the WebJob runtime is holding onto at least two of them (in my case, the Key Vault integration and App Insights). So at least setting SetHandlerLifetime(Timeout.InfiniteTimeSpan) eliminates the logging noise.

There seem to be (undocumented) design decisions/direction to move away from "triggered" WebJob deployments in general.

That being said, in hindsight it seems it may be better to move the functionality to an IHostedService running as a background task in the main Web/API app and eliminate the deployment complexity, along the lines of the sketch below.
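
A minimal sketch of that IHostedService approach (the class name, the two-hour interval, and the wiring are illustrative; it assumes the same "graph" named client registration shown earlier):

using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

public class GraphPatchService : BackgroundService
{
    private readonly IHttpClientFactory _httpClientFactory;

    public GraphPatchService(IHttpClientFactory httpClientFactory)
    {
        _httpClientFactory = httpClientFactory;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        try
        {
            while (!stoppingToken.IsCancellationRequested)
            {
                var client = _httpClientFactory.CreateClient("graph");
                // ... do the short-lived work here ...
                await Task.Delay(TimeSpan.FromHours(2), stoppingToken); // run every few hours
            }
        }
        catch (OperationCanceledException)
        {
            // Normal shutdown; nothing to clean up here.
        }
    }
}

// Registered in ConfigureServices alongside the AddHttpClient("graph", ...) call:
// services.AddHostedService<GraphPatchService>();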

At the end of the day, the problems here all tie back to the lack of any real WebJobs 3.0 documentation, and there is likely no issue here...

One hopes that Microsoft does not intend to release future versions without the requisite documentation, to avoid churn as we learn the hard way.

fabiocav commented 5 years ago

@acds apologies for the delayed response. Somehow, I managed to miss notifications on this issue.

You're correct on what you've outlined above.

Adding @mathewc as he can also provide more context on the WebJobs infrastructure, if needed.

srogovtsev commented 4 years ago

I am terribly sorry for barging in on a distantly connected issue, but I was trying to find a follow-up on aspnet/HttpClientFactory#165, which was created from dotnet/aspnetcore#3470, which in turn led to this issue; they are all closed for comments, and I really don't think this merits a new issue in itself.

Now, I fully understand what @davidfowl says:

The logs seem to indicate that things are never going out of scope, and hence the supposed leak. What's happening is that the timer used to expire the handler has a reference to a callback, which seemingly keeps the HttpMessageHandler alive.

When a GC happens things are cleaned up as expected. Until then, the background scanner will keep looping until that happens.

But I am still worried: if for some reason the GC doesn't clean things up for some time, I might end up with quite a few undisposed handlers linked there. Wouldn't that pose an issue with socket exhaustion, which was one of the original reasons for reusing handlers? Especially since I am on .NET 4.8, not Core, so I don't have access to SocketsHttpHandler.

ghost commented 4 years ago

Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.
