dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Investigate whether we can speed up handler cleanup when app is not receiving traffic #42143

Open rynowak opened 5 years ago

rynowak commented 5 years ago

See discussion here: https://github.com/aspnet/AspNetCore/issues/3470#issuecomment-455001937

@acds reported that there was an ongoing message handler cleanup cycle with only a few handlers. This is somewhat normal, since we can only collect a handler once it's eligible to be collected (because the HttpClient instances have been GC'ed).

However, in this case the app didn't have any traffic going to it, and this was causing issues because:

The issue is that it's in a triggered Azure WebJob that fails with a timeout, as it has not completed that cleanup within the requisite 120 seconds. This can be extended, but it seems that when switched to a continuous job it keeps cleaning up for the hour until the next scheduled trigger, burning up resources for nothing rather than letting the job sleep.

We should investigate what we can do for cases like this where HttpClient factory is being used in a short-running scenario.

rynowak commented 5 years ago

@fabiocav - does WebJobs dispose the service provider for a case like this? What's causing the timeout?

acds commented 5 years ago

Think of the WebJob as a console app with a host builder configuration.

The host does not complete (and the console app does not exit) until the HttpClientFactory completes its cleanup.

The Azure App Service container running the WebJob expects, by default, that a triggered job completes within 120 seconds once it goes idle, so it just terminates the WebJob process with a Failure status rather than the Success status it should report.
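
For reference, a rough sketch of that console-app-with-host-builder shape (simplified, with illustrative names only, not the actual job code). Main does not return until the host has finished shutting down, so anything that delays host shutdown also delays process exit and runs into the triggered-job timeout:

using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

internal static class Program
{
    private static async Task Main(string[] args)
    {
        var host = new HostBuilder()
            .ConfigureServices((context, services) =>
            {
                // AddHttpClient(...), WebJobs wiring, etc.
            })
            .Build();

        await host.RunAsync(); // returns only once the host has fully shut down
    }
}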

fabiocav commented 5 years ago

@acds are you referring to a timeout waiting for the shutdown/cleanup as a result of a function execution timeout? If so, it does look like WebJobs currently disposes the host, but there's a work item to address that.

acds commented 5 years ago

Well, not an Azure Function but a WebJob triggered function, but yes, any insights on the workaround?

Given that the WebJobs v3 API is thoroughly undocumented just now... and is radically different from v2.

rynowak commented 5 years ago

One workaround would be triggering a full GC a few times when your function is shutting down.

GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();

Another workaround would be setting the handler expiry to Timeout.InfiniteTimeSpan as documented here: https://github.com/aspnet/Extensions/blob/557995ec322f1175d6d8a72a41713eec2d194871/src/HttpClientFactory/Http/src/DependencyInjection/HttpClientBuilderExtensions.cs#L502 If your job/service is short-lived there should be no drawback to that. The reason we rotate handlers is that DNS updates don't work reliably on all existing platforms.
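
For example, a minimal sketch of that registration (the client name "myclient" is just a placeholder): with an infinite lifetime the factory never expires the handler, so there is no background cleanup cycle left to wait on.

// Inside ConfigureServices; Timeout is System.Threading.Timeout.
services.AddHttpClient("myclient")
    .SetHandlerLifetime(Timeout.InfiniteTimeSpan);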

acds commented 5 years ago

Hi @rynowak, thanks... and apologies, I was away and am just getting back to this.

The GC.Collect(); does not seem to make any difference.

I have already been experimenting with the handler expiry: setting it to infinite seems to have the effect that the job still does not exit/sleep, but it no longer raises the HttpClientFactory cleanup diagnostics.

I currently have it set up as a continuous job, but want to revert to "triggered". The job is short-lived and will run every few hours. Having it deployed as "continuous" kind of defeats the object of having a triggered job.

Any other suggestions?

Thanks!

acds commented 5 years ago

@fabiocav & @rynowak & @Eilon - any updates?

Not sure if this is any use: in Rider I can pull up a memory dump and see that the HttpClients are not disposed, even though they are out of scope:

(screenshot, 2019-02-05: memory dump showing live HttpClient instances)

My default code pattern is as follows:

private async Task<System.Net.HttpStatusCode> PatchCredentials(MyType myType)
{
    HttpClient client = _httpClientFactory.CreateClient("graph"); //using a named client
    HttpContent content = new StringContent(SerializeCredentials.ToJson(myType), Encoding.UTF8, "application/json");

    using (var httpResponse = await client.PatchAsync("", content))
    {
        httpResponse.EnsureSuccessStatusCode();
        return httpResponse.StatusCode;
    }
}

the client factory is configured as follows:

.ConfigureServices((context, services) =>
{
...

    services.AddTransient<GraphAuthenticationHandler>();
    services.AddHttpClient("graph", c =>
        {
            c.BaseAddress = new Uri($"{myURI}");
        }).AddHttpMessageHandler<GraphAuthenticationHandler>().SetHandlerLifetime(Timeout.InfiniteTimeSpan);

...
}

what else should I be looking at...?

Edit:

Additional info:

On reviewing memory before I garbage collect, this is the number of instances:

(screenshot, 2019-02-06: instance counts before GC)

After the first GC.Collect();

(screenshot, 2019-02-06: instance counts after the first GC.Collect())

Subsequent calls to GC.Collect(); and GC.WaitForPendingFinalizers(); make no difference...

What is holding on to the HttpClients?

Thanks for any and all help!

rynowak commented 5 years ago

@acds - sorry for the delay - I've been on vacation.

The memory dump should be able to show you why those objects are still live. https://www.jetbrains.com/help/dotmemory/Shortest_Paths_to_Roots.html You want to look for the "path to root".

@fabiocav - can you provide some background about what's keeping his function live?

The host does not complete (and the console app exit) until the HttpClientFactory completes its cleanup.

What is the mechanism that does this? We don't have a feature for forcing client factory cleanup so how does this work?

acds commented 5 years ago

Hi @rynowak & @fabiocav

Here is some output from dotMemory:

Here is an Expansion of each:

  1. QuickPulseTelemetryModule (2 instances) (screenshot, 2019-02-12)
  2. Configuration Root (Key Vault connection - 1 instance) (screenshot, 2019-02-12)
  3. Lazy (1 instance) (screenshot, 2019-02-12)
  4. HttpClient - I'm not holding any references... (screenshot, 2019-02-12)

Any thoughts? Looks like possible memory leaks (or, as you point out, pending "cleanup") in the runtime around Application Insights, the Key Vault configuration integration, and HttpClientFactory(?)

It's a chicken-and-egg situation: the WebJob's await host.RunAsync(); won't shut down until the references are cleared, but some of these references are being held by the host... (via the Configuration/Key Vault integration, the App Insights integration, and, it seems, the Lazy<> references in the HttpClientFactory)!

I was suspecting that AzureServiceTokenProvider, as part of Microsoft.Azure.Services.AppAuthentication, was involved, but it seems the .NET Core implementation does not leverage the HttpClientFactory internally.

rynowak commented 5 years ago

So that last entry points out that there are three HttpClient instances held in statics, including one in a Lazy<HttpClient>.
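
For illustration only (hypothetical code, not taken from the dump), a static like this stays rooted for the lifetime of the process, so the HttpClient it creates, and the handler chain behind it, can never become eligible for collection no matter how often the factory's cleanup runs:

using System;
using System.Net.Http;

internal static class GraphApi
{
    // The static field roots the Lazy<HttpClient> (and thus its handler) until process exit.
    private static readonly Lazy<HttpClient> _client =
        new Lazy<HttpClient>(() => new HttpClient { BaseAddress = new Uri("https://example.invalid/") });

    public static HttpClient Client => _client.Value;
}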

It's a chicken-and-egg situation: the WebJob's await host.RunAsync(); won't shut down until the references are cleared, but some of these references are being held by the host... (via the Configuration/Key Vault integration, the App Insights integration, and, it seems, the Lazy<> references in the HttpClientFactory)!

@mathewc @fabiocav - is there any more information you can provide? I don't understand what WebJobs specific mechanism is causing problems here.

acds commented 5 years ago

TBH: on reflection there seems to be a conflict between using a TimerTrigger job and the expectation that it can be deployed as "triggered" rather than "continuous", as the TimerTrigger job does not end... it just idles until the next trigger time.

While idle between "continuous" timer-triggered runs, the HttpClientFactory keeps trying to clean up any HttpClient/handler. It can't, as the WebJob runtime is holding onto at least two of them (in my case, the Key Vault integration and App Insights). So at least setting SetHandlerLifetime(Timeout.InfiniteTimeSpan) eliminates the logging noise.

There seem to be (undocumented) design decisions/direction to move away from "triggered" WebJob deployments in general.

That being said, in hindsight it seems it may be better to move the functionality to an IHostedService running as a background task in the main Web/API app and eliminate the deployment complexity, along the lines of the sketch below.
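
A minimal sketch of that IHostedService approach (the class name, the two-hour interval, and the wiring are illustrative; it assumes the same "graph" named client registration shown earlier):

using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

public class GraphPatchService : BackgroundService
{
    private readonly IHttpClientFactory _httpClientFactory;

    public GraphPatchService(IHttpClientFactory httpClientFactory)
    {
        _httpClientFactory = httpClientFactory;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        try
        {
            while (!stoppingToken.IsCancellationRequested)
            {
                var client = _httpClientFactory.CreateClient("graph");
                // ... do the short-lived work here ...
                await Task.Delay(TimeSpan.FromHours(2), stoppingToken); // run every few hours
            }
        }
        catch (OperationCanceledException)
        {
            // Normal shutdown; nothing to clean up here.
        }
    }
}

// Registered in ConfigureServices alongside the AddHttpClient("graph", ...) call:
// services.AddHostedService<GraphPatchService>();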

At the end of the day, the problems here all tie back to the lack of any real WebJobs 3.0 documentation, and there is likely no issue here...

One hopes that Microsoft does not intend to release future versions without the requisite documentation, to avoid churn as we learn the hard way.

fabiocav commented 5 years ago

@acds apologies for the delayed response. Somehow, I managed to miss notifications on this issue.

You're correct on what you've outlined above.

Adding @mathewc as he can also provide more context on the WebJobs infrastructure, if needed.

srogovtsev commented 4 years ago

I am terribly sorry for barging in on a distantly connected issue, but I was trying to find a follow-up on aspnet/HttpClientFactory#165, which was created from dotnet/aspnetcore#3470, which in turn led to this issue; they are all closed for comments, and I really don't think this merits a new issue in itself.

Now, I fully understand what @davidfowl says:

The logs seem to indicate that things are never going out of scope, and hence the supposed leak. What's happening is that the timer used to expire the handler has a reference to a callback, which seemingly keeps the HttpMessageHandler alive.

When a GC happens things are cleaned up as expected. Until then, the background scanner will keep looping until that happens.

But I am still worried: if for some reason the GC doesn't clean things up for some time, I might end up with quite a few undisposed handlers linked there. Wouldn't that pose an issue with socket exhaustion, which was one of the original reasons for reusing handlers? Especially since I am on .NET 4.8, not Core, so I don't have access to SocketsHttpHandler.

ghost commented 4 years ago

Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.
