dotnet / aspnetcore

ASP.NET Core is a cross-platform .NET framework for building modern cloud-based web applications on Windows, Mac, or Linux.
https://asp.net

Improve thread shutdown management in ANCM #45066

Closed jdmerinor closed 1 month ago

jdmerinor commented 1 year ago

Is there an existing issue for this?

Describe the bug

I want to revive issue #43651, as we have now been able to create a repro case that fails essentially every time for several apps. The issue seems to happen when the CPU is under heavy load and a lot of sites are starting at the same time. Some of them start successfully, but others stay stuck initializing in a zombie-like state: the processes exist, but the sites never serve requests again, and IIS does nothing to stop or reset them either.

The problem arises for us specifically on servers that have a lot of sites installed (about 200 per box) with preload enabled for them. Some of the sites have a heavy initialization process, so every time we restart the server the box's CPU utilization maxes out for the first 10 minutes or so. After initialization is done, the CPU comes back down to normal levels.

But at this point we are left with a bunch of services that never came to life and are stuck loading aspnetcorev2_inprocess.dll, as explained in the previous issue. We noticed a pattern in the memory consumption of the zombie processes, so they are easy to spot once you know what to look for. In the repro case I created, "SlowStartWebApi" represents an app that needs to do some initialization when starting (and maxes out the CPU). Once the apps are initialized, memory is supposed to look like this (around 8 MB): normal memory consumption

But the dead ones look something like this (around 4 MB):

dead memory consumption
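As a quick way to spot the stuck workers, memory for all IIS worker processes can also be dumped programmatically. This is just an illustration (not part of the repro repository), and it assumes the screenshots above are showing roughly the private memory of each w3wp.exe:

```csharp
using System;
using System.Diagnostics;
using System.Linq;

// List every IIS worker process with its private memory, so the ~4 MB "zombies"
// stand out from the ~8 MB healthy processes described above.
foreach (var process in Process.GetProcessesByName("w3wp").OrderBy(p => p.PrivateMemorySize64))
{
    Console.WriteLine($"PID {process.Id}: {process.PrivateMemorySize64 / 1024 / 1024} MB");
}
```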

Description from old issue #43651, included as a reference

We've recently converted one of our web applications to ASP.NET Core on .NET 6 (from ASP.NET MVC/WebAPI on .NET Framework 4.8) and the new version is slowly rolling out to our customer base. This is hosted in IIS. We've had this running in production for several weeks now with no problems... until now.

This morning one of those sites appears to be hanging and was not responding to web requests. Upon further investigation by taking a process dump of w3wp.exe and examining it, it appears that the ASP.NET Core Module had hung during application initialization.

Extremely strangely, this occurred identically across two different servers, each serving the same site. The affected application is also the only ASP.NET Core application on each of those servers.

The stack trace of the only thread actually doing any kind of work is:

ntdll.dll!NtWaitForSingleObject()
ntdll.dll!LdrpDrainWorkQueue()
ntdll.dll!LdrpLoadDllInternal()
ntdll.dll!LdrpLoadDll()
ntdll.dll!LdrLoadDll()
KERNELBASE.dll!LoadLibraryExW()
aspnetcorev2.dll!HandlerResolver::LoadRequestHandlerAssembly(const IHttpApplication & pApplication, const std::filesystem::path & shadowCopyPath, const ShimOptions & pConfiguration, std::unique_ptr<ApplicationFactory,std::default_delete<ApplicationFactory>> & pApplicationFactory, ErrorContext & errorContext) Line 111
    at D:\a\_work\1\s\src\Servers\IIS\AspNetCoreModuleV2\AspNetCore\HandlerResolver.cpp(111)
aspnetcorev2.dll!HandlerResolver::GetApplicationFactory(const IHttpApplication & pApplication, const std::filesystem::path & shadowCopyPath, std::unique_ptr<ApplicationFactory,std::default_delete<ApplicationFactory>> & pApplicationFactory, const ShimOptions & options, ErrorContext & errorContext) Line 172
    at D:\a\_work\1\s\src\Servers\IIS\AspNetCoreModuleV2\AspNetCore\HandlerResolver.cpp(172)
aspnetcorev2.dll!APPLICATION_INFO::TryCreateApplication(IHttpContext & pHttpContext, const ShimOptions & options, ErrorContext & error) Line 195
    at D:\a\_work\1\s\src\Servers\IIS\AspNetCoreModuleV2\AspNetCore\applicationinfo.cpp(195)
aspnetcorev2.dll!APPLICATION_INFO::CreateApplication(IHttpContext & pHttpContext) Line 106
    at D:\a\_work\1\s\src\Servers\IIS\AspNetCoreModuleV2\AspNetCore\applicationinfo.cpp(106)
aspnetcorev2.dll!APPLICATION_INFO::CreateHandler(IHttpContext & pHttpContext, std::unique_ptr<IREQUEST_HANDLER,IREQUEST_HANDLER_DELETER> & pHandler) Line 63
    at D:\a\_work\1\s\src\Servers\IIS\AspNetCoreModuleV2\AspNetCore\applicationinfo.cpp(63)
aspnetcorev2.dll!ASPNET_CORE_PROXY_MODULE::OnExecuteRequestHandler(IHttpContext * pHttpContext, IHttpEventProvider * __formal) Line 103
    at D:\a\_work\1\s\src\Servers\IIS\AspNetCoreModuleV2\AspNetCore\proxymodule.cpp(103)
iiscore.dll!NOTIFICATION_CONTEXT::RequestDoWork()
iiscore.dll!NOTIFICATION_CONTEXT::CallModulesInternal()
iiscore.dll!NOTIFICATION_CONTEXT::CallModules(int,unsigned long,long,unsigned long,class W3_CONTEXT_BASE *,class IHttpEventProvider *)
iiscore.dll!NOTIFICATION_MAIN::DoWork()
iiscore.dll!W3_CONTEXT_BASE::StartNotificationLoop(class NOTIFICATION_CONTEXT *,int)
iiscore.dll!APPLICATION_PRELOAD_PROVIDER::ExecuteRequest(class IHttpContext *,class IHttpUser *)
warmup.dll!DoApplicationPreload(class IGlobalApplicationPreloadProvider *)
iiscore.dll!W3_SERVER::GlobalNotify()
iiscore.dll!W3_SERVER::NotifyApplicationPreload(int)
iiscore.dll!IISCORE_PROTOCOL_MANAGER::PreloadApplication(unsigned long,unsigned short const *,int)
w3wphost.dll!W3WP_HOST::ProcessHttpPreloadApplications(int)
w3wphost.dll!W3WP_HOST::ProcessPreloadApplications(unsigned long)
w3wphost.dll!WP_IPM::AcceptMessage()
iisutil.dll!IPM_MESSAGE_PIPE::MessagePipeCompletion(void *,unsigned char)
ntdll.dll!RtlpTpWaitCallback()
ntdll.dll!TppExecuteWaitCallback()
ntdll.dll!TppWorkerThread()
kernel32.dll!BaseThreadInitThunk()
ntdll.dll!RtlUserThreadStart()
The path parameter being passed by `aspnetcorev2.dll`'s `LoadRequestHandlerAssembly` to `LoadLibrary` is `C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App\6.0.7\aspnetcorev2_inprocess.dll`.

The stack trace also suggests to me that this was happening in application preloading (we do have preload enabled).

I have a process dump available on request, if you have somewhere secure that I can upload it.

Expected Behavior

After the slow start of all sites, all of them should be initialized correctly and able to serve requests properly.

Steps To Reproduce

I can provide a VM with everything already installed that fails every time there is a reset. I can also provide memory dumps of the failing processes if required.

Introduction to the repro project

I uploaded the basic project to https://github.com/jdmerinor/aspdotnetcorehangingreprocase. It contains 3 pieces:

  1. SlowStartWebApi, which represents the slow-start service that consumes a good chunk of CPU. It is nothing special, just the typical .NET 6 example weather web API, but with these lines added to burn CPU before calling `app.Run();`:
// Requires using System.Diagnostics, System.Security.Cryptography, and System.Text.
var stopwatch = Stopwatch.StartNew();
var count = 0;
using var pacho = SHA512.Create();
var buffer = Encoding.ASCII.GetBytes("sadfhasdhfhasdfklsadjhfklsdahfojhdsaf");

// Hash in a tight loop for 1.5 minutes to keep the CPU busy before the app starts listening.
while (stopwatch.Elapsed < TimeSpan.FromMinutes(1.5))
{
    count--;
    pacho.ComputeHash(buffer);
    count++;
}
  2. AppInstaller, which uses Microsoft.Web.Administration to install all the sites (100 of them) in a convenient way with the following settings (a fuller sketch of the installer appears after this list):
    
    var appPool = serverManager.ApplicationPools.Add(appPoolName);
    appPool.ProcessModel.LoadUserProfile = false;
    appPool.Recycling.PeriodicRestart.Time = TimeSpan.Zero;
    appPool.ManagedPipelineMode = ManagedPipelineMode.Integrated;
    appPool.StartMode = StartMode.AlwaysRunning;
    appPool.ProcessModel.IdleTimeout = TimeSpan.Zero;
    appPool.Enable32BitAppOnWin64 = false;
    appPool.ManagedRuntimeVersion = string.Empty;

application.ApplicationPoolName = appPool.Name;
application["preloadEnabled"] = true; // IMPORTANT FOR THE REPRODUCTION


  3. [UIServer](https://github.com/jdmerinor/aspdotnetcorehangingreprocase/tree/main/UIServer), which is just a normal .NET 6 example weather web API. I added it because I wasn't sure whether reproducing the bug required more than one app per site... I have the feeling it will probably still fail without it, but I left it here for completeness.
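For context, here is roughly how an installer can wire those app pool and preload settings together with Microsoft.Web.Administration. This is only a sketch, not the actual code from the repro repository: the site names, bindings, and physical paths are placeholders to adjust to your own layout.

```csharp
using System;
using Microsoft.Web.Administration;

using var serverManager = new ServerManager();

for (var i = 0; i < 100; i++)
{
    // Application pool settings taken from the snippet above.
    var appPool = serverManager.ApplicationPools.Add($"SlowStartPool{i}");
    appPool.ProcessModel.LoadUserProfile = false;
    appPool.Recycling.PeriodicRestart.Time = TimeSpan.Zero;
    appPool.ManagedPipelineMode = ManagedPipelineMode.Integrated;
    appPool.StartMode = StartMode.AlwaysRunning;
    appPool.ProcessModel.IdleTimeout = TimeSpan.Zero;
    appPool.Enable32BitAppOnWin64 = false;
    appPool.ManagedRuntimeVersion = string.Empty;

    // Placeholder binding and physical path; the repro installer will differ.
    var site = serverManager.Sites.Add($"slowstart{i}", "http", $"*:80:slowstart{i}.local", $@"C:\sites\slowstart{i}");
    var application = site.Applications["/"];
    application.ApplicationPoolName = appPool.Name;
    application["preloadEnabled"] = true; // IMPORTANT FOR THE REPRODUCTION
}

serverManager.CommitChanges();
```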

### Reproduction steps
I followed these steps to reproduce the issue:
1. Download a clean Windows Server VHD from https://www.microsoft.com/en-us/evalcenter/download-windows-server-2019
2. Add the IIS role and make sure Application Initialization (under Application Development) is also installed.
![Application initialization image](https://user-images.githubusercontent.com/14632949/201545880-0a7c3b27-91ac-40a7-96cd-d06099ae15eb.png)

3. Install the .NET 6 hosting bundle from https://dotnet.microsoft.com/en-us/download/dotnet/thank-you/runtime-aspnetcore-6.0.11-windows-hosting-bundle-installer
4. Put the apps' publish folders on the desktop (created with `dotnet publish --configuration Release`).
5. Make sure IIS_IUSRS is added to the folder permissions.
6. Make sure the server has a self-signed or valid HTTPS certificate for the sites (otherwise you will get connection refused when trying to connect to the apps over HTTPS).
7. Use the app installer to add a bunch of apps (you might need to change the code to suit your server paths).
8. Add hosts file entries so you can hit the different sites from within the server.
9. Restart the server (this will initially take a while because the "SlowServer" is doing a bunch of hashing to max out the CPU...).
10. Once the initialization is done, try to call some of the endpoints and you will see that they are not responding at all... they load forever and never recover. It's easy to spot the broken ones because they have really low memory allocated in the w3wp.exe process.

### Exceptions (if any)

_No response_

### .NET Version

6.0.11

### Anything else?

_No response_
yaakov-h commented 1 year ago

For what it's worth, this problem still occurs with parallel DLL loading disabled (MaxLoaderThreads = 1), and on Windows Server 2012, which predates parallel DLL loading.

The stack trace there differs slightly between LoadLibraryExW and NtWaitForSingleObject, but it's still pretty much the same.

ctaswell commented 1 year ago

We have experienced the problem on a Windows Server 2019 machine (with current up-to-date patches) running IIS with only several dozen websites. In other words, the problem can occur even without the "lot of sites" described above ("about 200 per box"). The websites that hang must be manually restarted after a server reboot.

adityamandaleeka commented 1 year ago

@jdmerinor Looks like this is caused by some code using TerminateThread in a case where the target thread is holding the loader lock. I'm going to look into how to fix this... including whether we can rework this code to eliminate the use of TerminateThread entirely.

The reason why you see this happening when under load is that the termination is in response to a timeout (which is of course more likely to happen when things are slow).

ctaswell commented 1 year ago

When you have a fix for this problem, please send out a notice to all of us on this thread. Do you know whether the problem also occurs on Windows Server 2022? In other words, is the immediate fix for us to upgrade from Windows Server 2019 to Windows Server 2022?

I ask because we are considering whether to buy new server hardware on which we will install Windows Server 2022... and it has been really annoying on our Windows Server 2019 machine to have to manually restart a bunch of websites, even if only several dozen rather than several hundred, let alone several thousand.

Thanks, Carl

adityamandaleeka commented 1 year ago

@ctaswell Yea, I'll update this thread when there is a fix.

I have no evidence at present to indicate this would be fixed by moving to Server 2022. Fundamentally, the code in question is doing something dangerous that it shouldn't be. That said, this issue is partially timing related, so there's a chance the behavior will change when you do the hardware update.

adityamandaleeka commented 1 year ago

As a mitigation until this issue is resolved, you could consider whether running without preload enabled is an option for your applications. Doing that could spread out the initialization costs a bit so not everything is slamming the CPU at the same time. The OP noted above that they don't see this issue when preload is disabled.
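If disabling preload is an option, one way to turn it off for an already-installed site is via the same Microsoft.Web.Administration API used by the repro's installer. This is a minimal sketch, and the site name is a placeholder:

```csharp
using Microsoft.Web.Administration;

using var serverManager = new ServerManager();

var site = serverManager.Sites["slowstart0"]; // placeholder site name
var application = site.Applications["/"];

// Stop IIS from preloading the app; it will start on the first real request instead.
application["preloadEnabled"] = false;

var appPool = serverManager.ApplicationPools[application.ApplicationPoolName];
appPool.StartMode = StartMode.OnDemand;

serverManager.CommitChanges();
```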

danmoseley commented 1 year ago

consider whether running without preload enabled is an option for your applications

If preload is important, could it potentially be achieved manually using a script that hits the apps, doing so progressively enough that they don't hit this bug?
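For illustration only (not something posted in this thread), such a staggered warm-up could be as simple as hitting each site in sequence with a delay between requests; the hostnames and delay below are placeholders:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Placeholder list of site URLs; in practice these would come from IIS or a config file.
var sites = new[] { "https://slowstart0.local", "https://slowstart1.local" };

using var client = new HttpClient { Timeout = TimeSpan.FromMinutes(5) };

foreach (var site in sites)
{
    try
    {
        // One request is enough to get the app started; the response body doesn't matter.
        using var response = await client.GetAsync(site);
        Console.WriteLine($"{site}: {(int)response.StatusCode}");
    }
    catch (Exception ex)
    {
        Console.WriteLine($"{site}: {ex.Message}");
    }

    // Stagger the starts so initialization work doesn't pile up on the CPU all at once.
    await Task.Delay(TimeSpan.FromSeconds(30));
}
```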

yaakov-h commented 1 year ago

@danmoseley That would reduce the chance of triggering this bug, but it can still happen even without preload enabled under other conditions involving high CPU utilisation.

I don't think we posted it in this thread, but it was in the correspondence with Premier Support - the main alternative repro we have for this is running 7-Zip to do a large compression job whilst starting a single ASP.NET Core site.

adityamandaleeka commented 1 year ago

@ctaswell @jdmerinor @yaakov-h I have some PRs (https://github.com/dotnet/aspnetcore/pull/49696 and https://github.com/dotnet/aspnetcore/pull/49942) targeted for .NET 8 RC 1 that should help with this. I effectively removed the use of TerminateThread from the filewatcher logic entirely, and for StandardStreamRedirection I opted to make the timeout configurable (in your case, since it's taking more than 2 seconds for the thread to exit, you'd set the ASPNETCORE_OUTPUT_REDIRECTION_TERMINATION_TIMEOUT_MS environment variable to something much higher than the default 2000 to give it more time to exit).
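For reference (not from the comment above): environment variables for an ASP.NET Core Module site are commonly supplied through the `<environmentVariables>` element of `<aspNetCore>` in web.config. A sketch of what that could look like is below; the process path, DLL name, and timeout value are placeholders, and whether this particular variable can be set per site this way or needs to be set at the machine or app pool level is an assumption worth verifying:

```xml
<configuration>
  <system.webServer>
    <aspNetCore processPath="dotnet" arguments=".\SlowStartWebApi.dll" hostingModel="inprocess">
      <environmentVariables>
        <!-- Give the output redirection thread more time to exit before it is terminated (milliseconds). -->
        <environmentVariable name="ASPNETCORE_OUTPUT_REDIRECTION_TERMINATION_TIMEOUT_MS" value="30000" />
      </environmentVariables>
    </aspNetCore>
  </system.webServer>
</configuration>
```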

I'm hoping that these changes will be enough for you to avoid this bug. Assuming https://github.com/dotnet/aspnetcore/pull/49942 gets merged this week it will be available in .NET 8 RC 1. It would be great if you could try out that version and provide feedback when it ships.

yaakov-h commented 1 year ago

@adityamandaleeka Thanks, we'll definitely give that a shot once it's available.

Since these are changes in the native module, would we be able to take advantage of them in a .NET 6 web application by installing the .NET 8 Windows Hosting Bundle, or would the entire web application have to be updated to .NET 8?

adityamandaleeka commented 1 year ago

Thanks @yaakov-h. I believe you are correct that you can keep your app on 6.0 and it should use the aspnetcorev2.dll dropped by the 8.0 hosting bundle, which will include this fix.

yaakov-h commented 12 months ago

@adityamandaleeka FWIW we've deployed .NET 8.0 RC1 and RC2 Hosting Bundles across our servers with ASPNETCORE_OUTPUT_REDIRECTION_TERMINATION_TIMEOUT_MS set to the max (1800000) and haven't seen the original issue reoccur. 😃

adityamandaleeka commented 12 months ago

That's great to hear, thanks for the update @yaakov-h!

adityamandaleeka commented 1 month ago

Closing since this is done; https://github.com/dotnet/aspnetcore/issues/49959 covers more that can be done.