dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.12k stars 4.7k forks source link

.Net Core 2.1 with Azure KeyVault: 'System.ExecutionEngineException' occurred in Unknown Module #10362

Closed andriysavin closed 4 years ago

andriysavin commented 6 years ago

I'm migrating a project to .Net Core 2.1/ASP.NET Core 2.1 RC1. All works nicely on a local windows 10 machine, but fails to start when deployed to Azure AppService (WebApp). After long investigation I was able to catch this error using remote debugging, and it looks like this:

image

"An unhandled exception of type 'System.ExecutionEngineException' occurred in Unknown Module. Occurred"

I narrowed the problem down and it turned to be rooted at KeyVault configuration provider from ASP.NET Core (Microsoft.Extensions.Configuration.AzureKeyVault 2.1.0-rc1-final). I tried to strip out the configuration provider wrapper code and use KeyVaultClient directly. The error happens while getting secrets but even before querying an access token. But the stack trace is not very informative (or the problem is in the async/await and tasks infrastructure):

image

(Only one thread was present at exception throw point of time)

So, even though this error appears when using an "external" library, I believe its source lies in CoreCLR. Also, this problem is not reproduceable when target framework is net471; and once again, its only reproduceable within Azure environment, not locally. For publishing to azure I used self-contained approach (but also tried the approach with installing ASP.NET Core 2.1 as a shared framework using web site extension).

Essentially, this problem prevents any ASP.NET Core 2.1 app from working if it uses Azure KeyVault - and this can be a huge impact, especially having MS claimed this RC can be used for production.

Some dotnet/system info from AppService console:

\> dotnet --info
.NET Core SDK (reflecting any global.json):
 Version:   2.1.300-rc1-008673
 Commit:    f5e3ddbe73

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.14393
 OS Platform: Windows
 RID:         win10-x86
 Base Path:   D:\Program Files (x86)\dotnet\sdk\2.1.300-rc1-008673\

Host (useful for support):
  Version: 2.1.0-rc1
  Commit:  eb9bc92051

.NET Core SDKs installed:
  1.1.7 [D:\Program Files (x86)\dotnet\sdk]
  1.1.8 [D:\Program Files (x86)\dotnet\sdk]
  2.1.4 [D:\Program Files (x86)\dotnet\sdk]
  2.1.101 [D:\Program Files (x86)\dotnet\sdk]
  2.1.300-rc1-008673 [D:\Program Files (x86)\dotnet\sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.1.0-rc1-final [D:\Program Files (x86)\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.1.0-rc1-final [D:\Program Files (x86)\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 1.0.9 [D:\Program Files (x86)\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 1.0.10 [D:\Program Files (x86)\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 1.1.6 [D:\Program Files (x86)\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 1.1.7 [D:\Program Files (x86)\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.0.5 [D:\Program Files (x86)\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.0.6 [D:\Program Files (x86)\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.1.0-rc1 [D:\Program Files (x86)\dotnet\shared\Microsoft.NETCore.App]

I've uploaded a sample project for reproducing the problem. Note that you will have to fill some configuration values I had to strip out as sensitive info, like clientId, cert thumbprint and vault name. However, I think that using a self-signed cert at correct location will be enough, and for client id an arbitrary string can be used, since the app crashes before authentication starts. https://1drv.ms/u/s!AuaAKPMkiTEAxqgI1XPVLhr0TpFdSg

RussKeldorph commented 6 years ago

@jkotas @stephentoub Is there enough here for runtime or FX to investigate, or should we move to aspnet/Home?

RussKeldorph commented 6 years ago

@muratg

jkotas commented 6 years ago

@RussKeldorph The minimal repro is not using ASP.NET at all. The repro is using X509 certificates and networking from CoreFX. So the problem is likely either in those; or in the runtime itself.

@andriysavin Would it be possible to attach native code debugger to the process and look for thread that hit the crash? It would help us to narrow the root cause.

@bartonjs Could you please try to reproduce this issue? I am having troubles to setup the keyvault right to make the repro work.

andriysavin commented 6 years ago

@jkotas I can try to do that and post a screenshot with threads, stacks etc. here. I can't provide you with access to that environment, unfortunately.

jkotas commented 6 years ago

Yes, that's what I meant. I understand you cannot give us access. Thank you!

andriysavin commented 6 years ago

@jkotas I've just remembered, when you attach remote debugger to azure app service, it doesn't give you ability to choose types of debugger and attaches only managed debugger. Any advice how to choose both managed and native?

jkotas commented 6 years ago

@tommcdon Is there a way to attach native debugger to crashed Azure AppService service?

andriysavin commented 6 years ago

@jkotas Ok, I figured it out. However, I'm not able to reproduce this issue anymore. The only exception I could catch is on the screenshot below. I tried to switch my whole web app back to CoreCLR and publish - and now it works. That's weird, since previously I even tried to recreate the azure app service, and the issue was reproduceable. I'll play with reproducing this a bit more tomorrow.

image

andriysavin commented 6 years ago

Some news: the reason bug stopped to reproduce is somehow connected with networking. The web app uses network integration app service feature to access an on-premises database. Recently that connection was broken because some updates to VPN Gateway was made. So the app was not able to connect the on-prem DB even though network integration was enabled, but was able to start, load KeyVault secrets etc. And this was just at the same time as the bug stopped to reproduce. Today we fixed network integration, and the bug immediately started appearing again. I'm trying to get some details from remote debugging.

andriysavin commented 6 years ago

Here is the call stack of the thread where the exception is thrown: image

andriysavin commented 6 years ago

Parallel Stacks: image

andriysavin commented 6 years ago

Threads: image

jkotas commented 6 years ago

Thanks, this is very useful. The picohelper.dll where this is crashing is a special AppServices .dll. It is not present when you run on your local machine.

@davidebbo Could you or somebody on your team please take a look?

andriysavin commented 6 years ago

@jkotas Is that related to windows pico processes? (just curious)

jkotas commented 6 years ago

I do not know - I do not work on AppServices. The name suggests that it may be the case.

davidebbo commented 6 years ago

Adding @bilalalam123 & @stekou to comment on picohelper part.

So the minimal repro is the one above from @andriysavin? And that's just a console app that we can drop in an Azure Web App and see the issue?

andriysavin commented 6 years ago

@davidebbo not that simple, the app fails when communicating with Azure KeyVault, so to reproduce you will have to setup KeyVault. Also, as I commented above, it may or may not reproduce depending on presence of Network Integration feature.

davidebbo commented 6 years ago

Ok, I'll try to find the right people on the team to help take a look.

Would you be able to share the name of an App Service where you are able to repro? And the approximate UTC time where you hit this? This will help us look things up on our side. Thanks!

andriysavin commented 6 years ago

I've created a dedicated app service named TestKeyVaultWithNetCore21RC1. Time when it crashed: 10:28 PM UTC. Note: again, it was working correctly until I added Network Integration and restarted the service.

davidebbo commented 6 years ago

@andriysavin thanks much!

davidebbo commented 6 years ago

@andriysavin and to clarify, is this a repro that you ran as a Web App (triggered by an http request), or as a Console app that you ran manually via Kudu Console?

andriysavin commented 6 years ago

Manually.

andriysavin commented 6 years ago

Good news. I was able to reduce the application to a very trivial code with no additional dependencies like key vault. Note: this is reproduceable when running either manually from kudu console or when it's run after making an http request to the web app's URL. I use the first approach (with commented out WaitForDebugger()) to check fast if it's failing after deployment, and the second one to attach remote debugger.

using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

namespace TestKeyVaultWithNetCore21RC1
{
    class Program
    {
        static async Task Main()
        {
            // Uncomment to use remote debugging with Azure.
            WaitForDebugger();

            Console.WriteLine("Connecting...");

            HttpClient httpClient = new HttpClient();

            var task = httpClient.GetAsync("https://microsoft.com");

            Console.WriteLine("Awaiting task...");

            // Awaiting the task separately to better understand
            // if the exception is thrown during async operation initiation
            // or during awaiting it.
            var response = await task;

            // ExecutionEngineException is thrown before we get here.

            Console.WriteLine("Response: " + response.StatusCode);
        }

        private static void WaitForDebugger()
        {
            while (!Debugger.IsAttached)
            {
                Console.WriteLine("Waiting for debugger");
                Thread.Sleep(2000);
            }

            Debugger.Break();
        }

    }
}
davidebbo commented 6 years ago

@andriysavin I was not able to repro this using the following steps:

And it seems to work fine:

D:\home\tmp>dotnet TestKeyVaultWithNetCore21RC1.dll
Connecting...
Awaiting task...
Response: OK

Are you doing anything differently?

andriysavin commented 6 years ago

@davidebbo The key point is that in my app service there is Network Integration enabled (read my comment above). Without it the bug doesn't reproduce.

davidebbo commented 6 years ago

@andriysavin Ok, the issue is now understood. Can you please share your app name, directly or indirectly, so we can prioritize fixing it? Thanks!

andriysavin commented 6 years ago

@davidebbo By this

the issue is now understood

do you mean you were able to grasp where the issue comes from?

And by

share your app name

do you mean the full app which is affected by the bug, or the sample service is enough? The thing is that currently I don't have any real apps working on .Net Core 2.1 runtime (since it's just doesn't work). Some remain to run on .Net Core 2.0, others run on .Net Framework runtime (one with ASP.NET Core 2.1 on top of it). But you can use the app I previously named (TestKeyVaultWithNetCore21RC1), it's almost what the indirect sharing approach suggests.

davidebbo commented 6 years ago

@andriysavin yes, the engineer who owns this area understands the problem.

Ok, let's start by looking at the app TestKeyVaultWithNetCore21RC1.

andriysavin commented 6 years ago

@davidebbo I'm happy to hear that! If it will still be hard to catch that error, I can provide you with a memory dump of the process at time of crash (only if my employer allows me to, of course).

aleksbol commented 6 years ago

@andriysavin I've identified the cause of this and have made a fix in App Service. It should be getting deployed over the course of the next week or two.

andriysavin commented 6 years ago

Thanks @aleksbol and everybody, that's great news! Will be checking for the fix periodically.

RussKeldorph commented 6 years ago

Sounds like there's nothing to track in this repo. Feel free to re-open if I misunderstood.

andriysavin commented 6 years ago

Today I was able to run the test app as well as full web app on netcoreapp21 RTM runtime in the AppService. Not sure if that's because of the fix or because of the runtime update (to RTM), but currently all works fine, thanks everybody again!