dotnet / sdk

Core functionality needed to create .NET Core projects, that is shared between Visual Studio and CLI
https://dot.net/core
MIT License
2.72k stars 1.07k forks source link

Failed to initialize CoreCLR, HRESULT: 0x80004005 #7204

Closed akusmin closed 4 years ago

akusmin commented 7 years ago
"dependencies": {
    "Microsoft.NETCore.App": {
      "version": "1.0.1",
      "type": "platform"
    },
    "Microsoft.AspNetCore.Mvc": "1.0.1",
    "Microsoft.AspNetCore.Routing": "1.0.1",
    "Microsoft.AspNetCore.Server.IISIntegration": "1.0.0",
    "Microsoft.AspNetCore.Server.Kestrel": "1.0.1",
    "Microsoft.Extensions.Configuration.EnvironmentVariables": "1.0.0",
    "Microsoft.Extensions.Configuration.FileExtensions": "1.0.0",
    "Microsoft.Extensions.Configuration.Json": "1.0.0",
    "Microsoft.Extensions.Logging": "1.0.0",
    "Microsoft.Extensions.Logging.Console": "1.0.0",
    "Microsoft.Extensions.Logging.Debug": "1.0.0",
    "Microsoft.Extensions.Options.ConfigurationExtensions": "1.0.0"
  },
"frameworks": {
    "netcoreapp1.0": {
      "imports": [
        "dotnet5.6",
        "portable-net45+win8"
      ]
    }
  },

  "runtimes": {
    "ubuntu.14.04-x64": {}
  },

Steps to reproduce

making default WebAPI project in VS publish to remote Ubuntu server dotnet WebApplication3.dll

Expected behavior

Now listening on: http://*:11901 Application started. Press Ctrl+C to shut down.

Actual behavior

Failed to initialize CoreCLR, HRESULT: 0x80004005

Environment data

dotnet --info output: Windows: .NET Command Line Tools (1.0.0-preview2-1-003177)

Product Information: Version: 1.0.0-preview2-1-003177 Commit SHA-1 hash: a2df9c2576

Runtime Environment: OS Name: Windows OS Version: 10.0.14393 OS Platform: Windows RID: win10-x64

Linux: .NET Command Line Tools (1.0.0-preview2-1-003177)

Product Information: Version: 1.0.0-preview2-1-003177 Commit SHA-1 hash: a2df9c2576

Runtime Environment: OS Name: ubuntu OS Version: 14.04 OS Platform: Linux RID: ubuntu.14.04-x64

Looks there is some environment issue. I succeeded to start web app on local Ubuntu but failed to start on remote. I cannot find any description of an issue. Please help to investigate.

livarcocc commented 7 years ago

You are targeting runtime 1.0.1 with this:

"Microsoft.NETCore.App": {
      "version": "1.0.1",
      "type": "platform"
    }

But you have preview2-1 installed, which brings the 1.1.0 shared framework. So, either you need to update your Microsoft.NetCore.App to 1.1.0 or you need to install the 1.0.1 shared framework in the machines where you want to run your site.

akusmin commented 7 years ago

@livarcocc yes. is was wrong configuration, but with Microsoft.NetCore.App 1.1.0 same story...

"dependencies": {
    "Microsoft.AspNetCore.Mvc": "1.1.0",
    "Microsoft.AspNetCore.Routing": "1.1.0",
    "Microsoft.AspNetCore.Server.IISIntegration": "1.1.0",
    "Microsoft.AspNetCore.Server.IISIntegration.Tools": "1.1.0-preview4-final",
    "Microsoft.AspNetCore.Server.Kestrel": "1.1.0",
    "Microsoft.Extensions.Configuration.EnvironmentVariables": "1.1.0",
    "Microsoft.Extensions.Configuration.FileExtensions": "1.1.0",
    "Microsoft.Extensions.Configuration.Json": "1.1.0",
    "Microsoft.Extensions.Logging": "1.1.0",
    "Microsoft.Extensions.Logging.Console": "1.1.0",
    "Microsoft.Extensions.Logging.Debug": "1.1.0",
    "Microsoft.Extensions.Options.ConfigurationExtensions": "1.1.0",
    "Microsoft.NETCore.App": {
      "version": "1.1.0",
      "type" : "platform"
    }
  },

  "tools": {
  },

  "frameworks": {
    "netcoreapp1.0": {
    }
  },

  "runtimes": {
    "ubuntu.14.04-x64": {} 
  },
.NET Command Line Tools (1.0.0-preview2-1-003177)

Product Information:
 Version:            1.0.0-preview2-1-003177
 Commit SHA-1 hash:  a2df9c2576

Runtime Environment:
 OS Name:     ubuntu
 OS Version:  14.04
 OS Platform: Linux
 RID:         ubuntu.14.04-x64

is there any way to trace what is going on?

TheRealPiotrP commented 7 years ago

I looked at the 1.1 templates here and found that we're using the netcoreapp1.1 framework. That's the only difference I can discern but you may find others if you compare the PJ files. I'm going to reopen this issue as a discussion until you get into a good state.

akusmin commented 7 years ago

@piotrpMSFT , thanks for your feedback. I've tried netcoreapp1.1 also but no result.

"dependencies": {
    "Microsoft.AspNetCore.Mvc": "1.1.0",
    "Microsoft.AspNetCore.Routing": "1.1.0",
    "Microsoft.AspNetCore.Server.IISIntegration": "1.1.0",
    "Microsoft.AspNetCore.Server.IISIntegration.Tools": "1.1.0-preview4-final",
    "Microsoft.AspNetCore.Server.Kestrel": "1.1.0",
    "Microsoft.Extensions.Configuration.EnvironmentVariables": "1.1.0",
    "Microsoft.Extensions.Configuration.FileExtensions": "1.1.0",
    "Microsoft.Extensions.Configuration.Json": "1.1.0",
    "Microsoft.Extensions.Logging": "1.1.0",
    "Microsoft.Extensions.Logging.Console": "1.1.0",
    "Microsoft.Extensions.Logging.Debug": "1.1.0",
    "Microsoft.Extensions.Options.ConfigurationExtensions": "1.1.0",
    "Microsoft.NETCore.App": {
      "version": "1.1.0",
      "type" : "platform"
    }
  },

  "tools": {
  },

  "frameworks": {
    "netcoreapp1.1": {
    }
  },

  "runtimes": {
    "ubuntu.14.04-x64": {} 
  },

Same error: Failed to initialize CoreCLR, HRESULT: 0x80004005

also let me note that i'm using dotnet-install-script command: dotnet-install.sh --version 1.0.0-preview2-1-003177 then export to PATH variable dotnet folder.

previous dotnet installation was removed using script from official manuals: https://github.com/dotnet/cli/blob/rel/1.0.0/scripts/obtain/uninstall/dotnet-uninstall-debian-packages.sh

Regards

livarcocc commented 7 years ago

@akusmin I must be missing something. You said that your environment in windows, however you are using the debian script to uninstall it?

Can you tell us what OS and architecture you are running?

akusmin commented 7 years ago

@livarcocc partially.. I mentioned two environments: Ubuntu and Windows. I'm building sources on Windows and publishing to remote Ubuntu server. Also I checked on a local Ubuntu and everything is great.

I used debian script to uninstall dotnet from remote Ubuntu server.

Remote Ubuntu desc:

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.5 LTS
Release:        14.04
Codename:       trusty
uname -p
x86_64

Regards

TheRealPiotrP commented 7 years ago

@schellap @gkhanna79 can you guys take a look? I think we've exhausted available options for debugging CoreCLR initialization issues...

akusmin commented 7 years ago

@piotrpMSFT I found solution described in the issue 1488

Setting System.GC.Server to false allowed to start application. I suppose that my issue with initializing CoreCLR has the same root with described one in the issue above.

Regards

TheRealPiotrP commented 7 years ago

@livarcocc we should really provide a better UX here. Do we have enough state to recommend action to the user? The investigation here was certainly frustrating for @akusmin and others.

richardzeng1985 commented 7 years ago

This issue also repro'ed in Windows 10 RS2 (WSL) scenarios. 1) "dotnet run" for console application - Works correctly. 2) "dotnet run" for web application - Run into the same issue.

When I try to set "System.GC.Server": false in project.json file, then this issue doesn't repro anymore.

"runtimeOptions": { "configProperties": { "System.GC.Server": false } },

richardzeng1985 commented 7 years ago

I finally figured out that this is due to the different minimal memory required when run dotnet console application and web application.

For my scenario (WSL), I set 8 GB startup memory and just enabled "Use Dynamic Memory for this virtual machine" when created the VM and the "Minimum RAM"is set 512 MB by default, thus the actual total memory on VM will be only set around 2GB. When I try to set "Minimum RAM" to 3072 MB by default Or do NOT enable the Dynamic Memory (keep the startup memory at least 3 GB), then this issue doesn't repro anymore for web application.

This issue only repro'ed when run dotnet web application, is this expected behavior for web application scenario ? I suppose we should at least give some more better error messages indicate such situation so that might help customers narrow down the root cause ?

ghost commented 7 years ago

I encountered the same issue on a fresh debian 8 (x64) server that I just installed, with 8 GB memory. I used dotnet 1.0.0-rc4-004771.

dotnet new mvc
dotnet restore
dotnet run

=> Failed to initialize CoreCLR, HRESULT: 0x80004005

Setting ServerGarbageCollection property to false in the csproj file resolves the issue:

<Project Sdk="Microsoft.NET.Sdk.Web">

  <PropertyGroup>
    <TargetFramework>netcoreapp1.0</TargetFramework>
    <ServerGarbageCollection>false</ServerGarbageCollection>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="Microsoft.AspNetCore" Version="1.0.3" />
    <PackageReference Include="Microsoft.AspNetCore.Mvc" Version="1.0.2" />
    <PackageReference Include="Microsoft.AspNetCore.StaticFiles" Version="1.0.1" />
    <PackageReference Include="Microsoft.Extensions.Logging.Debug" Version="1.0.1" />
    <PackageReference Include="Microsoft.VisualStudio.Web.BrowserLink" Version="1.0.1" />
  </ItemGroup>
</Project>
dotnet run

=> Hosting environment: Production
Content root path: [...]
Now listening on: http://localhost:5000
Application started. Press Ctrl+C to shut down.

Same results when I use netcoreapp1.1 (with dependencies in version 1.1.0) instead of netcoreapp1.0.

Note that when I use the dotnet new console template instead of dotnet new mvc, the application works correctly in all situations (with ServerGarbageCollection set to either true or false).

gkhanna79 commented 7 years ago

CC @brianrob @Maoni0 to help determine how Server GC initialization would affect minimum requirements on memory.

gkhanna79 commented 7 years ago

CC @Maoni0

Maoni0 commented 7 years ago

CC @swgillespie. Sean is getting a repro for us to debug.

swgillespie commented 7 years ago

@sebok How many cores does your machine have?

ghost commented 7 years ago

@swgillespie It has 4 cores.

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 42
Model name:            Intel(R) Core(TM) i3-2130 CPU @ 3.40GHz
Stepping:              7
CPU MHz:               1600.257
BogoMIPS:              6784.45
Virtualization:        VT-x
swgillespie commented 7 years ago

@sebok Thank you! And to be clear, are you also building on Windows and copying to Debian like @akusmin or are you compiling directly on the target machine?

ghost commented 7 years ago

I compile only on Debian, and I don't modify any file after the dotnet new command.

swgillespie commented 7 years ago

@sebok Unfortunately I'm having trouble reproing this - do you mind post/sharing the output of strace dotnet run?

ghost commented 7 years ago

Here it is: strace-dotnet-run.txt

swgillespie commented 7 years ago

@sebok Ack, sorry, I guess dotnet run runs the target process as a child process - do you mind doing it again with dotnet <path to the built .dll>? Sorry!

ghost commented 7 years ago

Ah, yes. Here is the result of dotnet <dll>: strace-dotnet-dll.txt

ghost commented 7 years ago

@swgillespie Were you able to repro? Any idea of what could be the issue? I stay available to help.

ghost commented 7 years ago

I just tried with dotnet version released today: the ASP.NET application sometimes starts successfully and sometimes I still get 0x80004005 error.

$ dotnet --info
.NET Command Line Tools (1.0.1)

Product Information:
 Version:            1.0.1
 Commit SHA-1 hash:  005db40cd1

Runtime Environment:
 OS Name:     debian
 OS Version:  8
 OS Platform: Linux
 RID:         debian.8-x64
 Base Path:   /opt/dotnet/sdk/1.0.1

$ dotnet new mvc
Content generation time: 332.792 ms
The template "ASP.NET Core Web App" created successfully.

$ dotnet restore
  Restoring packages for /home/me/essai-1.0-mvc/essai-1.0-mvc.csproj...
  Generating MSBuild file /home/me/essai-1.0-mvc/obj/essai-1.0-mvc.csproj.nuget.g.props.
  Generating MSBuild file /home/me/essai-1.0-mvc/obj/essai-1.0-mvc.csproj.nuget.g.targets.
  Writing lock file to disk. Path: /home/me/essai-1.0-mvc/obj/project.assets.json
  Restore completed in 1.25 sec for /home/me/essai-1.0-mvc/essai-1.0-mvc.csproj.

  NuGet Config files used:
      /home/me/.nuget/NuGet/NuGet.Config

  Feeds used:
      https://api.nuget.org/v3/index.json
      /home/me/.nuget/localrepo

$ dotnet run
Hosting environment: Production
Content root path: /home/me/essai-1.0-mvc
Now listening on: http://localhost:5000
Application started. Press Ctrl+C to shut down.
^CApplication is shutting down...

$ dotnet run
Failed to initialize CoreCLR, HRESULT: 0x80004005

$ dotnet run
Hosting environment: Production
Content root path: /home/me/essai-1.0-mvc
Now listening on: http://localhost:5000
Application started. Press Ctrl+C to shut down.
^CApplication is shutting down...

$ dotnet run
Failed to initialize CoreCLR, HRESULT: 0x80004005

$ dotnet run
Failed to initialize CoreCLR, HRESULT: 0x80004005

$ dotnet run
Hosting environment: Production
Content root path: /home/me/essai-1.0-mvc
Now listening on: http://localhost:5000
Application started. Press Ctrl+C to shut down.
^CApplication is shutting down...
Maoni0 commented 7 years ago

@sebok, thank you for offering to help! we are having trouble reproing this issue so I was wondering if you would be interested in doing a bit of debugging with us :) it would be most appreciated. would it be possible that we could chat on gitter (the coreclr channel)?

ghost commented 7 years ago

@Maoni0, I would indeed be interested ;) I'll join you on the channel.

Maoni0 commented 7 years ago

@sebok, thank you!

swgillespie commented 7 years ago

So @sebok, @Maoni0 and I took a look at this on Gitter and discovered that the point of failure is after the GC, during initialization of the SystemDomain appdomain. In particular, it fails here:

Breakpoint 1, SystemDomain::Init (this=0x7101f421b980 <g_pSystemDomainMemory>) at /home/ed/coreclr/src/vm/appdomain.cpp:2553
2553        m_BaseLibrary.Append(g_pwBaseLibrary);
(gdb) n
2518        if (CLRConfig::GetConfigValue(CLRConfig::EXTERNAL_ZapDisable) != 0)
(gdb) n
2523        m_pSystemAssembly = NULL;
(gdb) n
2525        DWORD size = 0;
(gdb) n
2530        hr = GetInternalSystemDirectory(NULL, &size);
(gdb) n
2531        if (hr != HRESULT_FROM_WIN32(ERROR_INSUFFICIENT_BUFFER))
(gdb) n
2535        WCHAR *buffer = m_SystemDirectory.OpenUnicodeBuffer(size-1);
(gdb) n
2536        IfFailThrow(GetInternalSystemDirectory(buffer, &size));
(gdb) n
2537        m_SystemDirectory.CloseBuffer();
(gdb) n
2538        m_SystemDirectory.Normalize();
(gdb) n
2548        m_BaseLibrary.Append(m_SystemDirectory);
(gdb) n
2549        if (!m_BaseLibrary.EndsWith(DIRECTORY_SEPARATOR_CHAR_W))
(gdb) n
2553        m_BaseLibrary.Append(g_pwBaseLibrary);
(gdb) n
2554        m_BaseLibrary.Normalize();
(gdb) n
2556        LoadBaseSystemClasses();
(gdb) n
Failed to initialize CoreCLR, HRESULT: 0x80004005
[Thread 0x7101e9874700 (LWP 28530) exited]
[Thread 0x70fd8e1fa700 (LWP 28532) exited]
[Thread 0x7101ea0ac700 (LWP 28529) exited]
[Thread 0x7101ea8ec700 (LWP 28528) exited]
[Thread 0x7101f080b700 (LWP 28527) exited]
[Thread 0x7101ebf8f700 (LWP 28526) exited]
[Thread 0x7101f1057700 (LWP 28525) exited]
[Thread 0x7101f18d8700 (LWP 28524) exited]
[Thread 0x7101f2113700 (LWP 28523) exited]
[Thread 0x7101f59b8740 (LWP 28519) exited]
[Inferior 1 (process 28519) exited with code 0211]

Looking at @sebok's strace, one of the last things that occured before initialization failed was the mapping of System.Private.CoreLib.ni.dll into the process:

stat("/opt/dotnet/shared/Microsoft.NETCore.App/1.0.3/System.Private.CoreLib.ni.dll", {st_mode=S_IFREG|0644, st_size=9310208, ...}) = 0
open("/opt/dotnet/shared/Microsoft.NETCore.App/1.0.3/System.Private.CoreLib.ni.dll", O_RDONLY) = 11
flock(11, LOCK_SH|LOCK_NB)              = 0
fcntl(11, F_SETFD, FD_CLOEXEC)          = 0
lseek(11, 0, SEEK_SET)                  = 0
read(11, "MZ\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 64) = 64
lseek(11, 128, SEEK_SET)                = 128
read(11, "PE\0\0\35\375\4\0003R#X\0\0\0\0\0\0\0\0\360\0\" \v\2\v\0\0\0\0\0"..., 264) = 264
mmap(0x180220000, 9310208, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x749f2d103000
mmap(0x749f2d103000, 4096, PROT_READ, MAP_PRIVATE|MAP_FIXED, 11, 0) = 0x749f2d103000
mmap(0x749f2d104000, 1527808, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 11, 0x1000) = 0x749f2d104000
mmap(0x749f2d279000, 126976, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 11, 0x176000) = 0x749f2d279000
mmap(0x749f2d298000, 7475200, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 11, 0x195000) = 0x749f2d298000
mmap(0x749f2d9b9000, 176128, PROT_READ, MAP_PRIVATE|MAP_FIXED, 11, 0x8b6000) = 0x749f2d9b9000

I'm not super familiar with the process of binding mscorlib but it seems like something is going wrong there that derails startup. We tried to not use the native image of S.P.CL, but the CLI did not allow it (sadly):

(gdb) run bin/Debug/netcoreapp1.1/essai-1.0-mvc.dll
Starting program: /usr/local/bin/dotnet bin/Debug/netcoreapp1.1/essai-1.0-mvc.dll
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Error: assembly specified in the dependencies manifest was not found -- package: 'runtime.debian.8-x64.Microsoft.NETCore.Runtime.CoreCLR', version: '1.1.1', path: 'runtimes/debian.8-x64/native/System.Private.CoreLib.ni.dll'
[Inferior 1 (process 28737) exited with code 0214]

Thank you @sebok for your help in troubleshooting this!

gkhanna79 commented 7 years ago

Looking at https://dotnet.myget.org/feed/dotnet-core/package/nuget/runtime.debian.8-x64.Microsoft.NETCore.Runtime.CoreCLR/1.1.1, the assembly is present in the intended location.

swgillespie commented 7 years ago

Yeah, the assembly is in its intended location (the open syscall above returns successfully) but apparently something goes awry during the binding of types within S.P.CL.

gkhanna79 commented 7 years ago

Which types?

swgillespie commented 7 years ago

@sebok Do you think you could help us one more time and step through LoadBaseSystemClasses the way were were above, so we can identify the exact point of failure?

ghost commented 7 years ago

I've determined that the exception happens in this call stack:

- SystemDomain::LoadBaseSystemClasses ==>see breackpoint 1
- AppDomain::LoadDomainAssemblyInternal
- AppDomain::LoadDomainFile ==> see breackpoint 2
- AppDomain::TryIncrementalLoad
- DomainFile::DoIncrementalLoad
- DomainAssembly::Allocate
- Assembly::Create ==> see breackpoint 3
- Assembly::Init
- Module::Create ==> see breackpoint 4
- Module::DoInit
- Module::Initialize
- Module::InitializeNativeImage
- Module::LoadHelperTable ==> see breackpoint 5
- rel32UsingJumpStub
- ExecutionManager::jumpStub
- ExecutionManager::getNextJumpStub
- EEJitManager::allocJumpStubBlock
- EEJitManager::allocCodeRaw
- EEJitManager::NewCodeHeap
- LoaderCodeHeap::CreateCodeHeap ==> see breackpoint 6

Breakpoint locations:

(gdb) b SystemDomain::LoadBaseSystemClasses
Breakpoint 1 at 0x7dfba395bc1f: file /home/ed/coreclr/src/vm/appdomain.cpp, line 2820.
(gdb) b AppDomain::LoadDomainFile
Breakpoint 2 at 0x7dfba39634ea: AppDomain::LoadDomainFile. (2 locations)
(gdb) b Assembly::Create
Breakpoint 3 at 0x7dfba3975f0d: file /home/ed/coreclr/src/vm/assembly.cpp, line 575.
(gdb) b Module::Create
Breakpoint 4 at 0x7dfba397fbfa: file /home/ed/coreclr/src/vm/ceeload.cpp, line 1463.
(gdb) b Module::LoadHelperTable
Breakpoint 5 at 0x7dfba37e7fb1: file /home/ed/coreclr/src/vm/jitinterface.cpp, line 12757.
(gdb) b LoaderCodeHeap::CreateCodeHeap
Breakpoint 6 at 0x7dfba37a96b5: file /home/ed/coreclr/src/vm/codeman.cpp, line 2070.

Execution:

(gdb) run bin/Debug/netcoreapp1.1/essai-1.0-mvc.dll
Starting program: /usr/local/bin/dotnet bin/Debug/netcoreapp1.1/essai-1.0-mvc.dll
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x6a811d64d700 (LWP 12562)]
[New Thread 0x6a811cd53700 (LWP 12563)]
[New Thread 0x6a8117f05700 (LWP 12564)]
[New Thread 0x6a8117651700 (LWP 12565)]
[New Thread 0x6a8116d97700 (LWP 12566)]
[New Thread 0x6a8115b54700 (LWP 12567)]
[New Thread 0x6a811532b700 (LWP 12568)]
[New Thread 0x6a8114b11700 (LWP 12569)]
[New Thread 0x6a7cba20b700 (LWP 12570)]
[New Thread 0x6a7cb9961700 (LWP 12571)]

Breakpoint 1, SystemDomain::LoadBaseSystemClasses (this=0x6a811f6f4980 <g_pSystemDomainMemory>) at /home/ed/coreclr/src/vm/appdomain.cpp:2820
2820            m_pSystemFile = PEAssembly::OpenSystem(NULL);
(gdb) c
Continuing.

Breakpoint 2, AppDomain::LoadDomainFile (this=0x131efa0, pLock=0x131a430, targetLevel=FILE_LOAD_POST_LOADLIBRARY) at /home/ed/coreclr/src/vm/appdomain.cpp:6976
6976        if(!CanLoadCode())
(gdb) c
Continuing.

Breakpoint 3, Assembly::Create (pDomain=0x6a811f6f3da0 <g_pSharedDomainMemory>, pFile=0x12e8a40, debuggerFlags=DACF_ALLOW_JIT_OPTS, fIsCollectible=0, pamTracker=0x77e22c689c40,
    pLoaderAllocator=0x0) at /home/ed/coreclr/src/vm/assembly.cpp:575
575         NewHolder<Assembly> pAssembly (new Assembly(pDomain, pFile, debuggerFlags, fIsCollectible));
(gdb) c
Continuing.

Breakpoint 4, Module::Create (pAssembly=0x131a530, moduleRef=637534208, file=0x12e8a40, pamTracker=0x77e22c689c40) at /home/ed/coreclr/src/vm/ceeload.cpp:1463
1463        if (file->HasNativeImage())
(gdb) c
Continuing.

Breakpoint 5, Module::LoadHelperTable (this=0x6a7cb8846000) at /home/ed/coreclr/src/vm/jitinterface.cpp:12757
12757       BYTE * table = (BYTE *) GetNativeImage()->GetNativeHelperTable(&tableSize);
(gdb) c
Continuing.

Breakpoint 6, LoaderCodeHeap::CreateCodeHeap (pInfo=0x77e22c689700, pJitMetaHeap=0x6a811f6f5498 <g_pSystemDomainMemory+2840>) at /home/ed/coreclr/src/vm/codeman.cpp:2070
2070        size_t   reserveSize        = pInfo->getReserveSize();
(gdb) n
2076        if (reserveSize != (DWORD) reserveSize)
(gdb)
2071        size_t   initialRequestSize = pInfo->getRequestSize();
(gdb)
2072        const BYTE *   loAddr       = pInfo->m_loAddr;
(gdb)
2073        const BYTE *   hiAddr       = pInfo->m_hiAddr;
(gdb)
2091        NewHolder<LoaderCodeHeap> pCodeHeap(new LoaderCodeHeap(pPrivatePCLBytes));
(gdb)
2094        DWORD dwSizeAcquiredFromInitialBlock = 0;
(gdb)
2097        pBaseAddr = (BYTE *)pInfo->m_pAllocator->GetCodeHeapInitialBlock(loAddr, hiAddr, (DWORD)initialRequestSize, &dwSizeAcquiredFromInitialBlock);
(gdb)
2098        if (pBaseAddr != NULL)
(gdb)
2104            if (loAddr != NULL || hiAddr != NULL)
(gdb)
2106                pBaseAddr = ClrVirtualAllocWithinRange(loAddr, hiAddr,
(gdb)

2109                if (!pBaseAddr)
(gdb)
2112                    pBaseAddr = ExecutionManager::GetEEJitManager()->AllocateFromEmergencyJumpStubReserve(loAddr, hiAddr, &reserveSize);
(gdb)
2114                        ThrowOutOfMemoryWithinRange();
(gdb)
Failed to initialize CoreCLR, HRESULT: 0x80004005
[Thread 0x6a8114b11700 (LWP 12569) exited]
[Thread 0x6a7cb9961700 (LWP 12571) exited]
[Thread 0x6a811532b700 (LWP 12568) exited]
[Thread 0x6a8115b54700 (LWP 12567) exited]
[Thread 0x6a8116d97700 (LWP 12566) exited]
[Thread 0x6a8117651700 (LWP 12565) exited]
[Thread 0x6a8117f05700 (LWP 12564) exited]
[Thread 0x6a811cd53700 (LWP 12563) exited]
[Thread 0x6a811d64d700 (LWP 12562) exited]
[Thread 0x6a8120e91740 (LWP 12561) exited]
[Inferior 1 (process 12561) exited with code 0211]
ghost commented 7 years ago

And here is an updated strace.txt now that I use CoreCLR 1.1.1. We can see that just before the exception, there is a succession of mmap and munmap calls for a total of about 2.4 GB of memory, after open("/opt/dotnet/shared/Microsoft.NETCore.App/1.1.1/System.Private.CoreLib.ni.dll", O_RDONLY).

When I set ServerGarbageCollection to false, everything works fine, the application starts. We can see in strace-ServerGC-false.txt that in this case, this succession of mmap and munmap calls after open("/opt/dotnet/shared/Microsoft.NETCore.App/1.1.1/System.Private.CoreLib.ni.dll", O_RDONLY) does not happen.

ghost commented 7 years ago

Finally, here is strace-successful.txt, for the situation when ServerGarbageCollection is NOT set to false but the application starts successfully. This succession of mmap and munmap calls does not happen.

This situation happens sometimes (I would say ~20% of the launchs), but most of the time (~80%) there is the Failed to initialize CoreCLR, HRESULT: 0x80004005 error. Note: I do not recompile the application between the launchs.

swgillespie commented 7 years ago

@sebok Thank you so much, this is extremely helpful!

@gkhanna79

So, perhaps the huge amount of maps are coming from here: https://github.com/dotnet/coreclr/blob/master/src/utilcode/util.cpp#L644 , when we are scanning for a free chunk of VM to allocate these emergency jump stubs? I'm not super familiar with the PAL or JIT to know if VirtualQuery results in calls to mmap, but if we are failing to find a free space of VM I suppose it could explain this behavior.

I can imagine why Server GC would trigger this behavior - we commit an ephemeral segment and LOH segment for every heap that we create. I can't imagine what has to transpire to run out of VM on a 64-bit process, though...

Maoni0 commented 7 years ago

@janvorli could you please take a look?

janvorli commented 7 years ago

Based on the strace I am almost sure it is due to GRSEC/PAX enabled - see dotnet/cli#8480. @sebok do you know if you have GRSEC/PAX enabled on your Linux? As a workaround, disabling for dotnet using paxctl -r dotnet where dotnet should be specified by its full path should work. The issue is that the address passed to mmap is ignored with GRSEC/PAX enabled and that prevents us to find a block of free virtual memory within a specific range.

gkhanna79 commented 7 years ago

CC @kouvel

ghost commented 7 years ago

@janvorli Indeed, I assume GRSEC/PAX is enabled:

I'll take a look at the workaround mentionned in https://github.com/dotnet/coreclr/issues/8480 and keep you updated.

ghost commented 7 years ago

So, paxctl -r /opt/dotnet/dotnet returns file /opt/dotnet/dotnet does not have a PT_PAX_FLAGS program header, try conversion. Same result on libcoreclr.so.

And I don't have any file named corerun or coreconsole.

Am I missing something?

janvorli commented 7 years ago

@sebok I've not tried to use the paxctl myself before. Looking at its options, there is an -C option that should do the conversion the -r option complains about. Can you please give it a try? I am not sure if you can pass both options at once or if you need to convert first and then run the -r.

ghost commented 7 years ago

By the way, I confirm that this kernel is configured with PAX and RANDMMAP enabled:

# zgrep -i pax /proc/config.gz
CONFIG_PAX_USERCOPY_SLABS=y
# PaX
CONFIG_PAX=y
# PaX Control
CONFIG_PAX_SOFTMODE=y
CONFIG_PAX_EI_PAX=y
CONFIG_PAX_PT_PAX_FLAGS=y
CONFIG_PAX_XATTR_PAX_FLAGS=y
# CONFIG_PAX_NO_ACL_FLAGS is not set
CONFIG_PAX_HAVE_ACL_FLAGS=y
# CONFIG_PAX_HOOK_ACL_FLAGS is not set
# CONFIG_PAX_NOEXEC is not set
# CONFIG_PAX_KERNEXEC is not set
CONFIG_PAX_KERNEXEC_PLUGIN_METHOD=""
CONFIG_PAX_ASLR=y
# CONFIG_PAX_RANDKSTACK is not set
CONFIG_PAX_RANDUSTACK=y
CONFIG_PAX_RANDMMAP=y
# CONFIG_PAX_MEMORY_SANITIZE is not set
# CONFIG_PAX_MEMORY_STACKLEAK is not set
CONFIG_PAX_MEMORY_STRUCTLEAK=y
# CONFIG_PAX_MEMORY_UDEREF is not set
# CONFIG_PAX_REFCOUNT is not set
# CONFIG_PAX_USERCOPY is not set
CONFIG_PAX_SIZE_OVERFLOW=y
# CONFIG_PAX_LATENT_ENTROPY is not set
ghost commented 7 years ago

paxctl -c -r /opt/dotnet/dotnet worked and... SOLVED THE ISSUE! :+1:

It would be great to make it unecessary as you suggest in https://github.com/dotnet/coreclr/issues/8480, or at least / in the meanwhile, provide a more detailed error message to guide users to this workaround.

Alternative: check if PAX/RANDMMAP seems enabled at first launch of dotnet CLI tools (only once, when it populates nuget cache), and warn the user if necessary.

janvorli commented 7 years ago

@sebok, thank you for the confirmation!

swgillespie commented 7 years ago

@janvorli Thank you! 😄

ghost commented 7 years ago

@janvorli, out of curiosity, how does this relate to server garbage collection? The issue never happens when server GC is disabled. And why, when server GC is enabled, does it sometimes happen and other times not?

janvorli commented 7 years ago

@sebok The reason why it sometimes works and sometimes it doesn't with the server GC enabled is that it depends on where System.Private.CoreLib.ni.dll and libcoreclr.so are located in the memory. If they end up being placed close enough so that the code in System.Private.CoreLib.ni.dll can make a short call when calling the native runtime, then we don't need to allocate jump stubs within 2GB distance from the jitted code and everything is fine. So it is just a matter of pure luck where the mmap ends up placing the requested memory range.

If you look at the failure case log, the System.Private.CoreLib.ni.dll is mapped at 0x73dffb6e4000, while the libcoreclr.so is mapped at 0x73e46d99b000, which means they are about 17GB from each other. So in the middle of loading the System.Private.CoreLib.ni.dll, you can see that we've started to scan 2GB of memory right below the System.Private.CoreLib.ni.dll from 0x73df7b870000 to 0x73e00a2b0000 in attempt to find a free slot for the jump stubs. But due to the RANDMMAP, mmap behaves as if the memory was occupied (returning a different address than the one requested) and so we don't find a slot and that's our stop station.

If you look at the success log with server gc still on, you can see that System.Private.CoreLib.ni.dll is mapped at 0x72d3a8b93000 and libcoreclr.so is mapped at 0x72d3ae563000, which is just 96MB from each other and so no jump stubs are needed.

And finally, why it "always" works with server GC disabled? Well, I think it could also sometimes fail, but rarely. The reason is that when server GC is disabled, GC reserves much smaller amount of virtual memory and so it is much more likely that the System.Private.CoreLib.ni.dll and libcoreclr.so will be close enough to each other (even if the virtual memory range reserved for GC is in between them).

ghost commented 7 years ago

@janvorli, thank you it's very clear! For others who might ask why are these jump stubs needed, I found this article quite useful: