Closed tmds closed 4 years ago
cc @swgillespie
@emanuelbalea, I was wondering whether you have really tried runtime 2.0.3. SDK (CLI) 2.0.3 is a completely unrelated thing that still contains runtime 2.0.0. Could you please share where you got the coreclr bits you were using?
@janvorli , @emanuelbalea said he was using the latest docker nightly (https://github.com/dotnet/coreclr/issues/13489#issuecomment-343390478). And indeed, this one does use the 2.0.0 runtime:
$ docker run -ti microsoft/dotnet-nightly ls /usr/share/dotnet/shared/Microsoft.NETCore.App
2.0.0
@emanuelbalea is this the image tag you are using? I'm not sure if there is an image that has a patched runtime. Perhaps you can try one of the 2.1 tags? Or create an image yourself.
@janvorli and @tmds, you are right, the docker image might not be 2.0.3... Sorry about that, I got confused by the numbering scheme and thought it was back in line with the CLR version. I will post the tag number as soon as I get to work, and I will try 2.1; if that fails, I will create my own image. Thanks for all the help, I will update in a couple of hours.
Update: the nightly docker images are on 2.0.0, even the preview ones. I made myself a new image based on those and will update in a few hours.
Using the latest nightly of 2.0.4, it works as expected inside my custom docker image and the EC2 container service in my dev environment. @tmds feel free to close this. Thanks for the help :)
@emanuelbalea no problem. It was a good verification to see the 2.0.0 runtime go OOM and 2.0.4 no longer going OOM.
@janvorli I'm trying to verify docker containers won't crash due to OOM conditions. I'm doing this as follows:
Program.cs
using System;
using System.Collections.Generic;

namespace oom
{
    class Program
    {
        static void Main(string[] args)
        {
            var list = new List<byte[]>();
            int i = 0;
            while (true)
            {
                try
                {
                    System.Console.WriteLine(i++);
                    var buffer = CreateBuffer();
                    list.Add(buffer);
                }
                catch (Exception e)
                {
                    System.Console.WriteLine(e.Message);
                    return;
                }
            }
        }

        static byte[] CreateBuffer()
        {
            var buffer = new byte[1024 * 1024]; // 1 MB
            for (int j = 0; j < buffer.Length; j++)
            {
                buffer[j] = 1;
            }
            return buffer;
        }
    }
}
oom.csproj
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>netcoreapp2.0</TargetFramework>
  </PropertyGroup>
</Project>
Dockerfile
FROM microsoft/dotnet:latest
COPY . /root
WORKDIR /root
RUN dotnet build
CMD dotnet bin/Debug/netcoreapp2.0/oom.dll
$ docker build -t oom .
$ docker run --rm --memory 10m oom
This process gets killed (after allocating some 10ish buffers) due to out of memory, as shown in dmesg:
[ 3222.697298] Memory cgroup out of memory: Kill process 20943 (dotnet) score 864 or sacrifice child
[ 3222.697322] Killed process 20943 (dotnet) total-vm:2614476kB, anon-rss:7260kB, file-rss:0kB, shmem-rss:0kB
Shouldn't it throw OutOfMemoryException instead?
The full dmesg (on kill):
[ 3690.531593] dotnet invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0
[ 3690.531595] dotnet cpuset=docker-fddf2d644b90997b98de0d05a84a733e4dd0bdd78a3d8f53ade28b2ddf0e601b.scope mems_allowed=0
[ 3690.531600] CPU: 12 PID: 21575 Comm: dotnet Not tainted 4.11.8-200.fc25.x86_64 #1
[ 3690.531600] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./AB350M, BIOS P1.20 03/06/2017
[ 3690.531601] Call Trace:
[ 3690.531606] dump_stack+0x63/0x86
[ 3690.531608] dump_header+0x97/0x213
[ 3690.531610] ? mem_cgroup_scan_tasks+0xc4/0xf0
[ 3690.531612] oom_kill_process+0x1ff/0x3d0
[ 3690.531614] out_of_memory+0x140/0x4e0
[ 3690.531615] mem_cgroup_out_of_memory+0x4b/0x80
[ 3690.531616] mem_cgroup_oom_synchronize+0x329/0x340
[ 3690.531618] ? get_mem_cgroup_from_mm+0xa0/0xa0
[ 3690.531619] pagefault_out_of_memory+0x36/0x80
[ 3690.531621] mm_fault_error+0x8f/0x190
[ 3690.531622] __do_page_fault+0x4ad/0x4e0
[ 3690.531623] do_page_fault+0x30/0x80
[ 3690.531624] ? do_syscall_64+0x16d/0x180
[ 3690.531626] page_fault+0x28/0x30
[ 3690.531628] RIP: 0033:0x7fa8547306e0
[ 3690.531628] RSP: 002b:00007ffd08b0c0e8 EFLAGS: 00010246
[ 3690.531629] RAX: 00007fa854dc4190 RBX: 00007fa854d9cfc8 RCX: 0000000000000001
[ 3690.531629] RDX: 00007fa854da7c01 RSI: 00007fa8565b3050 RDI: 0000000000000003
[ 3690.531630] RBP: 00007ffd08b0c100 R08: 00007fa854da9140 R09: 00007fa7b3ffe000
[ 3690.531630] R10: 0000000000000000 R11: 0000000000000206 R12: 00007fa854d9dce0
[ 3690.531631] R13: 0000000000000000 R14: 0000000001465e00 R15: 0000000000000000
[ 3690.531632] Task in /system.slice/docker-fddf2d644b90997b98de0d05a84a733e4dd0bdd78a3d8f53ade28b2ddf0e601b.scope killed as a result of limit of /system.slice/docker-fddf2d644b90997b98de0d05a84a733e4dd0bdd78a3d8f53ade28b2ddf0e601b.scope
[ 3690.531635] memory: usage 10124kB, limit 10240kB, failcnt 107700
[ 3690.531636] memory+swap: usage 20480kB, limit 20480kB, failcnt 110227
[ 3690.531636] kmem: usage 2660kB, limit 9007199254740988kB, failcnt 0
[ 3690.531637] Memory cgroup stats for /system.slice/docker-fddf2d644b90997b98de0d05a84a733e4dd0bdd78a3d8f53ade28b2ddf0e601b.scope: cache:72KB rss:7392KB rss_huge:0KB mapped_file:12KB dirty:0KB writeback:0KB swap:10356KB inactive_anon:3736KB active_anon:3660KB inactive_file:0KB active_file:0KB unevictable:4KB
[ 3690.531643] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[ 3690.531755] [21534] 0 21534 1072 0 8 3 25 0 sh
[ 3690.531756] [21575] 0 21575 653619 1574 73 5 2774 0 dotnet
[ 3690.531757] Memory cgroup out of memory: Kill process 21575 (dotnet) score 864 or sacrifice child
[ 3690.531767] Killed process 21575 (dotnet) total-vm:2614476kB, anon-rss:6296kB, file-rss:0kB, shmem-rss:0kB
[ 3690.532948] oom_reaper: reaped process 21575 (dotnet), now anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[ 3690.605435] docker0: port 1(vethfa04dda) entered disabled state
[ 3690.605512] vetha1f40da: renamed from eth0
[ 3690.658201] docker0: port 1(vethfa04dda) entered disabled state
[ 3690.660479] device vethfa04dda left promiscuous mode
[ 3690.660492] docker0: port 1(vethfa04dda) entered disabled state
[ 3690.753054] XFS (dm-4): Unmounting Filesystem
@tmds The GC will generally only throw an OutOfMemoryException if native allocations fail. This is troublesome on Linux because of VM overcommit; while it doesn't affect committing heap segments, the GC does occasionally use operator new to resize some of its own data structures (and operator new is used pervasively throughout the rest of the runtime), and those allocations often fail on Windows when running low on physical memory. With overcommit, the allocations succeed but page fault like this on first access, which gets us killed by the OOM killer, and we have no chance to throw an OutOfMemoryException.
I looked into this for a while (since some GC functional tests were getting repeatedly killed by the OOM killer instead of failing in a predictable way) and I didn't find a good solution. Disabling the OOM killer entirely is bad because the kernel will simply refuse to schedule the memory-heavy process, so we'll never get the processor time to actually do a GC.
This problem isn't unique to .NET; JVMs also don't always get a chance to throw java.lang.OutOfMemoryErrors before they get reaped by the kernel. I don't think that managed environments like Java or .NET have the ability to guarantee that an OOM exception will be thrown before the OOM killer kicks in.
@swgillespie thank you, that is very interesting to know. Do you know a test I can use to validate the runtime is taking into account the docker memory limit?
I'm not sure how tracing in containers works (@brianrob would know), but if you can collect a trace and then view it with PerfView (https://github.com/Microsoft/perfview), you should see the GC aggressively compacting the heap as it approaches the Docker memory limit. You could also use a debugger and set a breakpoint on gc_heap::get_memory_info to see if the GC is returning the memory limits imposed by our current cgroup.
@tmds, LTTng-UST should work inside of a container with the default seccomp profile so you should be able to collect a trace of the GC behavior.
Probably the easiest thing to do is to follow the instructions at https://github.com/dotnet/coreclr/blob/master/Documentation/project-docs/linux-performance-tracing.md#collecting-in-a-docker-container, which should make it possible to use the standard non-container workflow once you have a privileged shell (assuming you can get one).
We're running into OOM with 2.0.0. We just upgraded to 2.0.3 and will see if that fixes the problem.
@emanuelbalea / @tmds, you mentioned the nightly 2.0.4 docker image will fix this problem. Where can I find it? The best I could find was microsoft/aspnetcore-nightly, but that only contains 2.0.1. microsoft/aspnetcore already contains 2.0.3. Thank you for any hints or more details on what specific docker tag you used.
@thoean since this issue was created, 2.0.3 has been released, so the official images at https://hub.docker.com/r/microsoft/dotnet/ contain the fix.
Thanks @tmds. Upgrading to the 2.0.3 docker image seems to have fixed the problem on our side. Thank you.
Should this issue be closed?
@swgillespie I wonder, are there minimum size requirements for the runtime to establish heaps? It would be useful to have some guidelines. For example: if I create a docker container with server GC and 4 logical CPUs, how much memory should I allocate to it at a minimum? What happens when the runtime doesn't find space to create/enlarge the heap? Does this cause an OOM kill? Or does the application exit with some sort of error?
@tmds The GC will commit the ephemeral segment on startup, so for server GC with four logical CPUs you can figure that you'll have at minimum four ephemeral segments resident. The size of this varies a little based on processor topology (in particular, L1 cache size), but the defaults are (from https://docs.microsoft.com/en-us/dotnet/standard/garbage-collection/fundamentals):
| | 32-bit | 64-bit |
|---|---|---|
| Workstation GC | 16 MB | 256 MB |
| Server GC | 64 MB | 4 GB |
| Server GC with > 4 logical CPUs | 32 MB | 2 GB |
| Server GC with > 8 logical CPUs | 16 MB | 1 GB |
What happens when the runtime doesn't find space to create/enlarge the heap? Does this cause an OOM kill? Or does the application exit with some sort of error?
If we fail to commit a heap segment (or part of a heap segment), we'll throw an OutOfMemoryException. If we successfully commit a heap segment but fail to bring a faulted page into residence due to OOM conditions on the machine/container, we run the risk of getting killed by the OOM killer.
@swgillespie Thanks for taking time to explain these things.
When I start a 100MB container on a 64-bit system, it doesn't crash. So the runtime is not actually trying to ensure those amounts of memory are available.
I've been looking a bit in gc.cpp; one function caught my attention because it is using GetPhysicalMemoryLimit:
// Get the max gen0 heap size, making sure it conforms.
size_t GCHeap::GetValidGen0MaxSize(size_t seg_size)
{
    size_t gen0size = static_cast<size_t>(GCConfig::GetGen0Size());

    if ((gen0size == 0) || !g_theGCHeap->IsValidGen0MaxSize(gen0size))
    {
#ifdef SERVER_GC
        // performance data seems to indicate halving the size results
        // in optimal perf. Ask for adjusted gen0 size.
        gen0size = max(GCToOSInterface::GetLargestOnDieCacheSize(FALSE)/GCToOSInterface::GetLogicalCpuCount(),(256*1024));

        // if gen0 size is too large given the available memory, reduce it.
        // Get true cache size, as we don't want to reduce below this.
        size_t trueSize = max(GCToOSInterface::GetLargestOnDieCacheSize(TRUE)/GCToOSInterface::GetLogicalCpuCount(),(256*1024));
        dprintf (2, ("cache: %Id-%Id, cpu: %Id",
            GCToOSInterface::GetLargestOnDieCacheSize(FALSE),
            GCToOSInterface::GetLargestOnDieCacheSize(TRUE),
            GCToOSInterface::GetLogicalCpuCount()));

        // if the total min GC across heaps will exceed 1/6th of available memory,
        // then reduce the min GC size until it either fits or has been reduced to cache size.
        while ((gen0size * gc_heap::n_heaps) > GCToOSInterface::GetPhysicalMemoryLimit() / 6)
        {
            gen0size = gen0size / 2;
            if (gen0size <= trueSize)
            {
                gen0size = trueSize;
                break;
            }
        }
#else //SERVER_GC
        gen0size = max((4*GCToOSInterface::GetLargestOnDieCacheSize(TRUE)/5),(256*1024));
#endif //SERVER_GC
    }

    // Generation 0 must never be more than 1/2 the segment size.
    if (gen0size >= (seg_size / 2))
        gen0size = seg_size / 2;

    return (gen0size);
}
There are two things I find interesting here:
- GetPhysicalMemoryLimit is only taken into account for server GC, not for workstation GC.
- GetLargestOnDieCacheSize may out-weigh GetPhysicalMemoryLimit. This can perhaps happen for a tiny container (1 CPU) on a large machine (large CPU cache).
@tmds The runtime reserves (in the virtual memory sense) that amount of memory on startup. Linux is happy to hand out 4GB of virtual address space on startup even if your container has a 100MB resident memory limit; it'll only complain when your resident set starts bumping up against 100MB.
It is really interesting to me that workstation GC doesn't ever look at GetPhysicalMemoryLimit, though; that sounds like something we'd like to do. @Maoni0 do you have any thoughts on this? Gen0 won't be larger than 256k in this case, but I do think we'd like to avoid situations where the ephemeral generations don't fit at all within our memory limit.
GetLargestOnDieCacheSize may out-weigh GetPhysicalMemoryLimit. This can perhaps happen for a tiny container (1 CPU) on a large machine (large CPU Cache).
@swgillespie thoughts on this?
@tmds That's what I'm saying here:
Gen0 won't be larger than 256k in this case, but I do think we'd like to avoid situations where the ephemeral generations don't fit at all within our memory limit.
What are typical values of GetLargestOnDieCacheSize(TRUE/FALSE) on a higher-end server?
What are typical values of GetLargestOnDieCacheSize(TRUE/FALSE) on a higher-end server?
e.g. if this is 45MB, then a container with 1 CPU (GetLogicalCpuCount) and 100MB (GetPhysicalMemoryLimit) will have Workstation gen0 of 36MB and Server gen0 of 22.5MB.
@tmds I just tried it on the beefiest machine I could find and got 30MB for GetLargestOnDieCacheSize(TRUE). Also, I'm realizing that I typo'd 256k above; I'm pretty sure the above units are megabytes, so a max of 256MB (which jibes with the table in one of my earlier comments).
I'm pretty sure the above units are megabytes
I'm not sure, I think GetPhysicalMemoryLimit is a value in bytes.
which jives with the table in one of my earlier comments
I don't think they are related. The table values are in the INITIAL_ALLOC and LHEAP_ALLOC defines which get adjusted for processor count in get_valid_segment_size.
Yeah, I don't know for sure. At any rate, I do think it's weird to not look at the physical memory limit at all when using workstation GC; Maoni probably has some thoughts too on that.
@Maoni0 can you please take a look at this: https://github.com/dotnet/coreclr/issues/14991#issuecomment-348428003?
@swgillespie can you please ping @Maoni0 to take a look at this issue?
@tmds Maoni is currently out of the office; she'll be back in about a week.
@Maoni0, can you please take a look at https://github.com/dotnet/coreclr/issues/14991#issuecomment-348428003?
@tmds Sorry, I was out for a long time at the end of last year and missed some conversations.
I believe when people added the check for the physical memory limit, they were tuning for server workloads, and server machines generally have much larger caches than typical client machines; workstation GC also generally did GCs more frequently, so it was sufficient to add this only for server GC.
I don't see any reason why we shouldn't check the physical memory limit for workstation GC if we have configurations that warrant it. Feel free to propose a change.
PR https://github.com/dotnet/coreclr/pull/15975 makes some changes based on https://github.com/dotnet/coreclr/issues/14991#issuecomment-348428003. Closing: there are no reported OOMs for 2.0.3 and higher.
As reported by @emanuelbalea here: https://github.com/dotnet/coreclr/issues/13489#issuecomment-343416765