dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

.NET 8 crash with "Fatal error. Failed to create RW mapping for RX memory" #97316

Open loop-evgeny opened 7 months ago

loop-evgeny commented 7 months ago

Description

We have many instances of our ASP.NET Core application for different customers, each running as a systemd service. One specific instance has crashed twice with "Fatal error. Failed to create RW mapping for RX memory".

The last time was on 2023-Nov-30, so this doesn't happen often. It may be because this instance of the application is particularly large, using ~300 GB RAM while loading a lot of data (once per day) and ~50 GB at other times.

Reproduction Steps

Not reproducible. Happened 2 times so far.

Expected behavior

No crash

Actual behavior

Crash with the systemd journal containing:

Fatal error. Failed to create RW mapping for RX memory
   at DynamicClass.InvokeStub_ItemCollection`1..ctor(System.Object, System.Span`1<System.Object>)
   at System.Reflection.MethodBaseInvoker.InvokeWithOneArg(System.Object, System.Reflection.BindingFlags, System.Reflection.Binder, System.Object[], System.Globalization.CultureInfo)
   at System.RuntimeType.CreateInstanceImpl(System.Reflection.BindingFlags, System.Reflection.Binder, System.Object[], System.Globalization.CultureInfo)
... (our code here - can provide full stack privately if needed) ...
   at Microsoft.AspNetCore.Builder.UseMiddlewareExtensions+ReflectionMiddlewareBinder+<>c__DisplayClass6_0.<CreateMiddleware>b__0(Microsoft.AspNetCore.Http.HttpContext)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol+<ProcessRequests>d__238`1[[System.__Canon, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].MoveNext()
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(System.Threading.Thread, System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol+<ProcessRequests>d__238`1[[System.__Canon, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], Microsoft.AspNetCore.Server.Kestrel.Core, Version=8.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60]].MoveNext(System.Threading.Thread)
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart()
Main process exited, code=killed, status=6/ABRT

(I don't have the stack trace from the first crash, unfortunately.)

Regression?

No response

Known Workarounds

No response

Configuration

.NET 8.0.0 x64, self-contained application
Ubuntu 22.04.3 this (second) time, Ubuntu 18.04.6 the first time
Running as a systemd service
The machine has 1 TB RAM, but only ~36% of it was used at the time of the crash.

Other information

The only reference I can find to this error is in https://github.com/dotnet/runtime/issues/80580. The last comment there includes code to reproduce it by creating many dynamic assemblies, but that code runs successfully for me on the same machine (as well as on other machines), on both .NET 8 and .NET 6. We do create some dynamic assemblies via CSharpCompilation.Emit(), but I don't think we create thousands of them (and even if we did, I would not expect .NET to crash like that).

loop-evgeny commented 7 months ago

I know this isn't much info to go on. I've installed systemd-coredump on the server now to try to get a crash dump next time, though I don't know if it will cope with a 300 GB dump!

loop-evgeny commented 7 months ago

@janvorli

janvorli commented 7 months ago

@loop-evgeny this error occurs when mmap fails to create a read-write mapping for an existing read-execute mapping, which means the process is either out of memory or has hit the maximum allowed number of memory mappings. Since you mentioned that you are not running in a container (like Docker) and that your machine still had plenty of available memory, my guess is that the problem is likely that you are exceeding the memory mappings limit. The limit can be raised using the vm.max_map_count setting. You can check your current value with cat /proc/sys/vm/max_map_count.

loop-evgeny commented 7 months ago

@janvorli I see, thanks for the explanation. Correct, we're not running in a container.

This is a rather cryptic error message, so it would be nice to include in the message what you wrote here, something like "Failed to create RW mapping for RX memory. This may be caused by running out of memory or out of memory mappings - check the vm.max_map_count setting on Linux or (whatever) on Windows". But it seems we're only the second ones to run into it.

We haven't changed max_map_count, so it's at the default value of 65530. But is there a downside to increasing this? Why is there even a limit?

loop-evgeny commented 7 months ago

I checked the /proc/PID/maps file for the process and it currently has 19408 lines, which is much higher than for other instances of our application, even large ones. That's under normal usage, so I think your hunch is right and during the daily data loading it may well go over 65530, but I'll add monitoring to check that.
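
For reference, here is a minimal sketch of the kind of check I'm adding (assuming a Linux host; the names and the threshold are illustrative):

    // Illustrative only: count this process's memory mappings and compare with the
    // system-wide limit, so we can log a warning before vm.max_map_count is reached.
    using System;
    using System.IO;
    using System.Linq;

    static class MapCountMonitor
    {
        public static (int Used, int Limit) Sample()
        {
            int used = File.ReadLines("/proc/self/maps").Count();
            int limit = int.Parse(File.ReadAllText("/proc/sys/vm/max_map_count").Trim());
            return (used, limit);
        }
    }

    // Called periodically from a background task, e.g.:
    //   var (used, limit) = MapCountMonitor.Sample();
    //   if (used > limit * 0.8) Console.WriteLine($"Memory mappings at {used}/{limit}");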

janvorli commented 7 months ago

But is there a downside to increasing this?

There is no downside. The setting has no effect on applications that use fewer mappings and enables proper execution of applications that use more.

loop-evgeny commented 7 months ago

I've been monitoring the count of maps for that process for a week now and, while it hasn't gone over 65K yet, it is steadily increasing. RAM usage goes up during the daily data loading (smaller than usual last week), then down again, but the number of memory map areas does not go down significantly. It started around 19K a week ago and is now at 32K. Can there be a "leak" in that somehow - without an obvious memory leak?

loop-evgeny commented 7 months ago

We finally had a few days where memory usage went > 300 GB and the number of maps went over 80K. It then went down together with RAM usage, to ~34K. So it seems like there is no "leak" and increasing vm.max_map_count fixed this for us.

It would be good if the error message explained what the likely problem is, though. There is no way the average developer troubleshooting a crash will know that "Failed to create RW mapping for RX memory" means "Either you're out of memory or you need to increase vm.max_map_count".

janvorli commented 7 months ago

It would be good if the error message explained what the likely problem is, though. There is no way the average developer troubleshooting a crash will know that "Failed to create RW mapping for RX memory" means "Either you're out of memory or you need to increase vm.max_map_count".

@loop-evgeny thank you for the suggestion, that makes sense. When I added that error message, I didn't realize that the max map count could also be causing the issue. I'll update the message along the lines of what you've suggested.

baal2000 commented 4 months ago

@mangod9 @janvorli

We do think there is a change in how .NET 8 allocates/uses process memory. This could be "by design", but in that case the change in behavior was a breaking one that required a notice.

Reproducing should be fairly easy: run two processes (one on .NET 7 and another on .NET 8) in a stress test environment, allocating managed and unmanaged memory and recording /proc/self/maps (VAS) counts, then compare whether there is a significant difference. Has your team done that?

Our environment:

To illustrate the crash pattern after the .NET 8 upgrade and before the vm.max_map_count increase: image

We will provide more information as needed and are also going to do a comparison stress test run.

janvorli commented 4 months ago

@baal2000 the growth in the number of memory mappings between .NET 7 and 8 is significant only if there is a large amount of managed code being generated on the fly. We have never hit this in our internal testing, as it requires quite specific behavior of the application. I agree that we should document somewhere that when you experience an issue like this, the vm.max_map_count needs to be updated. The value can be set to any large value; enlarging it doesn't result in any additional growth in memory consumption other than what is related to the needed growth in the number of mappings required by .NET. So you don't really need to figure out some optimal value; you can, for example, set it to 100 times the default and be fine.

baal2000 commented 4 months ago

@janvorli

We have never hit this in our internal testing

Has the team profiled /proc/self/maps counts? We do not necessarily need to "hit" an issue.

requires quite specific behavior of the application

Could you elaborate and point to a specific area inside the framework that now allocates differently than under the old framework?

janvorli commented 4 months ago

Has the team profiled /proc/self/maps counts? We do not necessarily need to "hit" an issue.

We have not, but a high count is not necessarily a problem per se, so we had no reason to do that. And we were not aware of the relatively low default limit value, which would probably have made us consider this a problem, until people reported it in this issue.

Could you elaborate and point to a specific area inside the framework that now allocates differently than under the old framework?

The write xor execute feature, which prevents code memory from being writeable and executable at the same time, is what caused the difference in the memory mapping pattern. There are several kinds of small stubs that are created for methods that are called by managed code but have not been compiled / resolved yet, and for call counting, which lets us dynamically re-jit methods on hot paths with more optimizations (this is called tiered compilation). The memory for these stubs is allocated as pairs of blocks of memory, one read-execute block for the code of the stubs and one read-write block for the writeable data of the stubs. This is what causes the large number of mappings when there are a lot of methods, because these blocks are effectively interleaved in memory, so each of them requires a separate memory mapping. These blocks are 16 kB long. So, for example, for FixupPrecode stubs, each stub is 24 bytes long, so stubs for 682 methods fit into one pair of blocks. The call counting stubs are 32 bytes long, so 512 method stubs fit into each block.
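
As a rough, hypothetical back-of-the-envelope illustration of those numbers (the method count below is made up, and this counts only the two stub kinds mentioned above):

    // Rough estimate only: mappings implied by the interleaved RX/RW stub blocks
    // described above (16 kB blocks; 24-byte FixupPrecode stubs, 32-byte call counting stubs).
    int methods = 100_000;                          // hypothetical number of jitted methods
    int fixupBlockPairs = (methods + 681) / 682;    // 682 FixupPrecode stubs per block pair
    int countingBlockPairs = (methods + 511) / 512; // 512 call counting stubs per block pair
    // Each pair consists of one RX block and one RW block, i.e. two separate mappings.
    int stubMappings = 2 * (fixupBlockPairs + countingBlockPairs); // ≈ 686 for 100k methods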

baal2000 commented 4 months ago

@janvorli Tiered compilation is not a new thing: is the behavior now different due to tiered PGO? I tried turning it off and there was no difference.

janvorli commented 4 months ago

@baal2000 the write xor execute feature can be turned off by setting the env var DOTNET_EnableWriteXorExecute=0. As for the tiered compilation, I was talking about the stubs. They were originally allocated from contiguous read-write-execute mapped memory, and those blocks tended to be coalesced in the virtual memory, thus using a much smaller number of mappings.

The counts are much higher for larger heap processes than for smaller even if they do the same kinds of processing.

Then it is probably not related to the write xor execute feature and the stubs I was talking about. Could you share the smaps of a process with a large number of mappings? That could shed some light on the problem.

baal2000 commented 4 months ago

@janvorli

Could you share a smap of a process with a large number of mappings

/proc/PID/smaps file?

janvorli commented 4 months ago

Yes, please. Please feel free to trim any filenames from it in case they are sensitive.

baal2000 commented 4 months ago

@janvorli

DOTNET_EnableWriteXorExecute=0 didn't make any difference, so it is not related to write xor execute.

On the other hand, as mentioned in my first message, due to a .NET 7 GC segfault issue we configured .NET 7 with COMPlus_GCName=libclrgc.so. I just tried the same for the .NET 8 process and the result was a lower, stable maps count: image

Update: We have provided /proc/PID/smaps files for both the regular .NET 8 process and the .NET 8 + .NET 6 GC process that match the count differences above.

janvorli commented 3 months ago

In the meanwhile, @baal2000 has shared with me some smaps / maps logs. I was surprised to see that there were many mappings that were adjacent in the virtual address space and had the same protection and flags, yet the kernel had not merged them. All of them were a multiple of 4MB in size. @baal2000 also provided logs from running with libclrgc.so set as the GC. Those didn't have this problem and the number of mappings was much smaller. It turns out that 4MB is the default size of a GC region. GC regions were newly introduced in .NET 8, and libclrgc.so is the same GC but with regions disabled.

I tried to repro the issue locally by mimicking how the GC reserves and commits virtual memory. It first maps a very large area of virtual memory with PROT_NONE protection and then, as it needs more memory, changes the protection of memory pages in this range to PROT_READ | PROT_WRITE. At first I was unable to repro it, but then @cshung reminded me that when server GC is used, each CPU in the process gets its own set of GC regions. In @baal2000's case there were 96 CPUs, so there were very many regions. When I modified my simple C repro app to run the protection changes on multiple threads, assigning each thread a 4MB memory range and changing the protection of the pages in it page by page AND touching the contents of those pages, I was able to reproduce the issue.

The problem seems to be that there is a race condition in the Linux kernel that prevents merging of adjacent mappings when two threads are changing their protection at the same time. In my repro, the merging at the boundary of the 4MB regions sometimes succeeded and sometimes it didn't, so I could see adjacent blocks of 4MB, 8MB, 12MB and sometimes even larger multiples of 4MB. I don't know if that's fixable in the Linux kernel without introducing performance issues due to the extra synchronization that would probably be needed there. I also don't see a reasonable way of working around it in the .NET GC without a performance hit.

In @baal2000's case, @cshung suggested trying to set the default GC region size to 16MB, which would reduce the number of adjacent mappings that can possibly get created by 4 times. And that really helped: @baal2000 has reported that the number of mappings no longer grew until hitting the OOM, but rather got capped slightly below 20000. I believe that just bumping the max number of mappings to 4 times that value (just a guess, it might need more) would also let it be stable. Even with more mappings, I would not expect an observable difference in app performance, as the Linux kernel likely uses an O(log(n)) algorithm to look up the mappings.
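
For illustration, here is a rough, hypothetical C# translation of that access pattern (the actual repro was a plain C program; the libc P/Invoke constants below assume x86-64 Linux):

    // Sketch of the access pattern described above, not the actual repro code:
    // reserve a large PROT_NONE area, then let each parallel worker flip its own 4 MB
    // slice to read-write page by page and touch it, mimicking per-CPU GC region commits.
    using System;
    using System.IO;
    using System.Linq;
    using System.Runtime.InteropServices;
    using System.Threading.Tasks;

    class MapMergeRepro
    {
        const int PROT_NONE = 0x0, PROT_READ = 0x1, PROT_WRITE = 0x2;
        const int MAP_PRIVATE = 0x02, MAP_ANONYMOUS = 0x20; // values for x86-64 Linux

        [DllImport("libc", SetLastError = true)]
        static extern IntPtr mmap(IntPtr addr, UIntPtr length, int prot, int flags, int fd, IntPtr offset);

        [DllImport("libc", SetLastError = true)]
        static extern int mprotect(IntPtr addr, UIntPtr len, int prot);

        static void Main()
        {
            const long regionSize = 4L * 1024 * 1024; // one "GC region" per worker
            const uint pageSize = 4096;
            int workers = Environment.ProcessorCount;

            // Reserve one contiguous PROT_NONE area covering all the regions.
            IntPtr baseAddr = mmap(IntPtr.Zero, (UIntPtr)(ulong)(regionSize * workers),
                                   PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, IntPtr.Zero);
            if (baseAddr == (IntPtr)(-1)) throw new InvalidOperationException("mmap failed");

            // Each worker "commits" its own region page by page and touches every page.
            Parallel.For(0, workers, t =>
            {
                for (long offset = 0; offset < regionSize; offset += pageSize)
                {
                    IntPtr page = (IntPtr)((long)baseAddr + t * regionSize + offset);
                    mprotect(page, (UIntPtr)pageSize, PROT_READ | PROT_WRITE);
                    Marshal.WriteByte(page, 1);
                }
            });

            // If the kernel merged the adjacent mappings, this count stays small;
            // with the race it grows towards one mapping per region.
            Console.WriteLine(File.ReadLines("/proc/self/maps").Count());
        }
    }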

janvorli commented 3 months ago

I forgot to mention how to set the GC region size to 16MB. Setting the DOTNET_GCRegionSize env variable to 1000000 does the trick: 1000000 is 16MB in hexadecimal. It can also be prefixed with 0x; it doesn't matter though, as the value is always interpreted as hexadecimal.
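
For reference (simple arithmetic, assuming the value is interpreted the same way in all cases): 400000 corresponds to the 4MB default region size, 800000 to 8MB, and 1000000 to 16MB.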

loop-evgeny commented 3 months ago

Is increasing GC region size worth doing for us as well? We haven't hit the issue since increasing vm.max_map_count.

janvorli commented 3 months ago

@loop-evgeny it would be best if you tried that with your app and based your decision on its real-world perf results. We don't have any data on performance differences with different GC region sizes. My expectation is that there would be no measurable difference; however, it is always better to measure your specific scenario to see what works best for you. And if you try it, please let us know about the results, as many people would benefit from that.

baal2000 commented 3 months ago

@loop-evgeny As @janvorli said, try it for yourself: this is a safe change to make. We have tried the next possible GC region size of 8MB (DOTNET_GCRegionSize=0x1000000) on a 600 GB / 96-core single-process VM and the RAM usage and GC heap patterns have changed: image

Update, 1 week later: It appears that the initial apparent difference was due to some initial conditions, but over a longer period the RAM usage pattern looks similar for the 8MB GC region vs. the default 4MB, with the maps count much lower at under 50K vs. 90K for the default region size:
image

cshung commented 3 months ago

At the end of the day, the Linux kernel is responsible for creating and merging these virtual memory mappings. As such, I did a study on the Linux kernel to see how it works.

As part of the study, I wrote up some notes on what I found. In short, I think there is a missed opportunity there, and the Linux kernel could do a better job in our scenario.

https://cshung.github.io/posts/linux-virtual-memory-mapping-debugging/

baal2000 commented 3 months ago

@cshung Great work on getting to the root cause and the blog.

Is there a plan to continue full support for non-region based (libclrgc.so) GC in future versions of .NET?

cshung commented 3 months ago

For the foreseeable future we still need segments for 32-bit platforms, so we will continue to release libclrgc.so. However, all the new innovations we do in the future will target regions, so you would miss out on all the new stuff.

baal2000 commented 2 months ago

@cshung @janvorli

After thinking a bit about your findings: I feel that we should not lay all the blame at Linux's feet. If the region-based GC becomes too aggressive in how the process commits/decommits memory regions from the underlying OS kernel, eventually it may hit a limit on how quickly the kernel can process these efficiently. In other words, we cannot push our own perf issues to the level below and expect not to receive some sort of back pressure.

This issue should be followed up by a more practical step of "what we could do" and not "what they (Linux) should". For instance, deciding on a different default region size (currently standing at 4MB), or some other parts of the GC implementation.

janvorli commented 2 months ago

This issue should be followed up by a more practical step of "what we could do" and not "what they (Linux) should". For instance, deciding on a different default region size (currently standing at 4MB), or some other parts of the GC implementation.

I agree that we should try to do as much as we can, since even if a fix got into the Linux kernel now, it would take time to become mainstream. And since there was a Linux kernel patch that tried to fix that problem in the past and it was rejected for reasonable reasons, I am a bit skeptical about this being changed in the foreseeable future.

We already have plans to run some performance benchmarks with different region sizes to see if it has any perf impact. I have also investigated the possibility of a different pattern of accessing the memory to lower the probability that the kernel will fail to merge the blocks. Once we know the influence of the region sizes on perf, we could decide to automatically pick the region size based, for example, on the total amount of available memory.

However, I still haven't seen evidence that the large number of mappings that can result from the current state causes performance problems. The mapping count limit can be raised to a larger value to mitigate the OOMs. The default limit is quite conservative.

I don't actually feel like we are aggressively committing / decommitting memory. We are just doing it on demand from the application: the more it allocates, the more we need to commit.

baal2000 commented 2 months ago

@janvorli thanks for the work and the progress your team is making to address this.

I saw the updated error message (https://github.com/dotnet/runtime/pull/102458), but that could be too late in the app development cycle to find out. Other times the process crashes with no message at all other than "General Protection Fault" in the kernel logs, or with a simple "out of memory" message; not sure why. These are all related, though, because none of them happen once the limit is raised in the OS. An upfront recommendation in the documentation could be in order: the garbage collection fundamentals page or some other place.

The default limit is quite conservative.

Could you propose the change in the Linux kernel repo? If there is no downside, it is going to be merged in no time and this issue could be closed.

cshung commented 2 months ago

Thanks for following up!

I feel that we should not lay all the blame at Linux's feet.

Just to reiterate, I said "I think there is a missed opportunity there for Linux to improve"; I am not blaming them. This is probably just a scenario they never envisioned, and we should enlighten them about it.

If the region-based GC becomes too aggressive in how the process commits/decommits memory regions from the underlying OS kernel.

The region-based GC isn't aggressive in terms of volume (we are not committing much more memory than what the application needs), nor is it aggressive in terms of frequency (we have optimizations in place to avoid frequent commit/decommit calls). It is the random sequence of committing that is choking the underlying OS.

A typical stack grows one way. A typical malloc implemented using brk grows one way.

To allow various optimizations, the regions design works by allowing multiple regions to grow independently, so think of it as many streams going on, all moving one way towards the right. Think of it like an old-school parallel download app.

image

At the end of the download you would expect one region, but no: on Linux these streams don't merge, and you end up with as many mappings as the number of streams ever created.

This story starts with 5N streams, where N is the number of cores and 5 is the number of generations. Once any of these streams ends, we create new ones. But by the time each of the initial 5N streams ends, all the earlier gaps should have been filled, and therefore we expect the number of streamed parts to stay at roughly 5N to 10N.

But the number we observed from the maps is about 100-fold more than that, and that is because when the ends meet, Linux cannot merge them. So it ends up being the total number of streams created so far; these accumulate over time, and therefore we have this issue.
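
To put rough numbers on it: on the 96-core machine discussed above, 5N is 480 streams, so we would expect roughly 480 to 960 mappings from the GC heap, whereas the maps counts reported earlier in this thread for that machine were on the order of 50K to 90K, about 100 times more.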

All of this comes with a caveat: it is based on my research of the Linux code base over just a few days, so it might very well be wrong.

This issue should be followed up by a more practical step of "what we could do" and not "what they (Linux) should".

I agree with that, but retrofitting regions to fit the Linux way of one-way committing is simply not an option. There are already various workarounds proposed in this thread; here they are (in order of preference):

  1. Bump the system limit of mapping counts
  2. Adjust the region size
  3. Use libclrgc.so

None of these require a change in the runtime code; they are all configurable.

Except for 3 (which actually grows the memory one way), 1 and 2 don't address the underlying issue that the ends of streams don't merge; all they do is either:

  1. Accept more ranges, so it is fine (up to what a typical user needs anyway), or
  2. Create fewer ranges, so it stays within the limit.

Eventually, with bigger apps, these limits will be hit again. IMO, this is just fixing the symptoms.

What we really want to understand is the consequences of 1 and 2.

We suspect 1 might impact memory access latencies, but do we have data? The dependency of memory access latency on the number of memory mappings should be logarithmic, so I expect that even if we 4x the number of memory mappings, we will have at most two more memory accesses per page fault. In the grand scheme of things, does that matter at all?

We experimented with 2, and there is a GC behavioral change. As of now this is still mysterious to us; do we have data we can use to analyze what is going on?

baal2000 commented 2 months ago

@cshung

It is the random sequence of committing that is choking the underlying OS.

This nails it, thank you.

Think of it like an old-school parallel download app

Interesting comparison, with the only difference being that with partial download streams "we" create the streams and "we" do the re-assembly at the end. In the regions GC scenario "we" do the distributed allocations to achieve better throughput, yet have no control over the VM maps re-assembly. Not saying this is wrong: this is to agree with the statement that this hasn't been expected and modeled.

  1. Bump the system limit of mapping counts

The sysctl documentation does not say much about the purpose of max_map_count other than stating what the value is and that it is assumed to be sufficient for standard use scenarios.

Red Hat claims that the limit has something to do with giving the kernel more access to its lowmem: "The upside of lowering this limit is that it can free up lowmem for other kernel uses."

Found this explanation in the original Linux kernel repo: it refers to the coredump as the reason. Is this relevant for Debian and other Linux flavors today?

Nowhere can I find a word about the memory access performance though.

Maybe a prudent thing to do for now would be to document this, similarly to the Elasticsearch recommendation of a max_map_count of at least 262144 to prevent out-of-memory exceptions: elastic.co.

This is the value we also picked for our servers to stop the incidents.

Not claiming this is the best wording, but the spirit of the message could be:

Since .NET 7 introduced the regions-based GC, applications with large memory and large CPU core counts, consuming 100s of GB of RAM and running on Linux, could run into the default operating system limit on mmap counts, which may result in out-of-memory crashes. You can increase the limits by ... (TBD)

cshung commented 2 months ago

Red Hat claims that the limit has something to do with giving the kernel more access to its lowmem: "The upside of lowering this limit is that it can free up lowmem for other kernel uses."

Just as an FYI: this might have to do with the non-paged pool.

Just like user mode, kernel mode can also use virtual memory. However, with virtual memory you can have a block of memory that is contiguous in virtual addresses but not contiguous in physical addresses, and this upsets direct memory access (DMA) for devices like secondary storage or network cards.

Therefore the kernel has a particular pool of memory that is restricted so that it can guarantee contiguous physical memory. This makes DMA possible.

Because the virtual memory areas are used to handle page faults, they probably need to be stored in that pool too, because you don't want to handle a page fault with memory that can itself incur a page fault.

That non-paged pool is probably a precious resource of its own, which is probably why we have the conservative limit.

baal2000 commented 2 months ago

Also an FYI, we are not alone: a similar issue was reported for ZGC in the JVM on the SO forum. They do provide an early warning about the imminent max maps value overflow:

The system limit on number of memory mappings per process might be too low for the given max Java heap size (NNNN). Please adjust /proc/sys/vm/max_map_count ...

MichalPetryka commented 2 months ago

Also an FYI, we are not alone: a similar issue was reported for ZGC in the JVM on the SO forum. They do provide an early warning about the imminent max maps value overflow:

The system limit on number of memory mappings per process might be too low for the given max Java heap size (NNNN). Please adjust /proc/sys/vm/max_map_count ...

FYI it's best to avoid any references to JVM and such here due to its copyleft licensing.

baal2000 commented 2 months ago

copyleft licensing

... can't apply to one forum user quoting another forum user's error message. But thanks for the reminder.

Now steering back to the main topic here, i.e. the unexpected application crash under the region-based GC.

janvorli commented 1 month ago

I am moving this issue to "Future" as there is nothing we can do for .NET 9 and the main topic of this issue has become a discussion of memory mapping merging.