dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Application crash when reaching vm.map_count limit #90230

Closed ayende closed 4 weeks ago

ayende commented 1 year ago

Description

We are tracking what looks like a memory / fragmentation leak, see here: https://github.com/dotnet/runtime/issues/89776

As a result, we hit the vm.max_map_count limit; see:

sudo cat /proc/$(pidof Raven.Server)/maps | wc -l
65406

The limit was set to 65535, and we got several crashes from the finalizer thread.
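
The check above can also be scripted. The following is a minimal sketch, assuming a Linux host where `/proc/<pid>/maps` and `/proc/sys/vm/max_map_count` are readable by the caller (the original command used sudo). The function names and the sample text are made up for illustration and are not part of the report.

```python
from pathlib import Path
from typing import Tuple


def count_mappings(maps_text: str) -> int:
    """Each non-empty line of /proc/<pid>/maps describes one mapping."""
    return sum(1 for line in maps_text.splitlines() if line.strip())


def read_mapping_usage(pid: int) -> Tuple[int, int]:
    """Return (current mapping count, vm.max_map_count). Linux only."""
    maps_text = Path(f"/proc/{pid}/maps").read_text()
    limit = int(Path("/proc/sys/vm/max_map_count").read_text())
    return count_mappings(maps_text), limit


# Hypothetical two-line excerpt in the /proc/<pid>/maps format.
SAMPLE_MAPS = (
    "7f0000000000-7f0000001000 r-xp 00000000 08:01 1234 /usr/lib/example.so\n"
    "7f0000001000-7f0000002000 rw-p 00000000 00:00 0\n"
)
SAMPLE_COUNT = count_mappings(SAMPLE_MAPS)
```

Calling `read_mapping_usage(pid)` for the server process would give the same number as `wc -l`, together with the configured limit, so the remaining headroom can be computed.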

Aug 06 19:56:28 vm9e618664fb audit[32018]: ANOM_ABEND auid=4294967295 uid=1001 gid=1001 ses=4294967295 pid=32018 comm=2E4E45542046696E616C697A6572 exe="/ravendb/RavenDB/Server/Raven.Server" sig=11 res=1
Aug 06 19:56:28 vm9e618664fb kernel: .NET Finalizer[32047]: segfault at 440 ip 00007fa6d3f8cb62 sp 00007fa6d0439ff8 error 6 in libc-2.27.so[7fa6d3dfe000+1e7000]
Aug 06 19:56:28 vm9e618664fb kernel: Code: 1c 26 00 0f 87 07 01 00 00 c5 fe 6f 06 c5 fe 6f 4e 20 c5 fe 6f 56 40 c5 fe 6f 5e 60 48 81 c6 80 00 00 00 48 81 ea 80 00 00 00 <c5> fd 7f 07 c5 fd 7f 4f 20 c5 fd 7f 57 40 c5 fd 7f 5f 60 48 81 c7
Aug 06 19:56:29 vm9e618664fb systemd[1]: ravendb.service: Main process exited, code=killed, status=11/SEGV
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]: Fatal error. The RW block to unmap was not found
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Sparrow.Server.Platform.PalHelper.ThrowLastError(FailCodes, Int32, System.String)
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Impl.Paging.RvnMemoryMapPager.AllocateMorePages(Int64)
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Impl.Scratch.ScratchBufferFile.Allocate(Voron.Impl.LowLevelTransaction, Int32, Int32)
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Impl.Scratch.ScratchBufferPool.Allocate(Voron.Impl.LowLevelTransaction, Int32)
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Impl.LowLevelTransaction.AllocatePage(Int32, Int64, System.Nullable`1<Voron.Page>, Boolean)
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Impl.LowLevelTransaction.AllocatePage(Int32, System.Nullable`1<Int64>, System.Nullable`1<Voron.Page>, Boolean)
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Impl.LowLevelTransaction.AllocateOverflowRawPage(Int64, Int32 ByRef, System.Nullable`1<Int64>, System.Nullable`1<Voron.Page>, Boolean)
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Data.BTrees.Tree+StreamToPageWriter.AllocateNextPage()
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Data.BTrees.Tree+StreamToPageWriter.Write(System.IO.Stream)

I believe that this is related to this: https://github.com/dotnet/runtime/issues/80580

Reproduction Steps

Run for a long while under load; the number of mappings increases until it reaches the limit.

Expected behavior

Should not crash

Actual behavior

It crashes

Regression?

Did not see that in .NET 6.0

Known Workarounds

Increase vm.max_map_count to a very high value

Configuration

Linux, x64, Ubuntu, .NET 7.0

Other information

The stack trace is really strange, we are attempting to allocate memory and then fail. That is a handled exception, but we are dying with segmentation fault.

Given that this is thrown from the executable allocator, I wonder whether the runtime is trying to JIT the method, or maybe tier it up, and then failing.

Note that munmap can fail on Linux (if the call would increase the number of mappings). This should probably be handled without killing the process.
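
To illustrate why an unmap can increase the mapping count: unmapping a range in the middle of an existing mapping leaves two smaller mappings on either side. The toy model below is only a sketch of that bookkeeping, not the kernel's or the runtime's implementation; the `unmap` function and its `max_map_count` parameter are hypothetical names.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # half-open [start, end) address range


def unmap(regions: List[Region], start: int, end: int,
          max_map_count: int) -> List[Region]:
    """Remove [start, end) from a list of non-overlapping regions (toy model)."""
    result: List[Region] = []
    for r_start, r_end in regions:
        if end <= r_start or r_end <= start:
            result.append((r_start, r_end))  # region not touched by the hole
            continue
        if r_start < start:
            result.append((r_start, start))  # piece left of the hole
        if end < r_end:
            result.append((end, r_end))      # piece right of the hole
    if len(result) > max_map_count:
        # Models the ENOMEM case: the split would exceed the limit.
        raise MemoryError("unmap would exceed the mapping limit")
    return result


# Punching a hole in one region yields two regions.
SPLIT = unmap([(0, 100)], 40, 60, max_map_count=2)
```

With `max_map_count=1` the same call raises, which mirrors the situation where a process already at the limit cannot unmap part of a mapping.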

mangod9 commented 1 year ago

I assume this issue doesn't repro with W^X disabled? Is there a dump that can be shared to debug further? It appears the original mapping return code issue is fixed in 8 and might be ported to 7 too. Are you testing on the latest servicing release?

ayende commented 1 year ago

I'm afraid that we don't have a dump, only those logs. This is a production instance, so we just bumped the map_count to alleviate the issue. @gregolsky - can you answer regarding the W^X and the servicing release?

gregolsky commented 1 year ago

Runtime version is 7.0.8. AFAIK WriteXorExecute is enabled there by default? I'm afraid we cannot do any experiments on how it works when it's disabled on the system in question.

We can try to repro on another one, though, by artificially reducing the number of available maps.
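
For a repro on a separate test machine, W^X can be switched off per process through the `DOTNET_EnableWriteXorExecute` environment variable. The helper below is a small sketch of building such an environment; the helper name and the launch path in the comment are hypothetical.

```python
import os
from typing import Dict, Mapping, Optional


def env_with_wx_disabled(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and disable W^X for .NET processes started with it."""
    env = dict(os.environ if base is None else base)
    env["DOTNET_EnableWriteXorExecute"] = "0"
    return env


# Usage sketch (hypothetical path):
#   subprocess.run(["/path/to/Raven.Server"], env=env_with_wx_disabled())
```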

janvorli commented 1 year ago

Is it possible that you have a growing number of dynamically created assemblies, as in https://github.com/dotnet/runtime/issues/80580#issuecomment-1410942438?

ayende commented 1 year ago

Not likely, we aren't really generating many new assemblies on the fly, and not at all in the scenario we tested

janvorli commented 1 year ago

@ayende re-reading the issue description, I am not sure I understand this:

The stack trace is really strange, we are attempting to allocate memory and then fail. That is a handled exception, but we are dying with segmentation fault.

The failure to map memory as RW in the executable allocator is a fatal fail fast. It is not an exception. The stack trace is a stack trace of the managed code at the time the fail fast happened. At that point, we don't have any other option than to fail fast. There are more than a hundred places all over the code base where we need to modify or write executable code, and there is no way to recover from that in the majority of those places.

The fact that you get a sigsegv after the fail fast message is printed is strange though. I wonder if that could be related to the wrong checks for mmap return value that were fixed in .NET 8 (#77952, #78069), but not ported to .NET 7. It is actually quite possible it is the case.
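
As background on how an mmap return value check can go wrong: on Linux, mmap signals failure with `MAP_FAILED`, which is `(void *) -1`, not with a null pointer. The sketch below assumes the wrong checks compared against NULL; the thread does not spell out the details, so treat that as an assumption. The function names are illustrative and are not the runtime's code.

```python
import ctypes

# MAP_FAILED is (void *) -1, i.e. a pointer with every bit set.
MAP_FAILED = ctypes.c_void_p(-1).value


def naive_check_failed(addr) -> bool:
    """Assumed buggy pattern: treats only a null pointer as failure."""
    return addr is None or addr == 0


def correct_check_failed(addr) -> bool:
    """Compares against MAP_FAILED, as the mmap man page requires."""
    return addr == MAP_FAILED
```

With the naive check, a failed mmap looks like success, and the caller goes on to use an invalid address, which is one way a later segfault could arise.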

gregolsky commented 1 year ago

Would it be possible to port these to 7?

mangod9 commented 1 year ago

@gregolsky, have you tried the scenario on 8 to ensure these fixes indeed work in your case?

ayende commented 12 months ago

What would be the expected scenario here? Ideally, I would rather get a proper error message / diagnostics so that we can detect that in production.

The usual metrics (memory usage, etc) are not a problem in this case.
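
One way to detect this in production would be to alert on the ratio of mappings to the limit. The following is a rough sketch only; the function name, the level names, and the 0.8 threshold are assumptions rather than anything the runtime provides.

```python
def mapping_alert(current: int, limit: int, warn_ratio: float = 0.8) -> str:
    """Classify mapping usage as 'ok', 'warning', or 'critical'."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = current / limit
    if ratio >= 1.0:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"


# Numbers from the original report: 65406 mappings against a limit of 65535.
REPORTED_LEVEL = mapping_alert(65406, 65535)
```

The inputs could come from the line count of `/proc/<pid>/maps` and the value of `/proc/sys/vm/max_map_count`, sampled periodically.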

mangod9 commented 4 weeks ago

@janvorli, I believe there was a change to update the error message for these conditions?

janvorli commented 4 weeks ago

Yes, the message was updated in #102458

mangod9 commented 4 weeks ago

Ok closing this issue based on that.