Arm64: Higher calls to madvise/mprotect/mbind

kunalspathak commented 1 year ago

On TE JSON benchmark, we have noticed that more time is spent on linux/arm64 in syscalls like madvise, mprotect and mbind.

Ampere altra dual-socket:

Ampere altra single-socket:

Dell x64 r540:

The calls to mprotect have this backtrace:

CommitDoubleMappedMemory()
UnlockedLoaderHeap::CommitPages()
UnlockedLoaderHeap::GetMoreCommittedPages()
UnlockedLoaderHeap::UnlockedAllocMemForCode_NoThrow() or UnlockedLoaderHeap::UnlockedAllocMem_NoThrow()

Flame graphs are courtesy of @a74nh

kunalspathak commented 1 year ago

cc: @JulieLeeMSFT @a74nh

EgorBo commented 1 year ago

When you analyze TE graphs - do you ignore the "warm up" part or this flamegraph represent the whole session?

Because when I want to profile througput/latency, I usually skip first, say 30 seconds in the PerfView.

kunalspathak commented 1 year ago

My guess that UnlockedLoaderHeap::CommitPages is mostly startup related

Could be and it will be good to understand that because we are also trying to improve arm64 startup. I saw them when I was profiling paint.net startup too.

a74nh commented 1 year ago

When you analyze TE graphs - do you ignore the "warm up" part or this flamegraph represent the whole session?

Because when I want to profile througput/latency, I usually skip first, say 30 seconds in the PerfView.

These were done in the order:

start aspnet
do a warmup wrk run of 15 seconds
start perf
do a wrk run of 15 seconds
stop perf

a74nh commented 1 year ago

Some additional things to note:

Almost all of the time inside the madvise/mprotect/mbind is within the locking function osq_lock. The kernel is waiting for all the CPUs in the system - the more CPUs the more it has to wait for. For anything writing flag state, these calls are expensive - virtual memory areas (VMAs) are stored in a "Maple Tree" and access is via a read/write lock. Multiple concurrent readers are allowed but if a write lock is needed all the readers need to finish first. For example, the MADV_FREE flag the madvise code only needs a read lock for the vma traversal as it's not modifying the vma->flags. However DONTDUMP/DODUMP does need to modify the flags. Similarly for mprotect the VMA->flags are being modified thus a write lock is needed.

I tried reducing the usage of these calls. Firstly I got rid of madvise calls with DODUMP/DONTDUMP which just dictates whether a map should be added to the coredump. Across all machines, performance did not appear to be effected. Looking at the flamegraphs, madvise had vanished from the graph, but the cost of mprotect and mbind increased, keeping the total time in osq_lock consistent with the original run. This suggests to me that the locking within osq_lock is completely saturated.

Next I tried removing some of the mprotect calls (in addition to the above). Almost all the calls are due to memory allocation. One section in coreclr allocates a large chunk of memory, and then gives it out to the other parts of the vm in 4k chunks (via CommitDoubleMappedMemory). Each time this happens, it calls mprotect. I got rid of these calls and made the initial allocation give everything read+write+exec. Doing this did reduce the amount of time spent in osq_lock overall. Curiously there also appeared to now be additional calls into mprotect from elsewhere in the GC. At a pure guess, this was allowing the GC to get through more work during the same timeframe, but I'm not that familiar with that code. I tried the same to remove those calls too, but got stuck in segfaults.

Having a look at a run of the Java framework Netty, there are no calls to madvise/mprotect/mbind within the flamegraph. This suggests to me that OpenJDK isn't doing the same level of memory micromanagement that Coreclr is. Looking at the openjdk code, there are no uses of DODUMP/DONTDUMP

I understand that any changes in these areas will have security implications, so any changes won't be as simple as just removing all the calls.

kunalspathak commented 1 year ago

@janvorli

janvorli commented 1 year ago

@a74nh the DONTDUMP was added in https://github.com/dotnet/coreclr/pull/27089 to remedy a problem with huge core dumps . The core dumps were extremely large without it. As for reserving memory as R+W + X instead of PROT_NONE, it would break the W^X support we've introduced in .NET 6.0 (enabled by default in .NET 7). The W^X enabled can only have pages that are R+W or R+X, but not R+W+X. We could possibly map some memory as R+W or R+X from the very beginning, but when such memory is returned to the executable allocator and reused later, we would need to flip the protection in case the new block allocated from a place where there was a previous one had different protection. Also, not using PROT_NONE for memory that is just reserved gives up the protection against accessing memory that is not supposed to be accessible. It would be interesting to see how much gain you get in say aspnet benchmarks with your change to make everything R+W+X from the very beginning to get understanding the performance difference.

kunalspathak commented 1 year ago

to make everything R+W+X from the very beginning

Is that equivalent to DOTNET_EnableWriteXorExecute=0?

janvorli commented 1 year ago

Is that equivalent to DOTNET_EnableWriteXorExecute=0?

From the protection point of view, I would say yes. From the fine-grained implementation detail, it is not. When W^X is enabled, it uses shared memory for the allocations so that it can create secondary RW mappings for RX regions.

a74nh commented 1 year ago

It would be interesting to see how much gain you get in say aspnet benchmarks with your change to make everything R+W+X from the very beginning to get understanding the performance difference.

Running the test 20 times (plus 1 warmup), and calculating standard deviation and coefficient of variation for each set of 20: Requests/sec Avg:271891 Std Dev: 1117.92 CV: 0.411164 Requests/sec Avg:272652 Std Dev: 1555.01 CV: 0.570327 Requests/sec Avg:272576 Std Dev: 1481.02 CV: 0.543341 Requests/sec Avg:272581 Std Dev: 1297.21 CV: 0.4759 Requests/sec Avg:271408 Std Dev: 1108.94 CV: 0.408589 Requests/sec Avg:272940 Std Dev: 1444.37 CV: 0.52919

With DODUMP/DONTDUMP removed and removing most the calls to CommitDoubleMappedMemory(). Each set of 20 now looks like: Requests/sec Avg:274428 Std Dev: 900.245 CV: 0.328044 Requests/sec Avg:273760 Std Dev: 1091.33 CV: 0.398644 Requests/sec Avg:271993 Std Dev: 1166.95 CV: 0.429035 Requests/sec Avg:272946 Std Dev: 1179.54 CV: 0.432153 Requests/sec Avg:274146 Std Dev: 1182.08 CV: 0.431187 Requests/sec Avg:271913 Std Dev: 970.991 CV: 0.357096

In total, the Requests/sec are up by about 1%. Looking at the CV values, it's a difference of 20%.

(This is a single node altra machine with the test running across 16 CPUs. I was also alternating between which set of 20 runs I was doing).

kunalspathak commented 1 year ago

@janvorli - did you get a chance to look into this?

mangod9 commented 1 year ago

we havent looked into this yet due to other priorities.

kunalspathak commented 11 months ago

@a74nh the DONTDUMP was added in dotnet/coreclr#27089 to remedy a problem with huge core dumps . The core dumps were extremely large without it.

@janvorli - is it the case that without the calls to madvise with MADV_DODUMP or MADV_DONTDUMP, the memory will get included in coredump? If yes, can we just have calls with MADV_DONTDUMP and eliminate the calls with MADV_DODUMP because that memory will anyway get included in the coredump?

a74nh commented 11 months ago

@a74nh the DONTDUMP was added in dotnet/coreclr#27089 to remedy a problem with huge core dumps . The core dumps were extremely large without it.

@janvorli - is it the case that without the calls to madvise with MADV_DODUMP or MADV_DONTDUMP, the memory will get included in coredump? If yes, can we just have calls with MADV_DONTDUMP and eliminate the calls with MADV_DODUMP because that memory will anyway get included in the coredump?

Looking at the code, AIUI, memory can be in two states allocated and not committed (not visible to the subject application) or allocated and committed (visible to the subject application). When memory isn't committed it should be DONTDUMP, and when it is committed it should be DODUMP.

There's three over uses of DODUMP/DONTDUMP that I can see:

1. VirtualAlloc(MEM_COMMIT):

If memory not already allocated, then call VIRTUALReserveMemory() which calls ReserveVirtualMemory(). That does: mmap() + mprotect() + madvise(DONTDUMP)
Finally always mprotect() + madvise(DODUMP)

The DONTDUMP in the above isn't required. Probably just needs an extra flag passing down to ReserveVirtualMemory() to skip the madvise()

2. VirtualProtect() always calls madvise with the latest setting. However, it's possible the memory region already had that setting. Sadly, there's no way to query the kernel for the state. Would it possible to store the status of each chunk of memory - possibly in the RegionInformation? Not sure what extra memory and locking that would require.

3. VIRTUALResetMemory() calls madvise() twice, once with FREE and once with DONTDUMP. These could be combined easily.

If all the above doesn't fix the issues, then I suspect it's just due to the amount of memory going back and too between committed and uncommitted.

janvorli commented 11 months ago

Would it possible to store the status of each chunk of memory - possibly in the RegionInformation?

We actually want to move away from any accounting and windows style api simulation for virtual memory from PAL soon. So adding tracking of the state is something that would eventually be removed.

Optimizing the cases that @a74nh has identified where we call madvise twice sound good to me.

Regarding the issue in general though, I have not seen any real world benchmarks showing that the calls to madvice cause significant performance hit. The testing with removed madvise has shown only 1% difference in RPS and upto 20% difference in CV. That is very low and it might very well be a noise too.

a74nh commented 11 months ago

The graphs at the top of this issue are at least 6 months old now. So, I re-ran the tests on the latest HEAD.

Ampere altra single-socket:

aspcore_bench-20231204140305-ent-arm-39-ubuntu-dotnet-benchmark

Some things I noticed.

The part to the left of this is interesting (with backtrace SVR...). On the old graphs the backtrace was broken, but this has now been fixed. It is now obvious that this entire area is the GC.
The GC has shrunk from 10% to 4%. This is great.
However, third of the time in the GC is spent in mprotect() and madvise()
The backtrace for these comes from allocate_new_region() + generation_to_condemn()

The GC section:

The changes myself and Kunal discussed above can be applied to the GC, but, they don't make any difference to these graphs. I will still raise a PR as they are improvements.
I suspect the time in mprotect() and madvise() is due to the .NET memory model where allocated + uncommitted memory is marked as inaccessible, and allocated + committed memory is given access permissions. If memory is changing a lot between these two states then the time in those syscalls makes sense.

a74nh commented 11 months ago

I did a few more investigations. Rerunning the same test on multiple machines:

% of time spent in the GC:

Altra N1 (Dual node): 9.82%
Altra N1 (Single node): 4.11%
Yitian N2: 0.53%
c7g V1: 0.9%
x86_64 epyc-7251: 0.11%

I also noticed that the use of mprotect/madvise is dependant on the GC #define USE_REGIONS. Running the tests again:

Altra N1 (Dual node) with USE_REGIONS:
- Requests/sec Avg:228702 Std Dev: 12137.1 CV: 5.30693%
- Time spent in GC 9.82%
Altra N1 (Dual node) without USE_REGIONS:
- Requests/sec Avg:241285 Std Dev: 6784.61 CV: 2.81187%
- Time spent in GC 4.02%

This is a 5% performance boost, a halving of the CV (coefficient of variation), and halving the time spent in the GC.

Altra N1 (Single node) with USE_REGIONS:
- Requests/sec Avg:260657 Std Dev: 863.9 CV: 0.331432%
- Time spent in GC 4.11%
Altra N1 (Single node) without USE_REGIONS:
- Requests/sec Avg:261544 Std Dev: 1060.25 CV: 0.405379%
- Time spent in GC 3.61%

Here the GC time has reduced, but the perf and CV difference is within noise.

Other machines types showed no difference when changing USE_REGIONS.

kunalspathak commented 11 months ago

Thanks @a74nh for the experiments. So, seems #USE_REGIONS have more influence on dual node vs. single node. @mangod9 - do you recall if we have seen similar regressions?

mangod9 commented 11 months ago

we will have to take a look. @a74nh, are these measurements on the json benchmark still or a different benchmark?

a74nh commented 11 months ago

we will have to take a look. @a74nh, are these measurements on the json benchmark still or a different benchmark?

Yes, it's the json benchmark.

dotnet / runtime

Arm64: Higher calls to madvise/mprotect/mbind #87173