kunalspathak opened 1 year ago
cc: @JulieLeeMSFT @a74nh
When you analyze TE graphs - do you ignore the "warm up" part, or does this flamegraph represent the whole session?
Because when I want to profile throughput/latency, I usually skip the first, say, 30 seconds in PerfView.
My guess is that UnlockedLoaderHeap::CommitPages is mostly startup related.
Could be, and it would be good to understand that, because we are also trying to improve arm64 startup. I saw them when I was profiling paint.net startup too.
When you analyze TE graphs - do you ignore the "warm up" part, or does this flamegraph represent the whole session? Because when I want to profile throughput/latency, I usually skip the first, say, 30 seconds in PerfView.
These were done in the order:
Some additional things to note:
Almost all of the time inside madvise/mprotect/mbind is spent in the kernel locking function osq_lock. The kernel is waiting for all the CPUs in the system - the more CPUs, the longer it has to wait. For anything writing flag state, these calls are expensive: virtual memory areas (VMAs) are stored in a "Maple Tree" and access is via a read/write lock. Multiple concurrent readers are allowed, but if a write lock is needed, all the readers need to finish first. For example, with the MADV_FREE flag the madvise code only needs a read lock for the VMA traversal, as it's not modifying vma->flags. However, DONTDUMP/DODUMP does need to modify the flags. Similarly for mprotect, vma->flags are being modified, thus a write lock is needed.
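As a rough illustration (not code from the runtime), here is a minimal C sketch of the calls under discussion; the comments note which ones modify vma->flags and therefore need the VMA write lock described above:

```c
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    size_t len = 1 << 20;  /* 1 MiB */

    /* Reserve and commit some anonymous memory. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    memset(p, 0xab, len);  /* fault the pages in */

    /* MADV_FREE only traverses VMAs under a read lock. */
    madvise(p, len, MADV_FREE);

    /* MADV_DONTDUMP/MADV_DODUMP modify vma->flags: write lock needed. */
    madvise(p, len, MADV_DONTDUMP);
    madvise(p, len, MADV_DODUMP);

    /* mprotect also modifies vma->flags: write lock needed. */
    mprotect(p, len, PROT_READ);

    munmap(p, len);
    return 0;
}
```

Running something like this under strace or perf on a many-core machine shows the same syscall mix that dominates the flamegraphs above.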
I tried reducing the usage of these calls. Firstly I got rid of the madvise calls with DODUMP/DONTDUMP, which just dictate whether a mapping should be added to the core dump. Across all machines, performance did not appear to be affected. Looking at the flamegraphs, madvise had vanished from the graph, but the cost of mprotect and mbind increased, keeping the total time in osq_lock consistent with the original run. This suggests to me that the locking within osq_lock is completely saturated.
Next I tried removing some of the mprotect calls (in addition to the above). Almost all the calls are due to memory allocation. One section in coreclr allocates a large chunk of memory, and then gives it out to the other parts of the VM in 4k chunks (via CommitDoubleMappedMemory). Each time this happens, it calls mprotect. I got rid of these calls and made the initial allocation give everything read+write+exec. Doing this did reduce the amount of time spent in osq_lock overall. Curiously, there then appeared to be additional calls into mprotect from elsewhere in the GC. At a pure guess, this was allowing the GC to get through more work during the same timeframe, but I'm not that familiar with that code. I tried the same approach to remove those calls too, but got stuck in segfaults.
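A hypothetical sketch of the pattern being described (the names reserve_block/commit_chunk are illustrative, not CoreCLR's): a large reservation is handed out one page-sized chunk at a time, with an mprotect per chunk, which is the per-commit syscall cost the experiment removed by mapping everything read+write+exec up front:

```c
#include <stddef.h>
#include <sys/mman.h>

#define CHUNK 4096

/* Illustrative only: reserve a large inaccessible block up front. */
static char *reserve_block(size_t size) {
    void *p = mmap(NULL, size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hand out the next 4k chunk: one mprotect syscall per commit.
 * Under load, each of these calls contends on the kernel's VMA lock. */
static void *commit_chunk(char *base, size_t *next) {
    char *chunk = base + *next;
    if (mprotect(chunk, CHUNK, PROT_READ | PROT_WRITE) != 0)
        return NULL;
    *next += CHUNK;
    return chunk;
}
```

The experiment effectively replaces PROT_NONE in the initial mapping with PROT_READ | PROT_WRITE | PROT_EXEC, making commit_chunk's mprotect unnecessary (at the cost of W^X, as discussed below).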
Having a look at a run of the Java framework Netty, there are no calls to madvise/mprotect/mbind within the flamegraph. This suggests to me that OpenJDK isn't doing the same level of memory micromanagement that CoreCLR is. Looking at the OpenJDK code, there are no uses of DODUMP/DONTDUMP.
I understand that changes in these areas will have security implications, so fixes won't be as simple as just removing all the calls.
@janvorli
@a74nh the DONTDUMP was added in https://github.com/dotnet/coreclr/pull/27089 to remedy a problem with huge core dumps. The core dumps were extremely large without it. As for reserving memory as R+W+X instead of PROT_NONE, it would break the W^X support we introduced in .NET 6.0 (enabled by default in .NET 7). With W^X enabled, pages can only be R+W or R+X, but not R+W+X. We could possibly map some memory as R+W or R+X from the very beginning, but when such memory is returned to the executable allocator and reused later, we would need to flip the protection in case the new block was allocated from a place where a previous one had different protection. Also, not using PROT_NONE for memory that is just reserved gives up the protection against accessing memory that is not supposed to be accessible. It would be interesting to see how much gain you get in, say, the aspnet benchmarks with your change to make everything R+W+X from the very beginning, to understand the performance difference.
to make everything R+W+X from the very beginning
Is that equivalent to DOTNET_EnableWriteXorExecute=0?
From the protection point of view, I would say yes. In the fine-grained implementation details, it is not. When W^X is enabled, it uses shared memory for the allocations so that it can create secondary RW mappings for RX regions.
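To illustrate the dual-mapping idea (a minimal sketch of the general technique, not CoreCLR's actual allocator), code lives in a shared-memory object mapped twice: an RX view for execution and a separate RW view for the JIT to write through, so no single page is ever writable and executable at once:

```c
#define _GNU_SOURCE  /* memfd_create */
#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>

int main(void) {
    size_t len = 4096;

    int fd = memfd_create("jit-region", MFD_CLOEXEC);
    if (fd < 0 || ftruncate(fd, len) != 0) { perror("memfd"); return 1; }

    /* Primary mapping: read+execute - what the CPU runs from. */
    void *rx = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);

    /* Secondary mapping: read+write view of the same physical pages. */
    void *rw = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* Code written via rw becomes visible at rx without any page
     * ever being mapped writable and executable at the same time. */
    printf("rx=%p rw=%p\n", rx, rw);
    return 0;
}
```

For comparison, the W^X machinery can be turned off at runtime with DOTNET_EnableWriteXorExecute=0, which is what the question above refers to.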
It would be interesting to see how much gain you get in, say, the aspnet benchmarks with your change to make everything R+W+X from the very beginning, to understand the performance difference.
Running the test 20 times (plus 1 warmup), and calculating standard deviation and coefficient of variation for each set of 20:

Requests/sec Avg: 271891  Std Dev: 1117.92  CV: 0.411164
Requests/sec Avg: 272652  Std Dev: 1555.01  CV: 0.570327
Requests/sec Avg: 272576  Std Dev: 1481.02  CV: 0.543341
Requests/sec Avg: 272581  Std Dev: 1297.21  CV: 0.4759
Requests/sec Avg: 271408  Std Dev: 1108.94  CV: 0.408589
Requests/sec Avg: 272940  Std Dev: 1444.37  CV: 0.52919

With DODUMP/DONTDUMP removed and most of the calls to CommitDoubleMappedMemory() removed, each set of 20 now looks like:

Requests/sec Avg: 274428  Std Dev: 900.245  CV: 0.328044
Requests/sec Avg: 273760  Std Dev: 1091.33  CV: 0.398644
Requests/sec Avg: 271993  Std Dev: 1166.95  CV: 0.429035
Requests/sec Avg: 272946  Std Dev: 1179.54  CV: 0.432153
Requests/sec Avg: 274146  Std Dev: 1182.08  CV: 0.431187
Requests/sec Avg: 271913  Std Dev: 970.991  CV: 0.357096
In total, the Requests/sec are up by about 1%. Looking at the CV values, the run-to-run variation is down by about 20%.
(This is a single-node Altra machine with the test running across 16 CPUs. I was also alternating between which set of 20 runs I was doing.)
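For reference on how to read those numbers, the CV (coefficient of variation) is the standard deviation as a percentage of the mean; e.g. 100 × 1117.92 / 271891 ≈ 0.411, matching the first row above. A minimal sketch (using the sample-variance convention, an assumption):

```c
#include <math.h>

/* Coefficient of variation (percent) of n samples: 100 * stddev / mean. */
double cv_percent(const double *x, int n) {
    double sum = 0.0, var = 0.0;
    for (int i = 0; i < n; i++) sum += x[i];
    double mean = sum / n;
    for (int i = 0; i < n; i++) var += (x[i] - mean) * (x[i] - mean);
    var /= (n - 1);  /* sample variance */
    return 100.0 * sqrt(var) / mean;
}
```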
@janvorli - did you get a chance to look into this?
We haven't looked into this yet due to other priorities.
@a74nh the DONTDUMP was added in dotnet/coreclr#27089 to remedy a problem with huge core dumps . The core dumps were extremely large without it.
@janvorli - is it the case that without the calls to madvise with MADV_DODUMP or MADV_DONTDUMP, the memory will get included in the core dump? If yes, can we just have calls with MADV_DONTDUMP and eliminate the calls with MADV_DODUMP, because that memory will get included in the core dump anyway?
Looking at the code, AIUI, memory can be in two states: allocated and not committed (not visible to the subject application), or allocated and committed (visible to the subject application). When memory isn't committed it should be DONTDUMP, and when it is committed it should be DODUMP.
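As a rough illustration of those two states (a sketch of the general pattern, not the PAL's actual code), each transition pairs an mprotect with a madvise:

```c
#include <stddef.h>
#include <sys/mman.h>

/* Sketch: committed memory is visible to the application, so give it
 * access permissions and include it in core dumps. */
static int commit_pages(void *p, size_t len) {
    if (mprotect(p, len, PROT_READ | PROT_WRITE) != 0) return -1;
    return madvise(p, len, MADV_DODUMP);
}

/* Sketch: uncommitted memory is hidden again - inaccessible and
 * excluded from core dumps. Every round trip costs two syscalls. */
static int decommit_pages(void *p, size_t len) {
    if (mprotect(p, len, PROT_NONE) != 0) return -1;
    return madvise(p, len, MADV_DONTDUMP);
}
```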
There are three over-uses of DODUMP/DONTDUMP that I can see:

1. VirtualAlloc(MEM_COMMIT): VIRTUALReserveMemory(), which calls ReserveVirtualMemory(). That does mmap() + mprotect() + madvise(DONTDUMP), followed by mprotect() + madvise(DODUMP). The DONTDUMP in the above isn't required. Probably just needs an extra flag passed down to ReserveVirtualMemory() to skip the madvise().
2. VirtualProtect() always calls madvise with the latest setting. However, it's possible the memory region already had that setting. Sadly, there's no way to query the kernel for the state. Would it be possible to store the status of each chunk of memory - possibly in the RegionInformation? Not sure what extra memory and locking that would require. (See the sketch after this list.)
3. VIRTUALResetMemory() calls madvise() twice, once with FREE and once with DONTDUMP. These could be combined easily.
If all the above doesn't fix the issues, then I suspect it's just due to the amount of memory going back and forth between committed and uncommitted.
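A hypothetical sketch of the idea in point 2 (the struct layout and field names are illustrative, not the actual PAL types): remember the last dump advice per region so a repeated VirtualProtect()-style call can skip the redundant madvise():

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical per-region bookkeeping: cache the last dump advice. */
typedef struct {
    void  *base;
    size_t size;
    bool   dumpable;  /* true = MADV_DODUMP was last, false = MADV_DONTDUMP */
} RegionInformation;

static int set_dumpable(RegionInformation *r, bool dumpable) {
    if (r->dumpable == dumpable)
        return 0;  /* state unchanged: no syscall, no VMA write lock taken */
    if (madvise(r->base, r->size, dumpable ? MADV_DODUMP : MADV_DONTDUMP) != 0)
        return -1;
    r->dumpable = dumpable;
    return 0;
}
```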
Would it be possible to store the status of each chunk of memory - possibly in the RegionInformation?
We actually want to move away from any accounting and Windows-style API simulation for virtual memory in the PAL soon. So adding tracking of the state is something that would eventually be removed.
Optimizing the cases that @a74nh has identified where we call madvise twice sounds good to me.
Regarding the issue in general though, I have not seen any real-world benchmarks showing that the calls to madvise cause a significant performance hit. The testing with madvise removed has shown only a 1% difference in RPS and up to a 20% difference in CV. That is very low, and it might very well just be noise.
The graphs at the top of this issue are at least 6 months old now. So, I re-ran the tests on the latest HEAD.
Ampere altra single-socket:
Some things I noticed:

- A large chunk of time is spent in the GC (SVR...). On the old graphs the backtrace was broken, but this has now been fixed. It is now obvious that this entire area is the GC.
- The mprotect() and madvise() calls in the GC come via allocate_new_region() + generation_to_condemn().
- The GC's mprotect() and madvise() usage is due to the .NET memory model, where allocated + uncommitted memory is marked as inaccessible, and allocated + committed memory is given access permissions. If memory is changing a lot between these two states then the time in those syscalls makes sense.

I did a few more investigations. Rerunning the same test on multiple machines:
% of time spent in the GC:
I also noticed that the use of mprotect/madvise is dependent on the GC #define USE_REGIONS. Running the tests again:
This is a 5% performance boost, a halving of the CV (coefficient of variation), and a halving of the time spent in the GC.
Here the GC time has reduced, but the perf and CV difference is within noise.
Other machine types showed no difference when changing USE_REGIONS.
Thanks @a74nh for the experiments. So, it seems USE_REGIONS has more influence on dual-node than on single-node machines. @mangod9 - do you recall if we have seen similar regressions?
we will have to take a look. @a74nh, are these measurements on the json benchmark still or a different benchmark?
Yes, it's the json benchmark.
On the TE JSON benchmark, we have noticed that more time is spent on linux/arm64 in syscalls like madvise, mprotect and mbind.

Ampere altra dual-socket:
Ampere altra single-socket:
Dell x64 r540:
The calls to mprotect have this backtrace:
Flame graphs are courtesy of @a74nh