eclipse / omr

Eclipse OMR™ Cross platform components for building reliable, high performance language runtimes
http://www.eclipse.org/omr

Improve perf of omrmem_allocate_memory32 #7190

Open babsingh opened 10 months ago

babsingh commented 10 months ago

Problem

omrmem_allocate_memory32 allocates native memory below 4G.

While running a Java virtual thread benchmark named Skynet, poor performance is seen in omrmem_allocate_memory32 as the benchmark exhausts the native memory below 4G. This either results in a timeout or an OutOfMemoryError.

Potential Issues

Two Potential Solutions

Approach 2 is preferred since there are existing implementations for the memory allocator, which address the above perf issues.

Examples of Existing Memory Allocators

Verify Feasibility of Approach 2

babsingh commented 10 months ago

@ThanHenderson Opened this issue to document our earlier discussion. All future updates can also be posted here.

fyi @tajila

ThanHenderson commented 7 months ago

Here's an update. (TL;DR: this doesn't appear to be an issue currently, and I think this can be closed.)

I'll answer some of the questions from above out-of-order:

While running a Java virtual thread benchmark named Skynet, poor performance is seen in omrmem_allocate_memory32 as the benchmark exhausts the native memory below 4G. This either results in a timeout or an OutOfMemoryError.

I haven't been able to observe this while running hundreds of iterations with JDK21 builds on x86_64 Linux and Power Linux machines.

It takes too much time to search for memory as more memory is used...

Though our sub-4GB allocator isn't as efficient as the plain malloc call used for full-width (size_t) references, any constrained sub-n GB allocation is inherently less efficient, because it requires looping and passing address hints to the underlying memory-mapping system calls (at least on POSIX-compliant systems).
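To make that cost concrete, here is a minimal sketch of the general looping/hinting pattern (not OMR's actual implementation; the function name and stepping policy are invented for illustration):

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Step through candidate addresses below the 4GB bar, hinting each one to
 * mmap. The hint is not binding, so every returned mapping must be checked
 * and unmapped if it landed above the bar. */
static void *
reserve_below_4g(size_t byteAmount, size_t step)
{
    const uintptr_t bar = (uintptr_t)1 << 32;
    uintptr_t addr;

    for (addr = step; (addr + byteAmount) <= bar; addr += step) {
        void *mem = mmap((void *)addr, byteAmount, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (MAP_FAILED == mem) {
            continue;
        }
        if (((uintptr_t)mem + byteAmount) <= bar) {
            return mem; /* whole range is below the bar */
        }
        munmap(mem, byteAmount); /* kernel ignored the hint; keep searching */
    }
    return NULL; /* low address space is exhausted */
}

A plain malloc has none of this overhead, which is why the gap widens as the low region fills up and more candidate addresses have to be tried.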

Is it possible to control memory allocation below 4G?

I've embedded (without LD_PRELOAD) a custom build of jemalloc [1,2,3] in which allocations are constrained below the 4GB boundary. I haven't found an allocator that allows this type of constraint out of the box, given the limitations discussed in the previous response. I chose jemalloc because: 1) its design targets minimizing lock contention; 2) it shows positive results in various benchmarking reports, notably [4]; and 3) the project uses make, which integrates well into our build system (as opposed to other contenders like tcmalloc [5], which would add bazel as a dependency). Moreover, the only copyright condition is retaining their copyright notice.

Performance evaluation plan

Since I haven't been able to reproduce the Skynet regression, I also reached for the Renaissance [6] test suite. I ran many iterations of Skynet and each of the Renaissance benchmarks -- with a particular focus on the memory-bound and contention benchmarks -- both with and without the custom jemalloc, and did not observe any efficiency improvement (in terms of run time) or reduction in lock contention (which I couldn't reproduce in the first place).


Conclusion

In light of the above, I conclude that no action should be taken, since there is no observable problem with the current implementation of the sub-4GB allocator (at least on the workloads we tested). And without positive performance data to back up the embedding, the cost of supporting, shipping, and maintaining a custom third-party allocator is not justified.

That being said, I will document -- in a comment below -- how I embedded jemalloc into OpenJ9 and OMR for future reference, and preserve my patches and preliminary custom implementation in case we ever want to revisit this.


Aside

I was thinking we could also explore using jemalloc as the default allocator rather than malloc for all other allocations. But preliminary testing here was also neutral, and modern glibc malloc implementations already deliver some of the advances found in other allocators, like thread-local caches and arenas. So I don't think there is anything fruitful here currently either.


[1] https://github.com/jemalloc/jemalloc
[2] https://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf
[3] http://jemalloc.net/jemalloc.3.html
[4] http://ithare.com/testing-memory-allocators-ptmalloc2-tcmalloc-hoard-jemalloc-while-trying-to-simulate-real-world-loads/
[5] https://github.com/google/tcmalloc
[6] https://renaissance.dev/

ThanHenderson commented 7 months ago

For future interest/reference, here are the patches to embed a third-party allocator (jemalloc specifically, but the approach should be the same for others too).
OMR: https://github.com/ThanHenderson/omr/commit/94f023ba38c10ad38de8f655b0b409c5bc27fa92
OpenJ9: https://github.com/ThanHenderson/openj9/commit/a3aa879296a049fa3cbbaa7701c56314ff23c609

It is a little funky to get the cmake scripts in order. This solution works, but it would need more care for production use: for example, jemalloc is built twice during the build process (once for OpenJ9 and once for OMR) even though the build results land in the same directory; a better solution would be to check whether the library already exists before invoking the fetch and build commands.

To keep things internal to the project, I used cmake's FetchContent functionality to pull from a hosted Git repo (this can be anything) and installed it locally rather than to one of the system library search paths. This required some configuration to force the use of, and make changes to, the rpaths of certain dependent targets, because using target_link_libraries wasn't sufficient on its own and cmake doesn't support runpath. FetchContent relies on a CMakeLists.txt file in the allocator's repo. Since jemalloc is make-based, I needed to create a wrapper file that simply invokes the appropriate build pipeline (autogen.sh, configure, make), e.g. as shown below:

cmake_minimum_required(VERSION 3.1...3.28)

project(
  jemalloc 
  VERSION 1.0
  LANGUAGES CXX)

add_custom_target(nuke
    COMMAND make clean
    WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
)

add_custom_target(jemalloc ALL
    COMMAND ./autogen.sh --disable-initial-exec-tls
    COMMAND make
    WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
)

FetchContent resolves the repo at configure time and automatically invokes the CMakeLists.txt file for the external project at build time. Projects that are not make-based or cmake-based likely involve a similar pipeline.

Regardless of the allocator used, OMR just expects malloc32 and free32 procedures to be exported. Due to how some targets are linked to the shared library, one needs to take care not to export other functionality -- malloc, calloc, realloc, free, etc. -- from the embedded allocator, so as not to interfere with other allocations; doing so would likely cause downstream issues in the GC and elsewhere.
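For reference, a minimal sketch of that exported surface, assuming jemalloc is built with --with-jemalloc-prefix=je_ so its public symbols don't collide with the libc malloc family (the sub-4GB constraint itself is assumed to live inside the embedded allocator, as in the patches above):

#include <stddef.h>

/* Prefixed jemalloc entry points (assumed build configuration). */
extern void *je_malloc(size_t size);
extern void je_free(void *ptr);

/* Only these two symbols get default visibility; everything else in the
 * embedded allocator stays hidden so it cannot interfere with other
 * allocations in the GC and elsewhere. */
__attribute__((visibility("default")))
void *malloc32(size_t byteAmount)
{
    return je_malloc(byteAmount);
}

__attribute__((visibility("default")))
void free32(void *memPointer)
{
    je_free(memPointer);
}

Compiling the glue with -fvisibility=hidden (or using a linker version script) keeps the remaining symbols private.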

Everything should be smooth sailing after that.

tajila commented 7 months ago

Thanks for your analysis @ThanHenderson

So it sounds like swapping our current allocator for another one is not the answer. However, perhaps there is a way to tune our allocator to be more optimal for vthread workloads? Have you looked into toggling the initial region size of the heaps? The amount that is initially committed? Perhaps there is a size that is more optimal for vthread-heavy workloads.

ThanHenderson commented 7 months ago

@tajila I have not, but it seems like there is an opportunity for a limit study here. Other than region size, are there any other parameters that you think I should look into?

tajila commented 7 months ago

There is also PPG_mem_mem32_subAllocHeapMem32.suballocator_commitSize which, to be honest, I haven't looked at too closely. But it seems we commit only a portion of the initial reservation size, then commit more when we run out.

Overall, I think we should consider increasing the region size (a static increase), but also look at a dynamic policy where the region size grows as we allocate more (we do something similar for class memory segments).
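For illustration, a dynamic policy could be as simple as doubling the region size up to a cap each time the suballocator has to grow; this is only a sketch of the idea, not existing OMR code, and the names are hypothetical:

#include <stdint.h>

/* Hypothetical growth policy: double the next low-4GB region up to a cap,
 * so sustained allocation pressure triggers progressively fewer of the
 * costly sub-4G reservations. */
static uintptr_t
nextSubAllocatorRegionSize(uintptr_t currentRegionSize, uintptr_t maxRegionSize)
{
    uintptr_t next = currentRegionSize * 2;
    return (next > maxRegionSize) ? maxRegionSize : next;
}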

ThanHenderson commented 5 months ago

Perhaps, there is a size that is more optimal for vthread heavy workloads.

@tajila is there a collection of vthread-heavy workloads somewhere? I've only been running the Skynet stress benchmark, and I am not observing any difference from tweaking the commitSize and initialSize.

tajila commented 5 months ago

You can use helidon nima https://github.com/tomas-langer/helidon-nima-example.

I would configure jmeter to hit the endpoints (you can toggle the amount of load). The relevant endpoint in this case would be http://localhost:8080/parallel?...

I noticed pretty significant deltas between compressedrefs and non-compressedrefs with this example

ThanHenderson commented 5 months ago

I am noticing differences for helidon nima when increasing the initial region size: -Xgc:suballocatorInitialSize=<#>. No noticeable difference when tweaking the commit size: -Xgc:suballocatorCommitSize=<#>.

I was testing by spawning a new server 20 times and sending a curl -i http://localhost:8080/parallel\?count\=10000 request once to each server. On my machine, with -Xnocompressedrefs, the requests averaged 5.91 seconds. With compressedrefs enabled and the default values for initialSize (200 MB) and commitSize (50 MB), the requests averaged 11.87 seconds.

I iterated over initialSize values from 200 to 1000 MB with a step size of 100 MB. The table below shows the average values from those iterations:

initialSize (MB) Avg (s)
200 (default) 11.87
300 10.55
400 8.77
500 7.62
600 7.17
700 8.14
800 8.90
900 8.63
1000 8.46

On this workload, a static increase does show some benefit. It would be beneficial, though, to test this on a set of workloads that show observable improvements, which may motivate a dynamic solution.

ThanHenderson commented 3 months ago

@dmitripivkine @amicic Do you guys know of any other workloads that would be important to test and could stress the change to the initial reservation size for the suballocator? Moreover, are there any consequences or reasons not to increase suballocatorInitialSize that we should consider?

dmitripivkine commented 3 months ago

@dmitripivkine @amicic Do you guys know of any other workloads that would be important to test and could stress the change to the initial reservation size for the suballocator? Moreover, are there any consequences or reasons not to increase suballocatorInitialSize that we should consider?

This is more of a VM question than a GC question. The suballocator area located below the 4G bar is used for artifacts that must be 32-bit addressable for compressed refs (J9Class, J9VMThread, etc.). The only GC aspect is to not prevent the most performant 0-shift runs, where the entire heap is located below the 4G bar as well. So, by taking more memory for the suballocator initially, you can compromise 0-shift runs. Also, it is not clear to me how exactly the suballocator initial size affects your test performance.

ThanHenderson commented 3 months ago

Also it is not clear for my how exactly Suballocator initial size effects your test performance.

A larger initial size reduces the number of subsequent sub-4G reservations, which can be costly because of how sub-4G memory is reserved (a brute-force looping strategy).

dmitripivkine commented 3 months ago

A larger initial size reduces the number of subsequent sub-4G allocations which can be costly due to how -- a brute force looping strategy -- sub-4G memory is reserved.

Your comment is not very practical, I'm afraid. The question is where the bottleneck is.

We looked at the code with @tajila and it seems the logic is:

Would you please try changing this hardcoded value of 8m to a larger one (50m or more) and measure performance again?

babsingh commented 3 months ago

If increasing HEAP_SIZE_BYTES leads to better perf, then we can look into adding an OpenJ9 cmdline option to specify the value of HEAP_SIZE_BYTES similar to -Xgc:suballocatorInitialSize/suballocatorCommitSize.

Note: On AIX, HEAP_SIZE_BYTES is set to 256 MB; on all other platforms, it is set to 8 MB.

dmitripivkine commented 3 months ago

There is another possibility you can try (please do it separately from increasing the 8m size, for a clean result). In the same line of code I mentioned before, the last parameter in allocateRegion() is vmemAllocOptions, and currently it is set to 0. Would you please try setting it to OMRPORT_VMEM_ALLOC_QUICK and measure performance again? I presume you are testing on Linux, so this option has an effect on Linux only (but it can be provided for all platforms). If it is set, instead of doing the memory search loop, the port library code opens the process's memory map file and searches for a suitable memory range there.

ThanHenderson commented 3 months ago

I added an -Xgc:suballocatorIncrementSize option to replace HEAP_SIZE_BYTES and tested over a range from 8 MB to 512 MB by running the same test as in https://github.com/eclipse/omr/issues/7190#issuecomment-2067372662 (don't compare values between the comments, only relative to the intra-comment baseline).

Baseline with -Xnocompressedrefs: 8.42 s

The following table has the results from the compressedrefs runs:

incrementSize (MB) Avg (s)
8 (default) 15.45
16 12.32
32 10.63
64 9.85
128 9.44
256 9.50
512 9.09

ThanHenderson commented 3 months ago

There is also this reference here: https://github.com/eclipse/omr/blob/b5ef5eda4680b6b5cf0c2f954362f9f47353ce04/port/common/omrmem32helpers.c#L50

Is the VMDESIGN 1761 document still around?

babsingh commented 3 months ago

Is the VMDESIGN 1761 document still around?

@pshipton might have a local backup.

pshipton commented 3 months ago

See https://ibm.ent.box.com/file/1073877430132 for the VMDesign DB html. I've copied some of 1761 below.

Introduction

Currently, the J9Class referenced in the object header is allocated with the port library call j9mem_allocate_memory. On a 64-bit platform, this call may return memory outside of the low 4GB region, which is addressable with only the least significant 32 bits. The goal is to have the J9Class reside in the low 4GB so that we can compress the class pointer in the object header. This design focuses on modifying the VM so that RAM class segments are addressable by the low 32 bits on a 64-bit platform. Once this is achieved, class pointer compression can be implemented in the VM and GC.

High Level Design

Currently, we have port library calls j9mem_allocate_memory32 and j9mem_free_memory32 that do what we want, but in a pretty inefficient way. The function j9mem_allocate_memory32 uses a brute-force algorithm that linearly scans the 4GB region, attempting to allocate memory at locations starting from the beginning, until it either gets a successful allocation or reaches the end, indicating we are out of virtual address space in the low 4GB. Obviously, they are not suitable for frequent use, as in the case of RAM class segment allocation. They will be rewritten to use a sub-allocator mechanism through the port library heap functions (implemented in ).

Define the low-32 heap structure as

typedef struct J9SubAllocateHeap32 {
    UDATA totalSize; //total size in bytes of the sub-allocated heap
    UDATA occupiedSize; //total occupied heap size in bytes
    J9HeapWrapper* firstHeapWrapper; //wrapper struct to form a linked list of heaps
    j9thread_monitor_t subAllocateHeap32Mutex; //monitor to serialize access
} J9SubAllocateHeap32;

where J9HeapWrapper is defined as:

typedef struct J9HeapWrapper {
    J9HeapWrapper* nextHeapWrapper; //link to next heap wrapper in chain
    J9Heap* heap; //start address of the heap
    UDATA heapSize; //size of the heap
    J9PortVmemIdentifier vmemID; //vmem identifier for the backing storage, needed to free the heap
} J9HeapWrapper;

J9SubAllocateHeap32 will be stored in J9PortPlatformGlobals and ifdef'ed by defined(J9VM_ENV_DATA64) && !defined(J9ZOS390). Each J9HeapWrapper can be either malloc'ed, or allocated as a header to a J9Heap when we first allocate the backing storage for a heap.

Port library startup.

We only initialize the fields in J9SubAllocateHeap32; no backing storage is allocated (firstHeapWrapper = NULL).

allocate memory from the heap

Note: for z/OS, the OS already provides an API to allocate in the low 4GB region (malloc31 and free). The following design for allocate_memory32 and free_memory32 is for non-z/OS platforms. Call j9mem_allocate_memory32(struct J9PortLibrary *portLibrary, UDATA byteAmount, char *callSite), which will attempt to walk a heap starting from firstFreeBlock to find a free block large enough. If firstHeapWrapper == NULL, allocate a memory region as the initial heap in the low 4GB using the existing brute-force scanning algorithm described above (the default size of the memory region is 50 MB; SPECjbb runs will probably help us with this estimate), initialize firstHeapWrapper, and call the heap function j9mem_heap_create to initialize the heap. We then iterate through the linked list of heaps and call j9mem_heap_allocate on each heap until the allocation request is met (a sketch of this flow follows the two cases below).

There are two cases where an allocation cannot be satisfied:

After traversing the list of heaps, we cannot find a free block large enough. We then allocate a new regular-sized heap and the requested block will be sub-allocated from there. The newly allocated heap is prepended at the beginning of the heap list, and totalSize, occupiedSize and firstHeapWrapper will be refreshed.

The requested size is larger than the heap size. In this case, we just allocate the requested size using the existing brute-force algorithm described previously and don't bother initializing it as a heap. Its J9HeapWrapper struct will have its heap field set to NULL, indicating it's not a valid J9Heap, and it will therefore be skipped when walking the list.
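For illustration, a hedged sketch of the allocation flow just described; the type and function names mirror the design text, but the helpers allocateInitialHeapWrapper and growSubAllocateHeap32, the exact signatures, and the control flow are paraphrased rather than the shipping code:

void *
j9mem_allocate_memory32(struct J9PortLibrary *portLibrary, UDATA byteAmount, char *callSite)
{
    J9SubAllocateHeap32 *state = &PPG_subAllocateHeap32; /* assumed platform-globals field */
    J9HeapWrapper *wrapper = NULL;
    void *mem = NULL;

    j9thread_monitor_enter(state->subAllocateHeap32Mutex);
    if (NULL == state->firstHeapWrapper) {
        /* First call: reserve the initial low-4GB region (default 50 MB) with
         * the brute-force scan, then initialize it via j9mem_heap_create. */
        state->firstHeapWrapper = allocateInitialHeapWrapper(portLibrary);
    }
    for (wrapper = state->firstHeapWrapper; NULL != wrapper; wrapper = wrapper->nextHeapWrapper) {
        if (NULL == wrapper->heap) {
            continue; /* oversized, non-heap region: skipped during the walk */
        }
        mem = j9mem_heap_allocate(portLibrary, wrapper->heap, byteAmount);
        if (NULL != mem) {
            break;
        }
    }
    if (NULL == mem) {
        /* Case 1: no heap has a large-enough free block -> prepend a new
         * regular-sized heap. Case 2: byteAmount exceeds the heap size ->
         * reserve a dedicated region whose wrapper has heap == NULL. */
        mem = growSubAllocateHeap32(portLibrary, state, byteAmount);
    }
    j9thread_monitor_exit(state->subAllocateHeap32Mutex);
    return mem;
}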

free memory from the heap by calling void j9mem_free_memory32(struct J9PortLibrary *portLibrary, void *memPointer)

When freeing a block, we first determine its containing heap by traversing the linked list of heaps. We assume that there would normally be only a few heaps along the chain, so this work should not introduce much overhead. After we obtain the parent heap pointer, we call j9mem_heap_free to free the block. After a successful allocation or free, the fields in J9SubAllocateHeap32 are adjusted accordingly.
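Similarly, a hedged sketch of the free path; the containment test and signatures are approximations of the design text, not the shipping code:

void
j9mem_free_memory32(struct J9PortLibrary *portLibrary, void *memPointer)
{
    J9SubAllocateHeap32 *state = &PPG_subAllocateHeap32; /* assumed platform-globals field */
    J9HeapWrapper *wrapper = NULL;

    j9thread_monitor_enter(state->subAllocateHeap32Mutex);
    /* Find the heap whose backing storage contains memPointer; the chain is
     * expected to stay short, so a linear walk is acceptable. */
    for (wrapper = state->firstHeapWrapper; NULL != wrapper; wrapper = wrapper->nextHeapWrapper) {
        uintptr_t start = (uintptr_t)wrapper->heap;
        uintptr_t end = start + wrapper->heapSize;
        if ((NULL != wrapper->heap) && ((uintptr_t)memPointer >= start) && ((uintptr_t)memPointer < end)) {
            j9mem_heap_free(portLibrary, wrapper->heap, memPointer);
            break;
        }
    }
    j9thread_monitor_exit(state->subAllocateHeap32Mutex);
}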

port library shutdown

We iterate through the list of heaps and free them by calling vmem_free_memory on each one.

Risks

It is possible that we may run on a system that doesn't have enough memory in the lower 4GB. This feature will be optional and controlled by build flags.

RAS Considerations

The existing debug extension for dumping all memory segments should still work with the sub-allocator due to the transparency of the port library call. The existing memory check code that performs sub-allocation will eventually be simplified by calling the port library sub-allocation routines.

pshipton commented 3 months ago

There is a link to design 1754 which I didn't copy here. Let me know if you have problems accessing the VMDesign DB html.

ThanHenderson commented 3 months ago

I added an -Xgc option that enables passing OMRPORT_VMEM_ALLOC_QUICK when calling allocateRegion().

Here are the results:

Configuration Avg (s)
Non-compressed refs 5.20
Compressed refs w/o ALLOC_QUICK 9.89
Compressed refs w/ ALLOC_QUICK 5.16

Using OMRPORT_VMEM_ALLOC_QUICK leads to suballocator performance that is at least on par with the non-compressed refs allocation (in this test). This seems like the best option among the techniques here, since it would be the least workload-dependent change.

As Dmitri mentioned, it is currently only implemented on Linux, so we would need to update the code in the omrvmem_reserve_memory_ex path in omrvmem.c for the other port libraries to do something similar to this and provide implementations for the procedures used therein.
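To make the shape of that change concrete, a hedged sketch of how the flag might be honored on another platform (only OMRPORT_VMEM_ALLOC_QUICK and the general reserve path come from the discussion above; the helper names are hypothetical):

/* Inside the platform's reservation path: if the caller asked for the quick
 * search, consult the OS's memory-map information for a free range instead
 * of probing candidate addresses in a loop. */
if (0 != (params->options & OMRPORT_VMEM_ALLOC_QUICK)) {
    /* findFreeRangeFromMemoryMap() is a hypothetical per-platform helper,
     * analogous to what the Linux implementation does with its map file. */
    void *candidate = findFreeRangeFromMemoryMap(portLibrary, params->startAddress, params->endAddress, params->byteAmount);
    if (NULL != candidate) {
        memoryPointer = reserveAtAddress(portLibrary, identifier, candidate, params); /* hypothetical */
    }
}
if (NULL == memoryPointer) {
    /* Fall back to the existing hint-and-check search loop. */
    memoryPointer = reserveWithAddressSearch(portLibrary, identifier, params); /* hypothetical */
}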

If enabled by default, we could maintain the -Xgc option to disable using OMRPORT_VMEM_ALLOC_QUICK.

Separately, I think it would be good to also keep the cmdline option that I have for controlling HEAP_SIZE_BYTES discussed here.