babsingh opened 11 months ago
@ThanHenderson Opened this issue to document our earlier discussion. All future updates can also be posted here.
fyi @tajila
Here's an update. (tldr; This doesn't appear to be an issue currently. And I think this can be closed.)
I'll answer some of the questions from above out-of-order:
While running a Java virtual thread benchmark named Skynet, poor performance is seen in omrmem_allocate_memory32 as the benchmark exhausts the native memory below 4G. This either results in a time out or an OutOfMemoryError.
I haven't been able to observe this while running hundreds of iterations of JDK21 builds on x86_64 Linux and Power Linux machines.
It takes too much time to search for memory as more memory is used...
Though our sub-4GB allocator isn't as efficient as the plain `malloc` call that the `size_t`-width referenced objects use, any limited sub-n GB allocation is inherently less efficient, because it requires looping and hinting to the underlying memory mapping system calls (at least on POSIX-compliant systems).
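For context, here is a minimal sketch of that hint-and-retry pattern (assuming Linux and POSIX `mmap`; this is my illustration, not the OMR implementation):

```c
/* Illustrative brute-force sub-4GB reservation loop (not the OMR code).
 * Each iteration hints the kernel at an address below the 4 GB bar and
 * rejects the mapping if the kernel placed it above the bar. */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define FOUR_GB ((uint64_t)1 << 32)
#define STEP    ((uint64_t)16 * 1024 * 1024) /* arbitrary probe step for the sketch */

static void *reserve_below_4g(size_t size)
{
	uint64_t hint;

	for (hint = STEP; (hint + size) <= FOUR_GB; hint += STEP) {
		void *mem = mmap((void *)(uintptr_t)hint, size, PROT_READ | PROT_WRITE,
		                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (MAP_FAILED == mem) {
			continue;
		}
		if (((uint64_t)(uintptr_t)mem + size) <= FOUR_GB) {
			return mem; /* landed below the bar */
		}
		munmap(mem, size); /* kernel ignored the hint; try the next slot */
	}
	return NULL; /* nothing suitable left below 4 GB */
}
```

An unconstrained allocation, by contrast, is a single call with no placement requirement, which is why the gap widens as the low address space fills up and more probes fail.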
Is it possible to control memory allocation below 4G?
I've embedded (without `LD_PRELOAD`) a custom implementation of `jemalloc` [1,2,3] wherein allocations are constrained below the 4GB boundary. I haven't found an allocator that allows this kind of constraint out-of-the-box, given the limitations discussed in the previous response. I chose `jemalloc` because: 1) its goal and design target minimizing lock contention; 2) it has positive results in various benchmarking reports, notably [4]; and 3) the project uses `make`, which integrates well into our build system (as opposed to other contenders like `tcmalloc` [5], which would add `bazel` as a dependency). Moreover, the only copyright condition is retaining their copyright notice.
Performance evaluation plan
Since I haven't been able to reproduce the Skynet regression, I also reached for the `renaissance` [6] test suite. I ran many iterations of Skynet and each of the `renaissance` benchmarks -- with a particular focus on the `memory-bound` and `contention` benchmarks -- both with and without the custom `jemalloc`, and did not observe any efficiency improvement (in terms of run time) or reduction in lock contention (which I couldn't reproduce in the first place).
In light of the above, I conclude that no action should be taken, since there is no observable problem with the current implementation of the sub-4GB allocator (at least on our tested workloads). And without positive performance data to back up the embedding, the cost of supporting, shipping, and maintaining a custom third-party allocator is not justified.
That being said, I will document -- in a comment below -- how I embedded `jemalloc` into OpenJ9 and OMR for future reference, and preserve my patches and preliminary custom implementation in case we ever want to revisit this.
I was also thinking we could explore using `jemalloc` as the default allocator, rather than `malloc`, for all other allocations. But preliminary testing here was neutral too, and modern `glibc` `malloc` implementations already deliver some of the advancements from other allocators, like local caches and arenas. So I don't think there is anything fruitful here currently either.
[1] https://github.com/jemalloc/jemalloc
[2] https://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf
[3] http://jemalloc.net/jemalloc.3.html
[4] http://ithare.com/testing-memory-allocators-ptmalloc2-tcmalloc-hoard-jemalloc-while-trying-to-simulate-real-world-loads/
[5] https://github.com/google/tcmalloc
[6] https://renaissance.dev/
For future interest/reference, here are the patches to embed a third-party allocator (well, `jemalloc` specifically, but it should be the same for others too).
OMR: https://github.com/ThanHenderson/omr/commit/94f023ba38c10ad38de8f655b0b409c5bc27fa92
OpenJ9: https://github.com/ThanHenderson/openj9/commit/a3aa879296a049fa3cbbaa7701c56314ff23c609
It is a little funky to get the `cmake` scripts in order; this solution works, but it would need more care if embedding for production. For example, `jemalloc` is built twice during the build process, once for OpenJ9 and once for OMR, but the build results land in the same directory; a better solution would be to check whether the library already exists before invoking the fetch and build commands.
To keep things internal to the project, I used `cmake`'s `FetchContent` functionality to pull from a hosted Git repo (this can be anything) and installed it locally rather than to one of the system library search paths. This required some configuring to force the use of, and make changes to, the `rpaths` of certain dependent targets, because using `target_link_libraries` wasn't sufficient on its own and `cmake` doesn't support `runpath`. `FetchContent` relies on a `CMakeLists.txt` file in the allocator's repo. Since `jemalloc` is `make`-based, I needed to create a wrapper file that simply invokes the appropriate build pipeline (`autogen.sh`, `configure`, `make`), e.g. as shown below:
```cmake
cmake_minimum_required(VERSION 3.1...3.28)

project(
  jemalloc
  VERSION 1.0
  LANGUAGES CXX)

add_custom_target(nuke
  COMMAND make clean
  WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
)

add_custom_target(jemalloc ALL
  COMMAND ./autogen.sh --disable-initial-exec-tls
  COMMAND make
  WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
)
```
`FetchContent` resolves the repo at configure time and automatically invokes the `CMakeLists.txt` file for the external project at build time. Projects that are not `make`-based or `cmake`-based likely involve a similar pipeline.
Regardless of the allocator used, OMR just expects that `malloc32` and `free32` procedures are exported. Due to how some targets are linked to the shared library, one needs to take care not to export other functionality -- `malloc`, `calloc`, `realloc`, `free`, etc. -- from the embedded allocator, so as not to interfere with other allocations. Doing so would likely cause downstream issues in the GC and elsewhere.
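To make the export concern concrete, here is a minimal sketch of a glue layer (my own illustration, not taken from the patches above) that exposes only `malloc32`/`free32` and keeps the embedded allocator's entry points out of the export table. It assumes the allocator was built with a symbol prefix such as `je_` and that this file is compiled with `-fvisibility=hidden`:

```c
/* Sketch of a glue layer for an embedded sub-4GB allocator.
 * Only malloc32/free32 are exported; the allocator's own entry points
 * stay hidden so they cannot shadow the system malloc/calloc/realloc/free.
 * Assumes a je_-prefixed allocator build and -fvisibility=hidden. */
#include <stddef.h>

/* Prefixed allocator API (illustrative; not exported from this library). */
extern void *je_malloc(size_t size);
extern void  je_free(void *ptr);

/* The only two symbols OMR expects from the sub-4GB allocator. */
__attribute__((visibility("default")))
void *malloc32(size_t byteAmount)
{
	return je_malloc(byteAmount);
}

__attribute__((visibility("default")))
void free32(void *memPointer)
{
	je_free(memPointer);
}
```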
Everything should be smooth sailing after that.
Thanks for your analysis @ThanHenderson
So it sounds like swapping our current allocator for another one is not the answer. However, perhaps there is a way to tune our allocator to be more optimal for vthread workloads? Have you looked into toggling the initial region size of the heaps? The amount that is initially committed? Perhaps there is a size that is more optimal for vthread-heavy workloads.
@tajila I have not, but it seems like there is an opportunity for a limit study here. Other than region size, are there any other parameters that you think I should look into?
There is also `PPG_mem_mem32_subAllocHeapMem32.suballocator_commitSize`, which TBH I haven't looked at too closely. But it seems like we commit only a portion of the initial reservation size, then commit more when we run out.
Overall, I think we should consider increasing the region size (a static increase), but also look at a dynamic policy where the region size grows as we allocate more (we do something similar for class memory segments).
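Purely as an illustration of the kind of dynamic policy meant here (the names and constants are hypothetical, not OpenJ9 code), the region size could grow geometrically with the total amount already sub-allocated, capped at some maximum:

```c
#include <stdint.h>

/* Hypothetical dynamic sizing policy: the next region grows with the
 * total memory already sub-allocated, similar in spirit to how class
 * memory segments grow. Constants are illustrative only. */
#define MIN_REGION_BYTES ((uintptr_t)8 * 1024 * 1024)    /* 8 MB   */
#define MAX_REGION_BYTES ((uintptr_t)256 * 1024 * 1024)  /* 256 MB */

static uintptr_t nextRegionSize(uintptr_t totalSubAllocatedBytes)
{
	uintptr_t size = MIN_REGION_BYTES;

	/* Double the region size for every doubling of overall usage, so
	 * vthread-heavy workloads quickly reach large regions while light
	 * workloads keep the low-4GB footprint small. */
	while ((size < MAX_REGION_BYTES) && ((size * 4) <= totalSubAllocatedBytes)) {
		size *= 2;
	}
	return size;
}
```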
Perhaps there is a size that is more optimal for vthread-heavy workloads.
@tajila is there a collection of vthread-heavy workloads somewhere? I've only been running the Skynet stress benchmark, and am not observing any difference from tweaking the `commitSize` and `initialSize`.
You can use helidon nima https://github.com/tomas-langer/helidon-nima-example.
I would configure jmeter to hit the endpoints (you can toggle the amount of load). The relevant endpoint in this case would be http://localhost:8080/parallel?...
I noticed pretty significant deltas between compressedrefs and non-compressedrefs with this example
I am noticing differences for helidon nima when increasing the initial region size (`-Xgc:suballocatorInitialSize=<#>`). No noticeable difference when tweaking the commit size (`-Xgc:suballocatorCommitSize=<#>`).
I was testing by spawning a new server 20 times and sending a `curl -i http://localhost:8080/parallel\?count\=10000` request once on each server. On my machine, with `-Xnocompressedrefs` the requests averaged 5.91 seconds. With compressedrefs enabled, and the default values for `initialSize` (200 MB) and `commitSize` (50 MB), the requests averaged 11.87 seconds.
I iterated over `initialSize`s from 200 MB to 1000 MB with a step size of 100 MB. The table below shows the average values from those iterations:
| initialSize (MB) | Avg (s) |
|---|---|
| 200 (default) | 11.87 |
| 300 | 10.55 |
| 400 | 8.77 |
| 500 | 7.62 |
| 600 | 7.17 |
| 700 | 8.14 |
| 800 | 8.90 |
| 900 | 8.63 |
| 1000 | 8.46 |
On this workload, a static increase does show some benefit. It would be beneficial, though, to test this on a set of workloads that show observable improvements, since that may motivate a dynamic solution.
@dmitripivkine @amicic Do you guys know of any other workloads that could be important to test that could stress the change to the initial reservation size for the suballocator? Moreover, are there any consequences/reasons not to increase `suballocatorInitialSize` that we should consider?
@dmitripivkine @amicic Do you guys know of any other workloads that could be important to test that could stress the change to the initial reservation size for the suballocator? Moreover, are there any consequences/reasons not to increase `suballocatorInitialSize` that we should consider?
This is more of a VM question than a GC question. The suballocator area located below the 4G bar is used for artifacts that must be 32-bit addressable for compressed refs (`j9class`, `j9vmthread`, etc.). The only GC aspect is to not prevent the most performant 0-shift runs, where the entire heap is located below the 4G bar as well. So, by taking more memory for the suballocator initially, you can compromise 0-shift runs. Also, it is not clear to me how exactly the suballocator initial size affects your test performance.
Also, it is not clear to me how exactly the suballocator initial size affects your test performance.
A larger initial size reduces the number of subsequent sub-4G allocations, which can be costly due to how sub-4G memory is reserved (a brute-force looping strategy).
A larger initial size reduces the number of subsequent sub-4G allocations, which can be costly due to how sub-4G memory is reserved (a brute-force looping strategy).
Your comment is not very practical, I'm afraid. The question is where the bottleneck is.
We looked at the code with @tajila and the logic seems to be:

1. reserve the initial suballocator memory up front (controlled by `-Xgc:suballocatorInitialSize`)
2. commit it in chunks as it is used (controlled by `-Xgc:suballocatorCommitSize`)
3. once the initial reservation is exhausted, allocate additional regions with a hardcoded size of 8m (`HEAP_SIZE_BYTES`)

Would you please try to change this hardcoded value of 8m to a larger one (50m or more) and measure performance again?
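To make that logic concrete, here is a small sketch (the constants, function name, and parameters are illustrative; only the option names, default values, and `HEAP_SIZE_BYTES` come from this thread):

```c
#include <stdint.h>

/* Illustrative view of the sizing logic described above. Defaults are the
 * values mentioned in this thread (200 MB / 50 MB / 8 MB); the real ones
 * come from -Xgc:suballocatorInitialSize, -Xgc:suballocatorCommitSize and
 * the hardcoded HEAP_SIZE_BYTES. */
static uintptr_t initialSize = (uintptr_t)200 * 1024 * 1024; /* -Xgc:suballocatorInitialSize */
static uintptr_t commitSize  = (uintptr_t)50 * 1024 * 1024;  /* -Xgc:suballocatorCommitSize  */
static uintptr_t heapSize    = (uintptr_t)8 * 1024 * 1024;   /* HEAP_SIZE_BYTES              */

/* How much new sub-4GB memory must be reserved before the next allocation.
 * Returns 0 while the initial reservation still has uncommitted room (only
 * another commitSize chunk needs committing). */
static uintptr_t nextReservationSize(uintptr_t reservedBytes, uintptr_t committedBytes)
{
	if ((committedBytes + commitSize) <= reservedBytes) {
		return 0; /* still inside the existing reservation */
	}
	return heapSize; /* reservation exhausted: reserve another hardcoded-size region */
}
```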
If increasing `HEAP_SIZE_BYTES` leads to better perf, then we can look into adding an OpenJ9 cmdline option to specify the value of `HEAP_SIZE_BYTES`, similar to `-Xgc:suballocatorInitialSize`/`suballocatorCommitSize`.

Note: On AIX, `HEAP_SIZE_BYTES` is set to 256 MB; on all other platforms, it is set to 8 MB.
There is another possibility you can try (please do it separately from increasing the 8m size, for a clean result). In the same line of code I mentioned before, the last parameter of `allocateRegion()` is `vmemAllocOptions`, and it is currently set to 0. Would you please try setting it to `OMRPORT_VMEM_ALLOC_QUICK` and measure performance again? I presume you are testing on Linux, so this option has an effect on Linux only (but it can be provided for all platforms). If it is set, instead of doing the memory search loop, the port library code opens the process's memory map file and searches for a suitable memory range there.
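For reference, the effect of `OMRPORT_VMEM_ALLOC_QUICK` on Linux can be pictured with a rough sketch like the following (my illustration, not the actual port library code): rather than probing addresses one by one, parse `/proc/self/maps` and pick a free gap below the 4 GB bar directly.

```c
/* Sketch: locate a free range below 4 GB by reading /proc/self/maps.
 * This only illustrates the idea behind OMRPORT_VMEM_ALLOC_QUICK; the
 * real logic lives in OMR's omrvmem code. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define FOUR_GB ((uint64_t)1 << 32)

/* Returns a candidate start address for 'size' bytes below 4 GB, or 0. */
static uint64_t find_gap_below_4g(uint64_t size)
{
	FILE *maps = fopen("/proc/self/maps", "r");
	uint64_t prevEnd = 0x10000; /* skip the typically reserved low pages */
	uint64_t start = 0;
	uint64_t end = 0;
	char line[256];

	if (NULL == maps) {
		return 0;
	}
	/* Entries are sorted; each line starts with "start-end perms ...". */
	while (NULL != fgets(line, sizeof(line), maps)) {
		if (2 != sscanf(line, "%" SCNx64 "-%" SCNx64, &start, &end)) {
			continue;
		}
		if (start >= FOUR_GB) {
			break; /* everything from here on is above the bar */
		}
		if ((start > prevEnd) && ((start - prevEnd) >= size)) {
			fclose(maps);
			return prevEnd; /* the gap [prevEnd, start) is large enough */
		}
		if (end > prevEnd) {
			prevEnd = end;
		}
	}
	fclose(maps);
	/* Check the tail of the low 4 GB after the last mapping below the bar. */
	if ((prevEnd < FOUR_GB) && ((FOUR_GB - prevEnd) >= size)) {
		return prevEnd;
	}
	return 0;
}
```

The returned address would then be used as a hint for the actual reservation, so at most a handful of system calls are needed instead of a long probe loop.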
I added an `-Xgc:suballocatorIncrementSize` option to replace `HEAP_SIZE_BYTES` and tested over a range from 8 MB to 512 MB by running the same test as in https://github.com/eclipse/omr/issues/7190#issuecomment-2067372662 (don't compare values between the comments, only relative to the intra-comment baseline).

Baseline with `-Xnocompressedrefs`: 8.42 seconds.

The following table has the results from the compressedrefs runs:
| incrementSize (MB) | Avg (s) |
|---|---|
| 8 (default) | 15.45 |
| 16 | 12.32 |
| 32 | 10.63 |
| 64 | 9.85 |
| 128 | 9.44 |
| 256 | 9.50 |
| 512 | 9.09 |
There is also this reference here: https://github.com/eclipse/omr/blob/b5ef5eda4680b6b5cf0c2f954362f9f47353ce04/port/common/omrmem32helpers.c#L50
Is the VMDESIGN 1761 document still around?
Is the VMDESIGN 1761 document still around?
@pshipton might have a local backup.
See https://ibm.ent.box.com/file/1073877430132 for the VMDesign DB html. I've copied some of 1761 below.
Introduction
Currently, the J9Class referenced in the object header is allocated with the port library call j9mem_allocate_memory. On a 64-bit platform, this call may return memory outside of the low 4GB region, which is addressable only by the least significant 32 bits. The goal is to have the J9Class reside in the low 4GB so that we can compress the class pointer in the object header. This design focuses on modifying the VM so that RAM class segments are addressable by the low 32 bits on a 64-bit platform. Once this is achieved, class pointer compression can be implemented in the VM and GC.
High Level Design
Currently, we have port library calls j9mem_allocate_memory32 and j9mem_free_memory32 that do what we want, but in a pretty inefficient way. The function j9mem_allocate_memory32 uses a brute-force algorithm that linearly scans the 4GB region, attempting to allocate memory at locations starting from the beginning, until it either gets a successful allocation or has reached the end, indicating we are out of virtual address space in the low 4GB. Obviously, these are not suitable for frequent use, as in the case of RAM class segment allocation. They will be rewritten to use a sub-allocator mechanism through the port library heap functions (implemented in ).
Define the low-32 heap structure as
```c
typedef struct J9SubAllocateHeap32 {
	UDATA totalSize; // total size in bytes of the sub-allocated heap
	UDATA occupiedSize; // total occupied heap size in bytes
	J9HeapWrapper* firstHeapWrapper; // wrapper struct to form a linked list of heaps
	j9thread_monitor_t subAllocateHeap32Mutex; // monitor to serialize access
} J9SubAllocateHeap32;
```
where J9HeapWrapper is defined as:
```c
typedef struct J9HeapWrapper {
	J9HeapWrapper* nextHeapWrapper; // link to next heap wrapper in chain
	J9Heap* heap; // start address of the heap
	UDATA heapSize; // size of the heap
	J9PortVmemIdentifier vmemID; // vmem identifier for the backing storage, needed to free the heap
} J9HeapWrapper;
```
J9SubAllocateHeap32 will be stored in J9PortPlatformGlobals and ifdef'ed by defined(J9VM_ENV_DATA64) && !defined(J9ZOS390). Each J9HeapWrapper can be either malloc'ed, or allocated as a header to a J9Heap when we first allocate the backing storage for a heap.
Port library startup.
We only initialize fields in J9SubAllocateHeap32; no backing storage is allocated (firstHeapWrapper = NULL).
allocate memory from the heap
Note: for z/OS, the OS already provides an API to allocate in the low 4GB region (malloc31 and free). The following design for allocate_memory32 and free_memory32 is for non-z/OS platforms. Call j9mem_allocate_memory32(struct J9PortLibrary *portLibrary, UDATA byteAmount, char *callSite), which will attempt to walk a heap starting from firstFreeBlock to find an empty block large enough. If firstHeapWrapper == NULL { allocate a memory region as the initial heap in the low 4GB using the existing brute-force scanning algorithm described above. The default size of the memory region is 50 MB; SPECjbb runs will probably help us with this estimate. Initialize firstHeapWrapper and call heap function j9mem_heap_create to initialize the heap. } We iterate through the linked list of heaps and call j9mem_heap_allocate on each heap until the allocation request is met.
There are two cases where an allocation cannot be satisfied:
After traversing the list of heaps, we cannot find a free block large enough. We then allocate a new regular-sized heap and the requested block will be sub-allocated from there. The newly allocated heap is prepended at the beginning of the heap list, and totalSize, occupiedSize and firstHeapWrapper will be refreshed.
The requested size is larger than the heap size. In this case, we just allocate the requested size using the existing brute-force algorithm described previously and don't bother initializing it as a heap. Its J9HeapWrapper struct will have its heap field set to NULL, indicating it's not a valid J9Heap and therefore will be skipped when walking the list.
free memory from the heap by calling void j9mem_free_memory32(struct J9PortLibrary *portLibrary, void *memPointer)
When freeing a block, we first determine its containing heap by traversing the linked list of heaps. We assume that there would normally be only a few heaps along the chain, so this work should not introduce much overhead. After we obtain the parent heap pointer, we call j9mem_heap_free to free the block. After a successful allocation or free, fields in J9SubAllocateHeap32 are adjusted accordingly.
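Putting the pieces above together, here is a compressed sketch of the allocate path as described (locking, the oversized-allocation case, and error handling are elided; j9mem_heap_allocate is named in the design text, while reserveAndWrapNewHeap32 is a hypothetical stand-in for the "allocate a new regular-sized heap" step, and the signatures shown are assumptions for the sketch):

```c
/* Sketch of the sub-allocator allocate path from this design (not the
 * shipping code). Helper prototypes below are assumptions for the sketch. */
void *j9mem_heap_allocate(J9Heap *heap, UDATA byteAmount);
J9HeapWrapper *reserveAndWrapNewHeap32(J9SubAllocateHeap32 *state, UDATA byteAmount);

static void *subAllocate32(J9SubAllocateHeap32 *state, UDATA byteAmount)
{
	J9HeapWrapper *wrapper = NULL;
	void *mem = NULL;

	/* Walk the chain of heaps and try to satisfy the request from one of them. */
	for (wrapper = state->firstHeapWrapper; NULL != wrapper; wrapper = wrapper->nextHeapWrapper) {
		if (NULL == wrapper->heap) {
			continue; /* oversized-allocation wrapper, not a real J9Heap; skip */
		}
		mem = j9mem_heap_allocate(wrapper->heap, byteAmount);
		if (NULL != mem) {
			break;
		}
	}
	if (NULL == mem) {
		/* No heap had a large enough free block: reserve a new regular-sized
		 * region below 4 GB, initialize it as a J9Heap, and prepend it. */
		wrapper = reserveAndWrapNewHeap32(state, byteAmount);
		if (NULL != wrapper) {
			mem = j9mem_heap_allocate(wrapper->heap, byteAmount);
		}
	}
	if (NULL != mem) {
		state->occupiedSize += byteAmount; /* bookkeeping, simplified for the sketch */
	}
	return mem;
}
```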
port library shutdown
We iterate through the list of heaps and free them by calling vmem_free_memory on each one.
Risks
It is possible that we may run on a system that doesn't have enough memory in the lower 4GB. This feature will be optional and controlled by build flags.
RAS Considerations
The existing debug extension for dumping all memory segments should still work with the sub-allocator due to the transparency of the port library call. The existing memory check code that performs sub-allocation will eventually be simplified by calling the port library sub-allocation routines.
There is a link to design 1754 which I didn't copy here. Let me know if you have problems accessing the VMDesign DB html.
I added an `-Xgc` option that enables passing `OMRPORT_VMEM_ALLOC_QUICK` when calling `allocateRegion()`.
Here are the results:

| | Avg (s) |
|---|---|
| Non-compressed refs | 5.20 |
| Compressed refs w/o ALLOC QUICK | 9.89 |
| Compressed refs w/ ALLOC QUICK | 5.16 |
Using `OMRPORT_VMEM_ALLOC_QUICK` leads to suballocator performance that is at least on par with the non-compressed refs allocation (in this test). This seems like the best option out of the techniques here, since it would be the least workload-dependent change.

Like Dmitri mentioned, it is currently only implemented on Linux, so we would need to update the `omrvmem_reserve_memory_ex` path in `omrvmem.c` for the other port libs to something similar to this and provide implementations for the procedures used therein.
If enabled by default, we could maintain the `-Xgc` option to disable using `OMRPORT_VMEM_ALLOC_QUICK`.

Separately, I think it would be good to also keep the cmdline option that I have for controlling `HEAP_SIZE_BYTES`, discussed here.
Problem
`omrmem_allocate_memory32` allocates native memory below 4G. While running a Java virtual thread benchmark named Skynet, poor performance is seen in `omrmem_allocate_memory32` as the benchmark exhausts the native memory below 4G. This either results in a time out or an OutOfMemoryError.

Potential Issues
Two Potential Solutions
1. Improve the current `omrmem_allocate_memory32` implementation; or
2. Replace it with an existing memory allocator.

Approach 2 is preferred since there are existing implementations for the memory allocator, which address the above perf issues.
Examples of Existing Memory Allocators
Verify Feasibility of Approach 2