eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.

Investigate possible memory leak at JITServer #12659

Open mpirvu opened 3 years ago

mpirvu commented 3 years ago

@AlexeyKhrabrov showed data suggesting that JITServer may be leaking memory.

dmitry-ten commented 3 years ago

Do these memory leaks persist even after the client disconnects? Since we use per-client persistent memory allocations for most of the compilation, if the leaking memory is inside a per-client allocation region, it should go away once the client disconnects. If the leak persists even then, it's most likely caused by some global persistent allocation. This could significantly shorten the list of possible problems.

mpirvu commented 3 years ago

Yes, the memory stays allocated after all clients disconnect. Also, I asked, and the problem shows up even when we don't share ROMClasses across JVM clients.

AlexeyKhrabrov commented 3 years ago

I'll just show the data. This graph shows the RSS of the jitserver process over time while it serves a total of 1600 AcmeAir clients, up to 64 at a time. The middle of the graph, where average memory usage grows slowly, is the steady state. At shutdown, after all clients have disconnected, the RSS is 2.3 GB, so we seem to be leaking up to ~1.4 MB per client.

(attached graph: mem_process)

Here is the jitserver's javacore before shutdown. Most of the 2.3 GB doesn't seem to be accounted for in the javacore, which suggests that it might be malloc() allocations. As far as I understand, the main users of malloc() at the jitserver are the std::strings and std::vectors used in message serialization/deserialization. There is also a possibility that the malloc() heap is holding on to a lot of memory even after it has been freed.

Attached: javacore.20210423.020814.34241.0001.txt

dmitry-ten commented 3 years ago

Issue #5645 listed some previous memory leaks and #7937 fixed one of them. Another bug similar to that one, i.e. a std::string allocated on the system heap during message exchange and never freed, seems possible.

dmitry-ten commented 3 years ago

If system heap allocations during message exchange are the culprit, we could probably rewrite the serialization/deserialization code to work with vectors and strings allocated from the persistent/heap regions. I think the original reason we used the STL implementations was grpc/protobuf, but since that has long been removed, there is no need to keep using memory that cannot be accounted for by the JVM. First, though, we'd need to confirm that this is indeed the cause of the leaks; instrumenting the code to print a stack trace at each allocation, like Marius did in the above issue, seems like it should work. I'd be worried that it would make execution too slow, though.
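For illustration, the kind of change I mean is an STL-compatible allocator that forwards to the persistent allocator, so the existing std::string/std::vector code mostly keeps working. This is only a sketch: persistentAllocate()/persistentDeallocate() are placeholders (backed by malloc here just so the snippet is self-contained), not the real JIT persistent allocator interface.

// Sketch only: an STL-compatible allocator that forwards to a persistent-memory
// allocation routine, so the std::string/std::vector objects used in message
// (de)serialization become memory the JVM can account for.
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

void *persistentAllocate(size_t bytes) { return malloc(bytes); } // placeholder only
void  persistentDeallocate(void *p)    { free(p); }              // placeholder only

template <typename T>
struct PersistentSTLAllocator
   {
   typedef T value_type;
   PersistentSTLAllocator() {}
   template <typename U> PersistentSTLAllocator(const PersistentSTLAllocator<U> &) {}
   T *allocate(size_t n) { return static_cast<T *>(persistentAllocate(n * sizeof(T))); }
   void deallocate(T *p, size_t) { persistentDeallocate(p); }
   };

template <typename T, typename U>
bool operator==(const PersistentSTLAllocator<T> &, const PersistentSTLAllocator<U> &) { return true; }
template <typename T, typename U>
bool operator!=(const PersistentSTLAllocator<T> &, const PersistentSTLAllocator<U> &) { return false; }

// Drop-in aliases for the types used during message serialization/deserialization.
typedef std::basic_string<char, std::char_traits<char>, PersistentSTLAllocator<char> > PersistentString;
typedef std::vector<char, PersistentSTLAllocator<char> > PersistentCharVector;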

mpirvu commented 3 years ago

If I remember correctly, I instrumented the redefinition of the new operator:

#if !defined(J9VM_OPT_JITSERVER)
void *operator new(size_t size)
   {
#if defined(DEBUG)
   #if LINUX
   // glibc allocates something at dl_init; check if a method is being compiled to avoid
   // getting assumes at _dl_init
   if (firstCompileStarted)
   #endif
      {
      printf( "\n*** ERROR *** Invalid use of global operator new\n");
      TR_ASSERT(0,"Invalid use of global operator new");
      }
#endif
   return malloc(size);
   }

mpirvu commented 3 years ago

Another idea is to use Address Sanitizer. I have no experience with it though.

AlexeyKhrabrov commented 3 years ago

I did some more digging into this issue using malloc_stats() and instrumenting the JIT's global operator new defined in J9Compilation.cpp. It looks like #12458 is the root cause. Some of the data cached in the client session is stored in std::strings allocated with malloc rather than the persistent allocator and still needs to be deallocated.

So it was my own change that led to this issue. The optimization in #12458 is only enabled for the AOT cache (I also enabled it for the non-AOT-cache case in my experiments), so this doesn't actually affect the current mainline code. I will submit a PR with a fix when I have time.

However, even after the memory leak is fixed, there is still some non-negligible malloc footprint left, and a lot of fragmentation in the malloc heap. malloc_stats reports the following before jitserver shutdown in this experiment:

Total (incl. mmap):
system bytes     =  238596096
in use bytes     =  122283216

So there is ~230 MB of memory footprint due to malloc usage. I'm not sure how much of the ~120 MB "in use" memory is used by actual allocations rather than internal malloc data structures. I suspect that a lot of this memory overhead can be eliminated if we don't use std::strings and std::vectors in message serialization and client session caches.
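For reference, malloc_stats() is the glibc call that prints the per-arena and total "system bytes" / "in use bytes" counters quoted above to stderr. A minimal standalone sketch (not the actual jitserver instrumentation, which calls it before shutdown rather than from main()):

// Standalone sketch of the glibc malloc introspection referred to above.
#include <malloc.h>   // malloc_stats() is glibc-specific
#include <cstdlib>

int main()
   {
   // Create some allocation churn so the report is non-trivial; keep every other
   // block live to produce a visible gap between "system" and "in use" bytes.
   for (int i = 0; i < 1000; ++i)
      {
      void *p = malloc(64 * 1024);
      if (i % 2 == 0)
         free(p);
      }
   // Prints per-arena and total "system bytes" / "in use bytes" to stderr.
   malloc_stats();
   return 0;
   }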

dmitry-ten commented 3 years ago

OK, then I guess that explains why I couldn't find any actual memory leaks after instrumenting the global operator new. I did my own experiment where I ran AcmeAir with load for 50 iterations and compared the server's idle memory usage. Here are my results:

JITServer before any clients connect: 31.7MB
JITServer after 1 AcmeAir run: 49.9MB
JITServer after 50 AcmeAir runs: 78MB
Javacore delta: 28.89 MB

All of the extra memory used is visible in the javacore, and it's all persistent memory, which is strange; all of the persistent memory segments are marked as in use as well. This is not the expected behavior, because JITServer uses per-client persistent allocators, so the vast majority of persistent allocations should be destroyed once clients disconnect. The remaining global allocations are not numerous, and almost all of them occur during the first client run (e.g. allocations in the Idiom Recognition code). I collected global persistent memory stats at each client disconnect, and I don't see any noticeable increases, as expected. It's possible that some large allocations are not accounted for in the stats, which would be the case if you access the allocator directly instead of going through PersistentMemory, but that seems unlikely.

mpirvu commented 3 years ago

@dmitry-ten Do you see a trend with each additional run? I.e. does the increase from 49.9 MB to 78 MB happen gradually (more or less)?

dmitry-ten commented 3 years ago

Yes, it happens gradually, i.e. we gain around 0.2-1MB of memory per run.

dmitry-ten commented 3 years ago

The most common global persistent memory allocations done on the jitserver should be MessageBuffer allocations, which happen whenever a buffer needs to be created or expanded to fit a large message. Other common allocations are the optimization plans, which are allocated and freed once per compilation.

dmitry-ten commented 3 years ago

I determined that the increase in persistent memory consumption is due to fragmentation of the global persistent memory. I instrumented the global persistent allocator to keep track of how much memory is actively in use and how much memory has been allocated from the segment allocator, which is the underlying storage of persistent memory. I printed the stats in the ClientSessionData destructor to measure usage at the end of each client run.
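Roughly, the bookkeeping looks like the sketch below (illustrative only; the names and call sites are not the real persistent allocator interface):

// Illustrative sketch: one counter tracks live allocations, the other tracks how
// much backing storage has been requested from the segment allocator; the gap
// between the two is fragmentation plus free-list overhead.
#include <atomic>
#include <cstdio>
#include <cstddef>

static std::atomic<size_t> bytesInUse(0);          // currently allocated and not yet freed
static std::atomic<size_t> bytesFromSegments(0);   // total obtained from the segment allocator

void noteSegmentAllocated(size_t segmentSize) { bytesFromSegments += segmentSize; } // e.g. 1 MB segments
void noteAllocate(size_t size)                { bytesInUse += size; }
void noteDeallocate(size_t size)              { bytesInUse -= size; }

// Called at the end of each client run (e.g. from the ClientSessionData destructor)
// to sample both counters.
void printGlobalPersistentMemoryStats()
   {
   fprintf(stderr, "Global persistent memory in use: %zu B, allocated from segments: %zu B\n",
           bytesInUse.load(), bytesFromSegments.load());
   }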

Results:
Global persistent memory in use: 1472B (does not increase over iterations, does not include allocations done at startup)
Global persistent memory allocated after 1 AcmeAir run: ~2MB
Global persistent memory allocated after 10 AcmeAir runs: ~12MB

The amount of allocated memory increases by ~10 MB over 10 runs even though there are no memory leaks, which confirms that the memory is indeed fragmented. This happens because MessageBuffer expansion interacts badly with the persistent allocator implementation.

MessageBuffer capacity starts at 18K and expands to fit the message. In my experiment, it reached the max capacity of ~750K. Meanwhile, persistent allocator requests memory segments of size 1MB, creates blocks inside them and allocates memory from these blocks. When memory is freed, the block is added to the free list, so that it can be reused for future allocations.

However, the catch is that when the allocator creates a new segment, it will create a block of the requested size for the current allocation, and any excess memory will be split into a second block that will be immediately added to the free list. So, what happens as message buffers get created, expanded and destroyed?

As we create larger and larger buffers, eventually we have to create a new segment and immediately use most of it for the buffer. Once the communication stream is destroyed (which can happen at client shutdown or when a compilation thread suspends), the buffer is destroyed and the large block is added to the free list. When we create a new stream, its buffer is initially much smaller, so it takes the large free block and splits it in two. As the buffer expands, it keeps splitting the large free block into smaller and smaller chunks until it needs to request a new segment again. At this point we have one large newly allocated block and a whole bunch of smaller free blocks that can never be used for large buffers. This can continue indefinitely: whenever a large block is freed, it is likely to be split some time before we need a block that large again, causing the allocator to request a new segment. And since the global allocator does relatively few allocations, most of the small blocks will never be used again.

AlexeyKhrabrov commented 3 years ago

I think this can be improved if we implement buffer expansion slightly differently. Currently it's expanded based on the size of the message that doesn't fit into it: https://github.com/eclipse-openj9/openj9/blob/f3768f178f7e074395c73c90ffe3a9aefdcc3558/runtime/compiler/net/MessageBuffer.cpp#L38-L55

Instead, we can set the starting capacity to a power of 2 (e.g. 16K or 32K) and, during expansion, round the required capacity up to a power of 2. Most of the time this results in doubling the capacity on expansion. This way, the freed chunks come from a small set of sizes (powers of 2), so they are a lot more likely to be reused for future stream buffers.
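Roughly, the growth policy could look like this (a sketch, not the actual MessageBuffer code; the helper names are mine):

// Sketch of the proposed growth policy: start from a power-of-2 capacity and round
// the required capacity up to the next power of 2 on expansion, so freed chunks
// come from a small set of sizes.
#include <cstdint>

static const uint32_t INITIAL_CAPACITY = 32 * 1024;   // e.g. 32K starting capacity

// Round v up to the next power of 2 (32-bit sizes; values that are already a
// power of 2 are returned unchanged).
static uint32_t roundUpToPowerOf2(uint32_t v)
   {
   v--;
   v |= v >> 1;
   v |= v >> 2;
   v |= v >> 4;
   v |= v >> 8;
   v |= v >> 16;
   return v + 1;
   }

// New capacity for a buffer that must hold requiredSize bytes: never below the
// initial capacity, otherwise the next power of 2 >= requiredSize, which usually
// amounts to doubling the current capacity.
uint32_t computeNewCapacity(uint32_t requiredSize)
   {
   return (requiredSize <= INITIAL_CAPACITY) ? INITIAL_CAPACITY : roundUpToPowerOf2(requiredSize);
   }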

This also makes me think that a buddy allocator could work pretty well for this.

mpirvu commented 3 years ago

I agree with the fixed sizes starting at 32K. A buddy allocator (with block coalescing) should remain a todo, to be done before GA if time permits.