dotCMS / core

Headless/Hybrid Content Management System for Enterprises
http://dotcms.com

Run dotCMS with jemalloc #30619

Closed wezell closed 1 day ago

wezell commented 2 days ago

Parent Issue

No response

Task

Running dotCMS in our cloud infra, we have seen many pods getting OOM-killed at the container level, even when the JVM appears to have plenty of heap headroom. In a more perfect world, the JVM would die with an internal OutOfMemoryError, at which point we would know that, hey, Java needs a bigger -Xmx, but this is not what is happening. Instead, these containers are using untracked/off-heap/system memory that grows in some cases and results in the containers getting killed.

Right now, our best guidance for sizing dotCMS's JVM in a container with a large heap is to set -Xmx to ~65% of available memory. This means that if we want to run with a 10GB heap, we need a 16GB RAM limit on the pod. 4GB of overhead for the underlying OS is kind of nuts and leads to resource over-allocation and excessive costs. It would be ideal (and more $$ efficient) if we could tighten that up and, say, run -Xmx10g in a pod with a 12GB RAM limit.
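As a back-of-the-envelope check, the current vs. proposed sizing can be sketched in shell. The 16GB/12GB pod sizes come from the paragraph above; the `heap_for` helper and the ~83% figure (10GB heap in a 12GB pod) are illustrative, not an existing tool:

```shell
# Hypothetical sizing helper: heap we can afford for a given pod
# memory limit (MB) at a given heap-to-limit percentage.
heap_for() {
  limit_mb=$1
  pct=$2
  echo $(( limit_mb * pct / 100 ))
}

# Current guidance: 16384MB pod at 65% -> 10649MB heap (~10GB)
echo "current:  $(heap_for 16384 65)m"
# Proposed: 12288MB pod at ~83% -> 10199MB heap (~10GB)
echo "proposed: $(heap_for 12288 83)m"
```

In other words, the same ~10GB heap would fit in a pod 4GB smaller if the off-heap overhead can be brought under control.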

It seems that in some cases libraries that rely on JNI and "unsafe" off-heap memory allocations can cause system memory usage to leak/grow in a way that is very difficult to track. Apparently glibc's default malloc can fragment memory in a way that makes it impossible for the system to reclaim it. A fix for this is to replace glibc's malloc implementation with one that does a better job of releasing memory: jemalloc is a memory allocator implementation that limits memory fragmentation and allows system memory to be reclaimed.
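Wiring jemalloc into the image could look roughly like the sketch below. This is an assumption about how we'd do it, not a final Dockerfile change: the package name is correct for Debian/Ubuntu bases, but the library path varies by distro and architecture (x86_64 vs. aarch64), so it would need to be resolved per base image:

```shell
# Install jemalloc (assumption: Debian/Ubuntu base image)
apt-get update && apt-get install -y libjemalloc2

# Preload jemalloc so ALL native allocations in the JVM process
# (JNI libs, direct buffers, glibc internals) route through it
# instead of glibc malloc. Path shown is for amd64; aarch64 uses
# /usr/lib/aarch64-linux-gnu/libjemalloc.so.2.
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
```

Because LD_PRELOAD works at the dynamic-linker level, no code changes are needed in dotCMS itself; the JVM and every native library it loads pick up jemalloc transparently.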

I was looking into implementing a new image filter using libvips, a high-performance image library, which would rely on a JNI implementation. From reading about libvips, running it can cause memory usage to grow unbounded unless you use jemalloc or some other non-default memory allocator. This got me thinking that this might be the source of some of our problems too. I know we use JNI in a number of places, including our image resizing libraries and Sass compiler, and with all the libs we include, we are probably using "unsafe" operations in a number of places. I don't have a smoking-gun test case, but my gut is that moving to jemalloc has very few downsides and a very real possibility of improving our container memory usage profile.

Proposed Objective

Application Performance

Proposed Priority

Priority 2 - Important

Acceptance Criteria

Download the dotCMS docker image and run:

```shell
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2
export MALLOC_CONF=stats_print:true
java -version
```

You should see jemalloc stat output printed out.
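Beyond the stats banner, one could also confirm the library is actually mapped into the running process. A minimal sketch for any Linux box with /proc; checking the maps file this way is a suggested extra verification, not part of the acceptance criteria above:

```shell
# Report whether jemalloc is mapped into a process's address space.
# Uses the current shell's pid here; in practice, substitute the
# JVM's pid (e.g. from jps or pgrep java).
pid=$$
if grep -q jemalloc "/proc/$pid/maps"; then
  echo "jemalloc loaded"
else
  echo "jemalloc NOT loaded"
fi
```

This distinguishes "LD_PRELOAD was set but the .so path was wrong" (silently ignored by the loader) from "jemalloc is genuinely in use."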

External Links... Slack Conversations, Support Tickets, Figma Designs, etc.

References:

Assumptions & Initiation Needs

No response

Quality Assurance Notes & Workarounds

No response

Sub-Tasks & Estimates

No response

github-actions[bot] commented 2 days ago

PRs: