cloudfoundry / jvmkill

Terminate the JVM when resources are exhausted
Apache License 2.0

Question: how can I tell *why* jvmkill has terminated an app? #17

Closed chefhoobajoob closed 4 years ago

chefhoobajoob commented 4 years ago

I only recently learned that the java buildpack feeds memory settings to apps via JAVA_OPTS (our stack doesn't know anything about JAVA_OPTS), so until now my app had been launching successfully on whatever defaults the JVM uses, with no explicit memory configuration whatsoever.

Now that I've incorporated JAVA_OPTS into launch scripts, the jvm is picking up those settings plus the jvmkill-1.16.0_RELEASE agent, and I'm trying to come up to speed on how the memory configuration and jvmkill behavior is supposed to work, because now startup is consistently failing.

When app start fails, this is all I see in the logs:

Resource exhaustion event.
ResourceExhausted! (1/0)

Followed by memory and thread dump data, followed by:

jvmkill killing current process

I've been pushing the memory configuration extremely high, way past what's reasonably required, to see if I can find even a single configuration that won't get killed by jvmkill. So far no luck: I'm not finding any clues in either the memory or thread dumps, and jvmkill itself does not log any rationale for its decisions.

For example, for one recent attempt, the configured app memory was:

// NOTE: prior to jvmkill, this app was launching successfully with 1536M
memory: 3072M

The memory calculator configuration was:

JBP_CONFIG_OPEN_JDK_JRE: '{ jre: { version: 11.+ }, memory_calculator: { headroom: 0, stack_threads: 100 } }'
JAVA_OPTS: '-Xss4M'

At startup, this setup results in this memory configuration:

JVM Memory Configuration: -Xmx2297225K -XX:ReservedCodeCacheSize=240M -XX:MaxDirectMemorySize=10M -XX:MaxMetaspaceSize=182902K
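Those flags line up with the memory calculator's basic subtraction: heap is the container memory minus thread stacks, code cache, direct memory, metaspace, and headroom. A quick check (the formula here is my paraphrase of the calculator's documented behavior, not its actual source):

```java
public class MemCalcCheck {
    // Paraphrase of the memory calculator's subtraction, all values in KiB.
    public static long heapK() {
        long containerK = 3072L * 1024;    // memory: 3072M
        long stacksK    = 100L * 4 * 1024; // stack_threads: 100 at -Xss4M
        long codeK      = 240L * 1024;     // -XX:ReservedCodeCacheSize=240M
        long directK    = 10L * 1024;      // -XX:MaxDirectMemorySize=10M
        long metaK      = 182902L;         // -XX:MaxMetaspaceSize=182902K
        long headroomK  = 0;               // headroom: 0
        return containerK - (stacksK + codeK + directK + metaK + headroomK);
    }

    public static void main(String[] args) {
        // Prints 2297226K, within 1K of the logged -Xmx2297225K (rounding).
        System.out.println(heapK() + "K");
    }
}
```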

For this startup sequence, I have the JAVA_OPTS value mapped over to JAVA_TOOL_OPTIONS, which results in this message from the jvm:

Picked up JAVA_TOOL_OPTIONS: -agentpath:/home/vcap/app/.java-buildpack/open_jdk_jre/bin/jvmkill-1.16.0_RELEASE=printHeapHistogram=1 -Djava.io.tmpdir=/home/vcap/tmp -Djava.ext.dirs= -Djava.security.properties=/home/vcap/app/.java-buildpack/java_security/java.security -Xss4M -Xmx2297225K -XX:ReservedCodeCacheSize=240M -XX:MaxDirectMemorySize=10M -XX:MaxMetaspaceSize=182902K

It runs for a bit until jvmkill kills it:

Resource exhaustion event.
ResourceExhausted! (1/0)

...with these memory stats:

Memory usage:
   Heap memory: init 50331648, used 149044776, committed 419274752, max 2274623488
   Non-heap memory: init 7667712, used 195653024, committed 248655872, max -1

Memory pool usage:
   CodeHeap 'non-nmethods': init 2555904, used 1622912, committed 2555904, max 5828608
   Metaspace: init 0, used 138790704, committed 187289600, max -1
   Tenured Gen: init 33554432, used 146334912, committed 288923648, max 1568669696
   CodeHeap 'profiled nmethods': init 2555904, used 27555328, committed 27590656, max 198410240
   Eden Space: init 13500416, used 2709864, committed 115933184, max 627572736
   Survivor Space: init 1638400, used 0, committed 14417920, max 78381056
   Compressed Class Space: init 0, used 18481992, committed 21979136, max 178900992
   CodeHeap 'non-profiled nmethods': init 2555904, used 9209728, committed 9240576, max 198414336

Heap
 def new generation   total 127296K, used 2646K [0x0000000773c00000, 0x000000077c610000, 0x00000007a2800000)
  eden space 113216K,   2% used [0x0000000773c00000, 0x0000000773e96580, 0x000000077aa90000)
  from space 14080K,   0% used [0x000000077b850000, 0x000000077b850000, 0x000000077c610000)
  to   space 14080K,   0% used [0x000000077aa90000, 0x000000077aa90000, 0x000000077b850000)
 tenured generation   total 282152K, used 142905K [0x00000007a2800000, 0x00000007b3b8a000, 0x0000000800000000)
   the space 282152K,  50% used [0x00000007a2800000, 0x00000007ab38e4c0, 0x00000007ab38e600, 0x00000007b3b8a000)
 Metaspace       used 135537K, capacity 165221K, committed 182900K, reserved 336500K
  class space    used 18048K, capacity 21094K, committed 21464K, reserved 174708K

...and a thread dump for these threads:

Full thread dump OpenJDK 64-Bit Server VM (11.0.2+9 mixed mode):

Threads class SMR info:
_java_thread_list=0x00000000010ab500, length=78, elements={
0x00007f806408d800, 0x00007f8064090000, 0x00007f80640ac800, 0x00007f80640ae800,
0x00007f80640b0800, 0x00007f80640b3800, 0x00007f8064155000, 0x00007f80641b0000,
0x00007f80648ad800, 0x00007f80649b6800, 0x00007f8064a2b000, 0x00007f8064015000,
0x00007f8064c55000, 0x00007f8064c56000, 0x00007f8064c58000, 0x00007f8064dfe000,
0x00007f8064e01800, 0x00007f8064e03800, 0x00007f80658d3000, 0x00007f80659dc000,
0x00007f8064d9d000, 0x00007f8065af2000, 0x00007f8065af4800, 0x00007f8065af6800,
0x00007f8065af8800, 0x00007f8065afa800, 0x00007f8065afd800, 0x000000000125e800,
0x0000000000f29800, 0x0000000000b4f000, 0x00007f8066455000, 0x00007f8066561800,
0x00007f8066595800, 0x00007f8066598000, 0x00007f806659a000, 0x00007f80665b7800,
0x00007f80668ab000, 0x00007f80668b2000, 0x00007f80668b5800, 0x00007f80668b7000,
0x00007f80668b8800, 0x00007f80668ba800, 0x00007f80668bd000, 0x00007f80668bf000,
0x00007f80668c1000, 0x00007f80668c3000, 0x00007f80668cd000, 0x00007f80668d2000,
0x00007f80668d7800, 0x00007f80668dc800, 0x00007f80668e2000, 0x00007f80668e7000,
0x00007f80668eb000, 0x00007f80668ec000, 0x0000000000b05800, 0x00007f8064f67800,
0x00007f8064f6e000, 0x00007f8065ba8000, 0x00007f80651df000, 0x00007f8066247800,
0x00007f8066222800, 0x00007f8066226000, 0x00007f8064825800, 0x00007f806492b000,
0x00007f8064939000, 0x00007f806493b800, 0x00000000024fa000, 0x0000000001c42800,
0x000000000396a000, 0x00007f80652bd000, 0x00007f8066008800, 0x00007f806600a800,
0x00007f8064815800, 0x00007f8064816800, 0x00007f8065eb9000, 0x0000000001f67000,
0x0000000002086800, 0x00000000010ac800
}

Nothing in any of this diagnostic data looks suspicious: heap utilization is ~6% of the max, metaspace is ~75% of the max, class space is below capacity, Compressed Class Space is well below the max, there are no code cache warnings, and the thread count is about right. The largest thread stacks are coming from Nashorn, with depths in the 160-199 range, but it seems quite unlikely that any of them is exceeding the 4M stack size.

Anyway, I can't tell from any of this what jvmkill is reacting to - am I missing something obvious? Can I get jvmkill to be more verbose about what it's reacting to?
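In case it helps anyone reproducing this: the limits the JVM actually applied (however JAVA_OPTS/JAVA_TOOL_OPTIONS were assembled) can be confirmed from inside the process via the standard HotSpot diagnostic MXBean. A sketch, not part of the original logs:

```java
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

// Prints the effective values of the memory-related HotSpot flags,
// regardless of how the buildpack assembled them.
public class EffectiveLimits {
    public static String option(String name) {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        for (String name : new String[] {
                "MaxHeapSize", "MaxMetaspaceSize", "ReservedCodeCacheSize",
                "MaxDirectMemorySize", "ThreadStackSize" }) {
            System.out.println(name + " = " + option(name));
        }
    }
}
```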

chefhoobajoob commented 4 years ago

I've found that if I explicitly hardwire the settings below, I get a successful launch; remove either one and the application won't start. I'd rather not have to set these explicitly, but I'm unable to find a memory calculator configuration that works, and jvmkill was no help in determining which of these settings needed to change - I had to poke around with experimental values until I landed on these.

JAVA_OPTS: '-XX:MaxMetaspaceSize=512M -XX:ReservedCodeCacheSize=512M'
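For anyone replicating this, the same workaround expressed as a manifest fragment might look like the following (the app name is hypothetical; the other variable names are the standard java-buildpack ones shown above):

```yaml
applications:
- name: my-app            # hypothetical
  memory: 3072M
  env:
    JBP_CONFIG_OPEN_JDK_JRE: '{ jre: { version: 11.+ }, memory_calculator: { headroom: 0, stack_threads: 100 } }'
    JAVA_OPTS: '-Xss4M -XX:MaxMetaspaceSize=512M -XX:ReservedCodeCacheSize=512M'
```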
nebhale commented 4 years ago

The information exposed by jvmkill is simply what the JVM exposes to JVMTI agents. When the agent receives a resource exhausted notification, it dumps as much as it can, but it cannot offer any further diagnostic information beyond that.

I'd be really surprised if the ReservedCodeCache is the resource that's being exhausted, as it's a JIT cache that is evicted when more space is needed. Metaspace is a likely culprit though, especially if you're using Nashorn and dynamically creating classes at runtime, since those classes live in metaspace.

The memory calculator is the result of analyzing thousands of real-life applications, and while its algorithm is good for most applications, it has a particular blind spot for applications that dynamically create classes (Groovy applications are significantly affected by this too). The metaspace requirement can only be determined by knowing how many classes will be loaded at runtime, so if the classes don't exist on the filesystem before startup, we can't account for them in the calculation. Your choice to explicitly set the value for that memory region seems to be the right one for your application.
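One way to confirm that diagnosis before the kill fires is to watch the Metaspace pool from inside the app. A sketch using the standard MemoryPoolMXBean API (the probe class name is illustrative):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class MetaspaceProbe {
    // Returns used/committed/max for the Metaspace pool; max is -1 when unbounded.
    public static MemoryUsage metaspaceUsage() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                return pool.getUsage();
            }
        }
        throw new IllegalStateException("no Metaspace pool found");
    }

    public static void main(String[] args) {
        MemoryUsage u = metaspaceUsage();
        System.out.printf("Metaspace used=%dK committed=%dK max=%dK%n",
            u.getUsed() / 1024, u.getCommitted() / 1024, u.getMax() / 1024);
    }
}
```

Logging this periodically (or near class-generation hot spots) would show whether metaspace is climbing toward the calculated cap before the ResourceExhausted event lands.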

chefhoobajoob commented 4 years ago

Thanks @nebhale - we do run JavaScript on Nashorn together with vanilla Java classes in these services, so I'm convinced you're right that this blind spot is exactly what we ran into. We stopped processing JAVA_OPTS altogether for this reason - that was the fastest path back to stability.