Inconsistent behaviour with -XX:+IdleTuningGcOnIdle, mem not released back to OS on Idle everytime.

dinogun commented 6 years ago

I am raising this bug on behalf of Ruslan Synytsky. He reported that when using this app with the following commandline, RES mem as seen by top on idle does not reduce as expected every time. java -Xmx3276M -Xms32M -XX:+IdleTuningCompactOnIdle -XX:+IdleTuningGcOnIdle -XX:IdleTuningMinIdleWaitTime=10 -jar app.jar 1

dinogun commented 6 years ago

I have been able to recreate this and it appears that the RES mem reduces on multiple load invocations but not every time as expected.

dinogun commented 6 years ago

@ashu-mehra fyi

ashu-mehra commented 6 years ago

I ran the app using following java version:

openjdk version "9.0.4-adoptopenjdk"
OpenJDK Runtime Environment (build 9.0.4-adoptopenjdk+12)
Eclipse OpenJ9 VM (build master-be905b0c, JRE 9 Linux amd64-64-Bit Compressed References 20180710_223 (JIT enabled, AOT enabled)
OpenJ9   - be905b0c
OMR      - 2a9e1ca0
JCL      - 428fef795 based on jdk-9.0.4+12)

Command used is: java -Xmx3276M -Xms32M -XX:+IdleTuningCompactOnIdle -XX:+IdleTuningGcOnIdle -XX:IdleTuningMinIdleWaitTime=10 -verbose:gc -jar app.jar 1

After the load generation was over (after pressing enter key), I noticed the RSS size of the java process, as reported by top, is 1.290g Following output was reported after the load generation was over:

total time = 6364, ms
done
03:02:11 -> Init: 32M Used: 1246M Committed: 1624M Max: 3276M 
03:02:14 -> Init: 32M Used: 1246M Committed: 1624M Max: 3276M 
03:02:17 -> Init: 32M Used: 1246M Committed: 1624M Max: 3276M 
03:02:20 -> Init: 32M Used: 1246M Committed: 1624M Max: 3276M 
03:02:23 -> Init: 32M Used: 1246M Committed: 1624M Max: 3276M 
03:02:26 -> Init: 32M Used: 1246M Committed: 1624M Max: 3276M 
03:02:29 -> Init: 32M Used: 1246M Committed: 1624M Max: 3276M 
03:02:32 -> Init: 32M Used: 1246M Committed: 1624M Max: 3276M 
03:02:35 -> Init: 32M Used: 1246M Committed: 1624M Max: 3276M 
03:02:38 -> Init: 32M Used: 1246M Committed: 1624M Max: 3276M 
03:02:41 -> Init: 32M Used: 1246M Committed: 1624M Max: 3276M 
03:02:44 -> Init: 32M Used: 1246M Committed: 1624M Max: 3276M 
03:02:47 -> Init: 32M Used: 1246M Committed: 1624M Max: 3276M 
03:02:50 -> Init: 32M Used: 1246M Committed: 1624M Max: 3276M 
03:02:53 -> Init: 32M Used: 1246M Committed: 1624M Max: 3276M 
03:02:56 -> Init: 32M Used: 1246M Committed: 1624M Max: 3276M 
03:02:59 -> Init: 32M Used: 1246M Committed: 1624M Max: 3276M 
03:03:02 -> Init: 32M Used: 1246M Committed: 1624M Max: 3276M 
<heap-resize id="636" type="release free pages" space="tenure" amount="379011072" count="1" timems="0.020" reason="idle" timestamp="2018-07-11T15:03:05.358" />
03:03:05 -> Init: 32M Used: 1246M Committed: 1624M Max: 3276M 
03:03:08 -> Init: 32M Used: 1246M Committed: 1624M Max: 3276M 
03:03:11 -> Init: 32M Used: 1246M Committed: 1624M Max: 3276M

Note that after around a minute, there is heap-resize message generated by gc which says around 361MB of free pages were released. But the RSS of the java process did not decrease.

After generating the load second time, RSS size increased to 1.840g After second load, following output was generated:

total time = 4609, ms
done
03:03:29 -> Init: 32M Used: 1442M Committed: 2024M Max: 3276M 
03:03:32 -> Init: 32M Used: 1442M Committed: 2024M Max: 3276M 
03:03:35 -> Init: 32M Used: 1442M Committed: 2024M Max: 3276M 
03:03:38 -> Init: 32M Used: 1442M Committed: 2024M Max: 3276M 
03:03:41 -> Init: 32M Used: 1442M Committed: 2024M Max: 3276M 
03:03:44 -> Init: 32M Used: 1442M Committed: 2024M Max: 3276M 
03:03:47 -> Init: 32M Used: 1442M Committed: 2024M Max: 3276M 
03:03:50 -> Init: 32M Used: 1442M Committed: 2024M Max: 3276M 
03:03:53 -> Init: 32M Used: 1442M Committed: 2024M Max: 3276M 
03:03:56 -> Init: 32M Used: 1442M Committed: 2024M Max: 3276M 
03:03:59 -> Init: 32M Used: 1442M Committed: 2024M Max: 3276M 
03:04:02 -> Init: 32M Used: 1442M Committed: 2024M Max: 3276M 
03:04:05 -> Init: 32M Used: 1442M Committed: 2024M Max: 3276M 
03:04:08 -> Init: 32M Used: 1442M Committed: 2024M Max: 3276M 
03:04:11 -> Init: 32M Used: 1442M Committed: 2024M Max: 3276M 
03:04:14 -> Init: 32M Used: 1442M Committed: 2024M Max: 3276M 
03:04:17 -> Init: 32M Used: 1442M Committed: 2024M Max: 3276M 
03:04:20 -> Init: 32M Used: 1442M Committed: 2024M Max: 3276M 
03:04:23 -> Init: 32M Used: 1442M Committed: 2024M Max: 3276M 
<exclusive-start id="744" timestamp="2018-07-11T15:04:24.061" intervalms="55501.715">
  <response-info timems="0.015" idlems="0.015" threads="0" lastid="0000000001BB0300" lastname="VM Runtime State Listener" />
</exclusive-start>
<sys-start reason="vm idle" id="745" timestamp="2018-07-11T15:04:24.061" intervalms="144569.070" />
<cycle-start id="746" type="global" contextid="0" timestamp="2018-07-11T15:04:24.061" intervalms="58734.592" />
<gc-start id="747" type="global" contextid="746" timestamp="2018-07-11T15:04:24.061">
  <mem-info id="748" free="609926992" total="2122121216" percent="28">
    <mem type="nursery" free="169929688" total="858783744" percent="19">
      <mem type="allocate" free="169929688" total="435486720" percent="39" />
      <mem type="survivor" free="0" total="423297024" percent="0" />
    </mem>
    <mem type="tenure" free="439997304" total="1263337472" percent="34">
      <mem type="soa" free="376830840" total="1200171008" percent="31" />
      <mem type="loa" free="63166464" total="63166464" percent="100" />
    </mem>
    <remembered-set count="69" />
  </mem-info>
</gc-start>
<allocation-stats totalBytes="127907680" >
  <allocated-bytes non-tlh="4104" tlh="127903576" />
  <largest-consumer threadName="main" threadId="0000000001A52700" bytes="127731680" />
</allocation-stats>
<gc-op id="749" type="mark" timems="7.381" contextid="746" timestamp="2018-07-11T15:04:24.068">
  <trace-info objectcount="37232" scancount="30202" scanbytes="937308" />
  <finalization candidates="95" enqueued="0" />
  <ownableSynchronizers candidates="1" cleared="0" />
  <references type="soft" candidates="2591" cleared="0" enqueued="0" dynamicThreshold="31" maxThreshold="32" />
  <references type="weak" candidates="778" cleared="0" enqueued="0" />
  <references type="phantom" candidates="21" cleared="0" enqueued="0" />
  <stringconstants candidates="4340" cleared="0"  />
</gc-op>
<gc-op id="750" type="classunload" timems="0.004" contextid="746" timestamp="2018-07-11T15:04:24.068">
  <classunload-info classloadercandidates="7" classloadersunloaded="0" classesunloaded="0" anonymousclassesunloaded="0" quiescems="0.000" setupms="0.003" scanms="0.001" postms="0.000" />
</gc-op>
<gc-op id="751" type="sweep" timems="2.971" contextid="746" timestamp="2018-07-11T15:04:24.071" />
<mem-info id="752" free="1697301464" total="2122121216" percent="79">
  <mem type="nursery" free="435486720" total="858783744" percent="50">
    <mem type="allocate" free="435486720" total="435486720" percent="100" />
    <mem type="survivor" free="0" total="423297024" percent="0" />
  </mem>
  <mem type="tenure" free="1261814744" total="1263337472" percent="99" micro-fragmented="48443" macro-fragmented="1041779">
    <mem type="soa" free="1198648280" total="1200171008" percent="99" />
    <mem type="loa" free="63166464" total="63166464" percent="100" />
  </mem>
  <remembered-set count="66" />
</mem-info>
<gc-op id="753" type="compact" timems="26.516" contextid="746" timestamp="2018-07-11T15:04:24.098">
  <compact-info movecount="31778" movebytes="1247440" reason="forced gc with compaction" />
</gc-op>
<gc-end id="754" type="global" contextid="746" durationms="37.566" usertimems="89.172" systemtimems="1.758" timestamp="2018-07-11T15:04:24.098" activeThreads="4">
  <mem-info id="755" free="1697351000" total="2122121216" percent="79">
    <mem type="nursery" free="435486720" total="858783744" percent="50">
      <mem type="allocate" free="435486720" total="435486720" percent="100" />
      <mem type="survivor" free="0" total="423297024" percent="0" />
    </mem>
    <mem type="tenure" free="1261864280" total="1263337472" percent="99">
      <mem type="soa" free="1198697816" total="1200171008" percent="99" />
      <mem type="loa" free="63166464" total="63166464" percent="100" />
    </mem>
    <remembered-set count="66" />
  </mem-info>
</gc-end>
<cycle-end id="756" type="global" contextid="746" timestamp="2018-07-11T15:04:24.098" />
<sys-end id="757" timestamp="2018-07-11T15:04:24.098" />
<exclusive-end id="758" timestamp="2018-07-11T15:04:24.098" durationms="37.973" />

<heap-resize id="759" type="release free pages" space="tenure" amount="1261850624" count="1" timems="76.973" reason="idle" timestamp="2018-07-11T15:04:24.175" />
03:04:26 -> Init: 32M Used: 405M Committed: 2024M Max: 3276M 
03:04:29 -> Init: 32M Used: 405M Committed: 2024M Max: 3276M 
03:04:32 -> Init: 32M Used: 405M Committed: 2024M Max: 3276M

Note that this time during idle GC kicked in after which heap-resize message was generated by gc which says around 1203MB of free pages were released. After this the RSS size came down to 0.821g

ashu-mehra commented 6 years ago

Looking at the code that de-commits memory during idle, looks like we don't maintain which pages have been de-committed so far. So if a page was de-committed during previous idle cycle and was not touched at all, then it would again be de-committed by gc in the next idle cycle. This is not an issue per se, but it would cause problem in accounting as the same page would again be counted in the amount released to OS reported by heap-resize tag in gc logs. This would result in mismatch in the amount value mentioned by heap-resize and the decrease in RSS reported by top.

ashu-mehra commented 6 years ago

I also noticed that after first cycle of load generation, heap usage was: 03:02:11 -> Init: 32M Used: 1246M Committed: 1624M Max: 3276M That means app was using 1246MB out of 1624MB of the current heap size. Around 378MB of heap was not touched by the application. This is close to 361MB of free pages reportedly de-committed during idle. But since these pages were actually never touched by the application, de-committing them had no effect and RSS value didn't go down at all during the first heap-resize.

ashu-mehra commented 6 years ago

@amicic Aleks, can you please take a look at this and correct me if my understanding in above comments is wrong.

ashu-mehra commented 6 years ago

Another observation when running the above sample app is that GC does not trigger during first idle phase even though there is hundreds of MB of garbage in the heap. Reason being failure to meet the following condition: https://github.com/eclipse/omr/blob/fa0f9feb347c5d29a7bc232adcac2ba07a97de74/gc/base/MemorySubSpace.cpp#L939

            if ((0 < freeMemorySize) && (_extensions->gcOnIdleRatio > (((previousMemorySize - freeMemorySize) * 100) / actvMemorySize))) {
                invokeGC = false;
            }

This condition prevents GC if the amount of allocation since last GC cycle is below a threshold (i.e. gcOnIdleRation) which is currently 40% of the active memory size. I verified this by printing the value of (previousMemorySize - freeMemorySize) * 100) / actvMemorySize)) Attaching GC logs for this case: gc.log.txt

If I manually remove the check for 'gcOnIdleRatio' in the code then the GC cycle during first idle phase is able to collect 800+ MB of data and we also see same order of reduction in RES value. GC logs for this case: gc.log.noratio.txt

I understand this behavior is specific to the app and cannot be generalized.

But I am wondering if we can make this ratio a tune-able via a command line option. This would allow the user to set the ratio based on their workload pattern and their comfort level with GC cycle during idle phase (in case they feel default ratio is too aggressive or too conservative).

@dmitripivkine @amicic your thoughts on this?

ashu-mehra commented 6 years ago

I forgot to mention, I had a chat with @amicic regarding incorrect reporting (in heap-resize tag) of the amount of heap de-committed (see https://github.com/eclipse/openj9/issues/2312#issuecomment-404110641) . To fix this we have few options: 1) Explicitly maintain a list of free pages which have been touched by the VM (and are therefore backed by physical page) and de-commit the pages only in this list. This is not easy to achieve. 2) As suggested by @amicic we take help from OS to determine if the candidate page for de-commit is actually backed by physical memory or not. If not, then skip de-committing it. I have not come across any API that can provide this information. I looked at /proc/<pid>/smaps file but it does not give page wise information. /proc/pid/pagemap can be used to get physical page frame for each virtual page but from Linux 4.0 and above you need CAP_SYS_ADMIN capability to read it as per https://unix.stackexchange.com/questions/345915/how-to-change-permission-of-proc-self-pagemap-file/383838#383838 3) Another workaround suggested by @amicic is to get the process RES size before idle phase and after idle phase, and report the difference in the heap-resize tag in gc logs.

I think at this point option 3 is best suited, in which case we would need to figure out mechanism of getting process size on all supported platforms where idle tuning is available.

ashu-mehra commented 6 years ago

@dmitripivkine @amicic can you please take a look at https://github.com/eclipse/openj9/issues/2312#issuecomment-408513209 and provide your comments.

dmitripivkine commented 6 years ago

If I did not miss something:

I guess system GC should be requested unconditionally before entering idle mode
I would not try to distinguish backed and not backed pages. The only problem I can see it is impossible to return back exact percentage of really taken memory. You can solve it for example by returning all unused memory so do the best at the moment. There is make sense to explain customers how it works. The output from RES mem will be inconsistent any way and we can not avoid this.
I support Aleks suggestion to add heap-resize tag to gc verbose log. However as I remember from -Xsoftmx test experience there might be a seconds or even minutes until OS updates it's output. Test releases half of heap (256m of 512m) and waits up to 5 minutes (sic!) until released memory shows up in OS (and it fails time to time)

ashu-mehra commented 6 years ago

I guess system GC should be requested unconditionally before entering idle mode

I am not aware of the background for gcOnIdleRatio but I guess it is to ensure we do a GC only if there is substantial amount of heap allocation since last GC cycle. @amicic do you remember the details for gcOnIdleRatio check. I am fine with doing GC unconditionally, given that we are in idle mode, or if we want it to be customizable, we can add an option to control gcOnIdleRatio and set its default value as 0.

I support Aleks suggestion to add heap-resize tag to gc verbose log.

We already emit a heap-resize tag when heap memory is de-committed during idle phase.

However as I remember from -Xsoftmx test experience there might be a seconds or even minutes until OS updates it's output. Test releases half of heap (256m of 512m) and waits up to 5 minutes (sic!) until released memory shows up in OS (and it fails time to time)

So, it looks like there is no way out to reliably and accurately determine the amount of heap memory de-committed, does it make any sense to display the amount of memory de-committed in heap-resize tag. It only adds to the confusion and can't be relied on at all. @dmitripivkine do you remember if there are any specific platforms on which you noticed the delay in OS updating its output? Can you please point out the -Xsoftmx test that you are referring to.

dmitripivkine commented 6 years ago

The test I mentioned is testSoftMxDisclaimMemory. The source for it: https://github.com/eclipse/openj9/blob/master/test/functional/JLM_Tests/src/j9vm/test/softmx/SoftmxAdvanceTest.java I don't remember is there any platform where it failed more often.

siruslan commented 6 years ago

I'm going to talk about elasticity of JVM at Code One and would love to mention OpenJ9 as well. However, after reading this thread I'm still not sure if it works properly already / if the issue was fixed. Can somebody summarize the current state and the plans on it? So I can pass this info to the audience without distortion.

DanHeidinga commented 6 years ago

@siruslan That's awesome to hear!

@dinogun @ashu-mehra Can you update @siruslan on the current state of the idle detection work?

ashu-mehra commented 6 years ago

I'm still not sure if it works properly already / if the issue was fixed.

I wouldn't say that there was any functional issue. We did make a code change in https://github.com/eclipse/omr/pull/2877 to ensure GC on every idle cycle detected by JVM rather than depending on a heuristic. Only confusion is the amount displayed in heap-resize tag may not match with the reduction in RES size reported by top, which mainly arises due to incorrect interpretation of that value and we plan to clarify it by adding proper documentation (as discussed in https://github.com/eclipse/omr/pull/2876).

I recently tested the app again with latest OpenJDK11 + OpenJ9 build https://adoptopenjdk.net/index.html?variant=openjdk11&jvmVariant=openj9, and it seems to be working fine now.

java version:

openjdk version "11" 2018-09-25
OpenJDK Runtime Environment AdoptOpenJDK (build 11+28)
Eclipse OpenJ9 VM AdoptOpenJDK (build openj9-0.10.0, JRE 11 Linux amd64-64-Bit Compressed References 20181003_44 (JIT enabled, AOT enabled)
OpenJ9   - e44c4716
OMR      - 32df9563
JCL      - eeda2c2d02 based on jdk-11+28)

Command: java -Xmx3276M -Xms32M -XX:+IdleTuningCompactOnIdle -XX:+IdleTuningGcOnIdle -XX:IdleTuningMinIdleWaitTime=10 -verbose:gc -jar app.jar 1

After the load generation was over (after pressing enter key), RES size reported by top was 1.385g. After the idle cycle, GC kicked in and it reduced the size to 0.475g This is the output I got:

10:21:22 -> Init: 32M Used: 1262M Committed: 1477M Max: 3276M 
10:21:25 -> Init: 32M Used: 1262M Committed: 1830M Max: 3276M 
10:21:28 -> Init: 32M Used: 1262M Committed: 1830M Max: 3276M 
10:21:31 -> Init: 32M Used: 1262M Committed: 1830M Max: 3276M 
10:21:34 -> Init: 32M Used: 1262M Committed: 1830M Max: 3276M 
10:21:37 -> Init: 32M Used: 1262M Committed: 1830M Max: 3276M 
10:21:40 -> Init: 32M Used: 1262M Committed: 1830M Max: 3276M 
10:21:43 -> Init: 32M Used: 1262M Committed: 1830M Max: 3276M 
10:21:46 -> Init: 32M Used: 1262M Committed: 1830M Max: 3276M 
10:21:49 -> Init: 32M Used: 1262M Committed: 1830M Max: 3276M 
10:21:52 -> Init: 32M Used: 1262M Committed: 1830M Max: 3276M 
10:21:55 -> Init: 32M Used: 1262M Committed: 1830M Max: 3276M 
10:21:58 -> Init: 32M Used: 1262M Committed: 1830M Max: 3276M 
10:22:01 -> Init: 32M Used: 1262M Committed: 1830M Max: 3276M 
10:22:04 -> Init: 32M Used: 1262M Committed: 1830M Max: 3276M 
10:22:07 -> Init: 32M Used: 1262M Committed: 1830M Max: 3276M 
10:22:10 -> Init: 32M Used: 1262M Committed: 1830M Max: 3276M 
10:22:13 -> Init: 32M Used: 1262M Committed: 1830M Max: 3276M 
10:22:16 -> Init: 32M Used: 1262M Committed: 1830M Max: 3276M 
<exclusive-start id="647" timestamp="2018-10-17T22:22:17.867" intervalms="55025.599">
  <response-info timems="0.026" idlems="0.026" threads="0" lastid="0000000002137000" lastname="VM Runtime State Listener" />
</exclusive-start>
<sys-start reason="vm idle" id="648" timestamp="2018-10-17T22:22:17.867" intervalms="73538.875" />
<cycle-start id="649" type="global" contextid="0" timestamp="2018-10-17T22:22:17.867" intervalms="57092.317" />
<gc-start id="650" type="global" contextid="649" timestamp="2018-10-17T22:22:17.867">
  <mem-info id="651" free="595505568" total="1918566400" percent="31">
    <mem type="nursery" free="173348232" total="511442944" percent="33">
      <mem type="allocate" free="173348232" total="274268160" percent="63" />
      <mem type="survivor" free="0" total="237174784" percent="0" />
    </mem>
    <mem type="tenure" free="422157336" total="1407123456" percent="30">
      <mem type="soa" free="351801368" total="1336767488" percent="26" />
      <mem type="loa" free="70355968" total="70355968" percent="100" />
    </mem>
    <remembered-set count="72" />
  </mem-info>
</gc-start>
<allocation-stats totalBytes="131832" >
  <allocated-bytes non-tlh="0" tlh="131832" />
  <largest-consumer threadName="Thread-3" threadId="0000000002151B00" bytes="112440" />
</allocation-stats>
<gc-op id="652" type="mark" timems="9.496" contextid="649" timestamp="2018-10-17T22:22:17.877">
  <trace-info objectcount="36211" scancount="29326" scanbytes="915644" />
  <finalization candidates="90" enqueued="0" />
  <ownableSynchronizers candidates="1" cleared="0" />
  <references type="soft" candidates="1531" cleared="0" enqueued="0" dynamicThreshold="9" maxThreshold="32" />
  <references type="weak" candidates="790" cleared="0" enqueued="0" />
  <references type="phantom" candidates="12" cleared="0" enqueued="0" />
  <stringconstants candidates="3889" cleared="0"  />
</gc-op>
<gc-op id="653" type="classunload" timems="0.006" contextid="649" timestamp="2018-10-17T22:22:17.877">
  <classunload-info classloadercandidates="7" classloadersunloaded="0" classesunloaded="0" anonymousclassesunloaded="0" quiescems="0.000" setupms="0.005" scanms="0.000" postms="0.001" />
</gc-op>
<gc-op id="654" type="sweep" timems="6.619" contextid="649" timestamp="2018-10-17T22:22:17.883" />
<mem-info id="655" free="1679944096" total="1918566400" percent="87">
  <mem type="nursery" free="274268160" total="511442944" percent="53">
    <mem type="allocate" free="274268160" total="274268160" percent="100" />
    <mem type="survivor" free="0" total="237174784" percent="0" />
  </mem>
  <mem type="tenure" free="1405675936" total="1407123456" percent="99" micro-fragmented="46410" macro-fragmented="4479464">
    <mem type="soa" free="1335319968" total="1336767488" percent="99" />
    <mem type="loa" free="70355968" total="70355968" percent="100" />
  </mem>
  <remembered-set count="69" />
</mem-info>
<gc-op id="656" type="compact" timems="40.007" contextid="649" timestamp="2018-10-17T22:22:17.924">
  <compact-info movecount="32931" movebytes="1250728" reason="forced gc with compaction" />
</gc-op>
<gc-end id="657" type="global" contextid="649" durationms="56.910" usertimems="109.890" systemtimems="4.404" timestamp="2018-10-17T22:22:17.924" activeThreads="4">
  <mem-info id="658" free="1679993240" total="1918566400" percent="87">
    <mem type="nursery" free="274268160" total="511442944" percent="53">
      <mem type="allocate" free="274268160" total="274268160" percent="100" />
      <mem type="survivor" free="0" total="237174784" percent="0" />
    </mem>
    <mem type="tenure" free="1405725080" total="1407123456" percent="99">
      <mem type="soa" free="1335369112" total="1336767488" percent="99" />
      <mem type="loa" free="70355968" total="70355968" percent="100" />
    </mem>
    <remembered-set count="69" />
  </mem-info>
</gc-end>
<cycle-end id="659" type="global" contextid="649" timestamp="2018-10-17T22:22:17.924" />
<sys-end id="660" timestamp="2018-10-17T22:22:17.924" />
<exclusive-end id="661" timestamp="2018-10-17T22:22:17.924" durationms="57.257" />

These options are already enabled in the docker images for OpenJ9 produced at Adopt https://github.com/AdoptOpenJDK/openjdk-docker/blob/master/11/jdk/ubuntu/Dockerfile.openj9.releases.full At the same time, we are also planning to enable them by default at runtime when OpenJ9 detects that it is running in container (currently targeted for 0.12.0 release). See https://github.com/eclipse/openj9/issues/3226 for more info.

@siruslan hope this helps. Let us know if more information is required.

siruslan commented 6 years ago

@ashu-mehra, thank you for clarification! It helps. Few more comments:

I recently tested the app again with latest OpenJDK11 + OpenJ9 build https://adoptopenjdk.net/index.html?variant=openjdk11&jvmVariant=openj9, and it seems to be working fine now.

I got a chance to test this build too today.

jvm@node169036-openj9-test ~ $ new/jdk-11+28/bin/java -XX:+IdleTuningCompactOnIdle -XX:+IdleTuningGcOnIdle -XX:IdleTuningMinIdleWaitTime=10 -Xmx4g -Xms32m -jar app.jar 1
09:52:55 -> Init: 32M Used: 6M Committed: 32M Max: 4096M 
09:52:58 -> Init: 32M Used: 7M Committed: 32M Max: 4096M 

09:53:01 -> Init: 32M Used: 583M Committed: 660M Max: 4096M 
09:53:04 -> Init: 32M Used: 1203M Committed: 1592M Max: 4096M 
total time = 5676, ms
done
09:53:07 -> Init: 32M Used: 1260M Committed: 1638M Max: 4096M 
09:53:10 -> Init: 32M Used: 1260M Committed: 1638M Max: 4096M 
09:53:13 -> Init: 32M Used: 1260M Committed: 1638M Max: 4096M 
09:53:16 -> Init: 32M Used: 1260M Committed: 1638M Max: 4096M 
09:53:19 -> Init: 32M Used: 1260M Committed: 1638M Max: 4096M  
...
09:53:55 -> Init: 32M Used: 1260M Committed: 1638M Max: 4096M 
09:53:58 -> Init: 32M Used: 1260M Committed: 1638M Max: 4096M 
09:54:01 -> Init: 32M Used: 225M Committed: 1638M Max: 4096M 
09:54:04 -> Init: 32M Used: 225M Committed: 1638M Max: 4096M 
09:54:07 -> Init: 32M Used: 225M Committed: 1638M Max: 4096M 
09:54:10 -> Init: 32M Used: 225M Committed: 1638M Max: 4096M 
09:54:13 -> Init: 32M Used: 225M Committed: 1638M Max: 4096M 
...
09:55:40 -> Init: 32M Used: 225M Committed: 1638M Max: 4096M

As we can notice Committed stays the same and it does not go down while Used decreases automatically, I use MemoryMXBean for checking these values. Is it related to incorrect interpretation you mentioned?

we are also planning to enable them by default at runtime when OpenJ9 detects that it is running in container (currently targeted for 0.12.0 release)

That's good! However I'm afraid that 'container' definition means Docker only in this context... What about LXD and OpenVZ? What is the main concern to enable it by default for any environment?

One more question to clarify - how do you determine idle state? Is it CPU usage based only? It will be useful for the community to get a clear description of an idle state.

Thanks

ashu-mehra commented 6 years ago

Committed stays the same and it does not go down while Used decreases automatically,

I think this is expected behavior. Used decreases because we do a GC cycle during idle, so the heap usage goes down. But the committed heap remains the same, because when we de-commit the heap, we don't update internal GC structures at all. De-commit only involves calling madvise(...,MADV_DONTNEED) on the free heap pages. Physical pages backing those heap pages may be reclaimed by OS, but internally JVM does not alter its structures that track committed heap. That's the reason committed value returned by MemoryMXBean does not change as it is based on JVM's internal structures. Only way, as of now, to measure the effect is to monitor resident memory size RES using top.

However I'm afraid that 'container' definition means Docker only in this context... What about LXD and OpenVZ?

You are right. Currently the mechanism added in OpenJ9 for container detection is specific to Docker. We have not yet tried it on LXD/OpenVZ or any other container technology. Reason being Docker is the most prominent container technology. If there is requirement to support LXD/OpenVZ, we can definitely try to extend OpenJ9's container awareness to include these environments. I would really appreciate if you can open an issue for this.

What is the main concern to enable it by default for any environment?

An obvious concern that comes to mind is de-committing the free heap pages may cause page faults when the application tries to allocate more objects. However, in our performance runs using DayTrader or AcmeAir benchmarks, we didn't find any performance degradation. I should add the impact of such optimizations is highly dependent on the workload.

how do you determine idle state?

The mechanism to determine idle state depends on Sampler thread's state logic which is present here: https://github.com/eclipse/openj9/blob/f418e63bc8a1879dc00c3ea5dfe231478f703c26/runtime/compiler/control/HookedByTheJit.cpp#L5922 The logic depends on how many active threads are there in the system, which in turn is computed by checking a particular flag in the thread's data structure struct J9VMThread. As far as I understand, this flag indicates the thread is accessing VM's internal data structures and/or executing bytecodes. It has been fairly accurate and lightweight heuristic to compute the idle state. https://github.com/eclipse/openj9/blob/f418e63bc8a1879dc00c3ea5dfe231478f703c26/runtime/compiler/control/HookedByTheJit.cpp#L6453

I did a lightning talk on OpenJ9 community call on Idle Tuning in OpenJ9 where I covered this a bit. You may want to take a loot at it https://www.youtube.com/watch?v=qtJrlpytoj4. Identification of idle state is covered from 6:10 onwards. @siruslan hope you find this useful. Thanks!

siruslan commented 6 years ago

But the committed heap remains the same, because when we de-commit the heap, we don't update internal GC structures at all. De-commit only involves calling madvise(...,MADV_DONTNEED) on the free heap pages. Physical pages backing those heap pages may be reclaimed by OS, but internally JVM does not alter its structures that track committed heap.

Is it a safe approach? I mean, is it possible that two different processes try to reserve the same memory pages in parallel?

Only way, as of now, to measure the effect is to monitor resident memory size RES using top.

Ok, understood. First of all I appreciate the invested efforts into the current implementation. However from other side, most likely such approach will introduce extra complexity for understanding and tracking the real memory usage. As far as I know Shenandoah and G1 GCs can handle it properly and show the real usage inside JVM after decommitting unused memory.

An obvious concern that comes to mind is de-committing the free heap pages may cause page faults when the application tries to allocate more objects.

Do you mean the situation when JVM decommits unused heap and exactly at the same time application requests more memory?

Currently the mechanism added in OpenJ9 for container detection is specific to Docker. We have not yet tried it on LXD/OpenVZ or any other container technology. Reason being Docker is the most prominent container technology.

I believe at least it will be good to mention Docker in the description, not just generic world "containers". It will definitely help to avoid confusions for people who know or use different kind of container technologies. Moreover it's very dynamic market and we will see more implementations of container runtimes.

The logic depends on how many active threads are there in the system, which in turn is computed by checking a particular flag in the thread's data structure struct J9VMThread. As far as I understand, this flag indicates the thread is accessing VM's internal data structures and/or executing bytecodes. It has been fairly accurate and lightweight heuristic to compute the idle state.

Thanks for clarification, I will talk to experts in this area and will provide a feedback a little bit later.

I did a lightning talk on OpenJ9 community call on Idle Tuning in OpenJ9 where I covered this a bit. You may want to take a loot at it https://www.youtube.com/watch?v=qtJrlpytoj4.

The video helped very much! Thank you for your efforts, @ashu-mehra.

ashu-mehra commented 6 years ago

Is it a safe approach? I mean, is it possible that two different processes try to reserve the same memory pages in parallel?

I believe this should be taken care by the OS itself.

However from other side, most likely such approach will introduce extra complexity for understanding and tracking the real memory usage. As far as I know Shenandoah and G1 GCs can handle it properly and show the real usage inside JVM after decommitting unused memory.

I think it was a conscious decision to avoid changing internal structures when de-committing the free heap, to keep the overhead of this mechanism as low as possible. @amicic do you have any thing more to add to it?

Do you mean the situation when JVM decommits unused heap and exactly at the same time application requests more memory?

Not exactly. I am referring to the situation when the application becomes active again after the idle cycle. It can allocate objects such that the JVM would have to use pages de-committed in previous idle cycle. Accessing such pages would cause the page faults, which would not have happened if the JVM had not de-committed the pages.

I believe at least it will be good to mention Docker in the description, not just generic world "containers".

Thanks for the suggestion. We will surely keep that in mind and update the descriptions accordingly.

siruslan commented 5 years ago

Guys, I'm talking about vertical scaling with Java at Geneva DevOps event this week and would like to mention OpenJ9 in this context as well.

What is the best or recommended way to quickly test vertical scaling with OpenJ9 and make sure that it works properly?

Please also let me know if you want to highlight anything specifics, for example recent updated related to this feature if any.

@DanHeidinga @ashu-mehra @amicic @dinogun

Thanks

DanHeidinga commented 5 years ago

What is the best or recommended way to quickly test vertical scaling with OpenJ9 and make sure that it works properly?

Please also let me know if you want to highlight anything specifics, for example recent updated related to this feature if any.

FYI @vijaysun-omr @mpirvu

ashu-mehra commented 5 years ago

I'm talking about vertical scaling with Java

@siruslan can you please elaborate on what exactly are you expecting when you mention vertical scaling with Java? Is it the ability to adjust heap size, cpus and other resources at runtime?

mpirvu commented 5 years ago

As explained here https://github.com/eclipse/openj9/issues/2312#issuecomment-431453020, OpenJ9 does not change the shape of the Java heap when idling (it does not de-commit memory); instead, OpenJ9 provides a hint to the OS that the freed pages from the heap are not going to be needed anytime soon and therefore they should be evicted from physical memory.

To test vertical scaling with respect to memory in OpenJ9 I recommend to monitor the resident-set-size (RSS) of a Java process with a script like this:

while true; do
   sleep 1
   ps -orss --no-headers --pid $1
done

Doing so for the application mentioned here https://github.com/eclipse/openj9/issues/2312#issue-337865166 and using the following options: -Xms32M -Xmx2g -Xjit:waitTimeToEnterDeepIdleMode=10000 -XX:+IdleTuningCompactOnIdle -XX:+IdleTuningGcOnIdle -XX:IdleTuningMinIdleWaitTime=10 I see the following RSS values

37528
37528
39576
37528
271308  <-- application started to allocate memory
529680
777736
1043460
1298904
1298904
1298904
1298904
1298904
1298904
1298904
1298904
1298904
1298904
1298904
1298904
1298904
1298904
1298904
371620    <-- Idle detected and heap pages removed from RSS
371620

Repeating the same experiment for HotSpot (Java12) with options: -Xms32M -Xmx2g -XX:+UseG1GC -XX:G1PeriodicGCInterval=3k I see the following values:

36628
36628
129376 <-- application started to allocate memory
539132
974652
1487220
1528592
1528592
1528592
1528592
1554004
1554004
1554004
1554044
1554308
1556360
81820    <-- Part of heap decommitted 
81820

Some notes:

OpenJ9 has a lower RSS peak, but during idle keeps more physical memory around
For OpenJ9 I added -Xjit:waitTimeToEnterDeepIdleMode=10000 to speedup the process of deep idle detection (default is to wait 50000 ms). Free heap pages are evicted from RSS after max(waitTimeToEnterDeepIdleMode, IdleTuningMinIdleWaitTime*1000) ms.

siruslan commented 5 years ago

Guys, thank you for your assistance! It's helpful. I believe I got enough details now and are able to explain people how it works behind the scene as well as how to track the real usage.

ashu-mehra commented 5 years ago

@siruslan I am trying to summarise the discussion on slack https://openj9.slack.com/archives/C8312LCV9/p1568771742039800 related to this issue. Please feel free to correct me if anything is wrong or add your comment here:

When running the application https://github.com/jelastic/java-vertical-scaling-test/blob/master/src/com/jelastic/verticalscaling/Load.java with following parameters -XX:+IdleTuningCompactOnIdle -XX:+IdleTuningGcOnIdle -XX:IdleTuningMinIdleWaitTime=1 -Xjit:waitTimeToEnterDeepIdleMode=1000, memory usage reported by container_memory_usage_bytes (I guess in Grafana dashboard which corresponds to corresponds to physical memory usage of the container) does not go down during idle cycle.

I used following command to reproduce the issue: java -XX:+IdleTuningCompactOnIdle -XX:+IdleTuningGcOnIdle -XX:IdleTuningMinIdleWaitTime=1 -Xmx2G -Xjit:verbose={compilePerformance},verbose={jitState},vlog=jit.log,waitTimeToEnterDeepIdleMode=1000 -Xverbosegclog -jar app.jar 100 2

Reason being the wait time to enter deep idle is set to 1 sec which effectively causes JVM to go into idle phase within 5 secs of inactivity. Now the test program sleeps for 60 secs before clearing the list here https://github.com/jelastic/java-vertical-scaling-test/blob/74e1a3dbe3b7c9d8a10d72fb51987734b4b2e420/src/com/jelastic/verticalscaling/Load.java#L81. Therefore, by the time idle tuning kicks in, list is still active and there is not much in the heap to be gc-ed and free pages to be released. However, if I clear the list immediately before the program sleeps for 60 secs, then the memory usage drops drastically when JVM detects idle state.

I should also point out that during my runs I noticed many times JVM does not come out of idle state during subsequent iterations of the loop because there is not much activity happening in the loop. As a result, memory usage does not drop in subsequent iterations. If the code is updated to make it cpu intensive in the active period, it would hopefully solve this problem as well.

fyi - @mpirvu

siruslan commented 4 years ago

The interest to java elasticity is keep growing as more people get familiar with the recent related improvements introduced in OpenJDK. Is there any way to configure OpenJ9 and get the same (or at least similar) results without the need to change the simple test app mentioned earlier in this thread? There is an overview presentation: State of Java Elasticity. Tuning Java Efficiency. The slide related to OpenJ9 looks weak at the moment :(... I believe we can improve it.

eclipse-openj9 / openj9

Inconsistent behaviour with -XX:+IdleTuningGcOnIdle, mem not released back to OS on Idle everytime. #2312