eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.

Portable SCC: Compressed Refs #7965

Open dsouzai opened 4 years ago

dsouzai commented 4 years ago

There are two approaches that were brought up in the Portable SCC discussion regarding how to deal with the potential for the compressed refs shift changing with the heap size.

  1. Have the JIT assume that the compressed shift might be 4. The generated code then loads the shift value into a register. This load can then be relocated.
  2. Fix the shift value to 3 if the JVM is going to use AOT code.
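
To illustrate the trade-off between the two approaches, here is a minimal sketch in C (illustrative only, not the actual JIT codegen; shiftSlot stands in for a hypothetical relocatable data slot that the AOT loader would patch):

#include <stdint.h>

/* Decompressing a compressed reference: object = heapBase + (compRef << shift). */

/* Approach 2: the shift is a compile-time constant baked into the generated code.
 * Fast (a single shift-by-immediate), but the AOT code is only valid for JVMs
 * that use that exact shift. */
uintptr_t decompressFixedShift(uintptr_t heapBase, uint32_t compRef) {
    return heapBase + ((uintptr_t)compRef << 3);
}

/* Approach 1: the shift is loaded from a slot that relocation can patch,
 * costing an extra load (and a register) on every decompression. */
uintptr_t decompressLoadedShift(uintptr_t heapBase, uint32_t compRef,
                                const uintptr_t *shiftSlot) {
    return heapBase + ((uintptr_t)compRef << *shiftSlot);
}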
DanHeidinga commented 4 years ago

fyi @dmitripivkine @amicic

Can one of you comment on what would be affected in the GC (and rest of the runtime) by forcing the CR shift to always be 3? How would this affect the heap size? Occupancy? More wasted space? etc

dsouzai commented 4 years ago

also fyi @fjeremic @andrewcraik @gita-omr @knn-k to give a codegen perspective.

amicic commented 4 years ago

For a large part the GC would not be affected (roots and metastructures like the remembered set do not use CR). The biggest performance impact, I guess, would come from jitted code, which is more for the JIT folks to comment on, e.g. how an unnecessary shift for a <4GB heap would affect performance.

amicic commented 4 years ago

I don't see obvious footprint implications.

amicic commented 4 years ago

Don't remember if this was discussed already, but what I struggle with more is jitted code (non)portability in a unified CR/non-CR VM.

dsouzai commented 4 years ago

jitted code (non)portability in a unified CR/non-CR VM.

This was brought up in the discussion; running in CR / non-CR results in a different SCC. I suppose theoretically the same SCC could be used to store both CR and non-CR versions of a compiled method, but that's better discussed in another issue (which I can open if we feel it's a discussion worth having).

DanHeidinga commented 4 years ago

Don't remember if this was discussed already, but what I struggle with more is jitted code (non)portability in a unified CR/non-CR VM.

The CR-ness is encoded in the cache name today, so they are forced to be separate caches. I expect the unified CR/non-CR VM will still use separate caches for the initial approach.

dmitripivkine commented 4 years ago

fyi @dmitripivkine @amicic

Can one of you comment on what would be affected in the GC (and rest of the runtime) by forcing the CR shift to always be 3? How would this affect the heap size? Occupancy? More wasted space? etc

fjeremic commented 4 years ago

From the codegen perspective, proposed solution 1 is going to be less performant and harder to implement correctly; solution 2 will give the most performance but the least flexibility.

harryyu1994 commented 4 years ago

@fjeremic @andrewcraik Do you guys have any issues/concerns with moving forward with solution 2 here (fixing the shift value to 3 for portable AOT)?

Summary of the solution:

FYI @mpirvu @vijaysun-omr @ymanton @dsouzai

dsouzai commented 4 years ago

the compressedref shift will be fixed to 3 (3 for <= 32GB and 4 for > 32GB).

Worth noting that there is a point when even shift 4 won't work, e.g. if the heap is so big that we have to run without compressedrefs; I don't know if the JVM will still explicitly require the user to pass in -Xnocompressedrefs once we have one build that can do both.
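
For reference, the arithmetic behind these thresholds (a quick sketch; a compressed reference is 32 bits, so a shift of s covers 2^32 << s bytes of heap):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* Maximum heap addressable by a 32-bit compressed reference with shift s. */
    for (int shift = 0; shift <= 4; shift++) {
        uint64_t maxHeapGB = ((uint64_t)1 << (32 + shift)) >> 30;
        printf("shift %d -> up to %llu GB\n", shift, (unsigned long long)maxHeapGB);
    }
    return 0; /* prints 4, 8, 16, 32, 64; past 64GB, uncompressed refs are needed */
}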

If we can't use compressedrefs, what should the portable AOT story be? Should we generate a SCC with the default CPU features & nocompressedrefs (thereby mandating that all future layers must have compressedrefs disabled)?

harryyu1994 commented 4 years ago

If we can't use compressedrefs, what should the portable AOT story be? Should we generate a SCC with the default CPU features & nocompressedrefs (thereby mandating that all future layers must have compressedrefs disabled)?

It sounds reasonable to me as we are giving the same treatment to shift 3/shift 4/nocompressedrefs. This shouldn't be that bad if most of the use cases fall under shift 3.

pshipton commented 4 years ago

Note that compressedrefs and non-compressedrefs don't share any cache; there are different cache files for these at the moment.

vijaysun-omr commented 4 years ago

Is it possible to estimate the impact of applying a constant shift by 3 in compiled code? In general I am supportive of the proposed direction you are taking, but I wanted to get a sense of how much slower we would be making portable AOT code because of the compressed refs shift. My expectation/hope is that the overhead isn't much more than 5%, but this is what I'm asking to be measured.

harryyu1994 commented 4 years ago

Is it possible to estimate the impact of applying a constant shift by 3 in compiled code? In general I am supportive of the proposed direction you are taking, but I wanted to get a sense of how much slower we would be making portable AOT code because of the compressed refs shift. My expectation/hope is that the overhead isn't much more than 5%, but this is what I'm asking to be measured.

Daytrader7

CompressedShift = 0

Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:name=liberty -Xscmx400M -Xscmaxaot256M -Xmx1G -XXgc:forcedShiftingCompressionAmount=0
Throughput      avg=2850.45     min=2759.30     max=2900.90     stdDev=45.5     maxVar=5.13%    confInt=1.07%   samples= 8
Intermediate results:
Run 0   187.3   2575.9  2890.0  2861.1  Avg=2861        CPU=113492 ms  Footprint=712364 KB
Run 1   200.6   2573.6  2887.4  2867.5  Avg=2868        CPU=108388 ms  Footprint=700740 KB
Run 2   221.0   2579.6  2901.8  2759.3  Avg=2759        CPU=112528 ms  Footprint=699352 KB
Run 3   222.1   2628.1  2830.4  2892.9  Avg=2893        CPU=107786 ms  Footprint=706204 KB
Run 4   180.4   2628.9  2903.8  2830.5  Avg=2830        CPU=108510 ms  Footprint=706704 KB
Run 5   226.5   2598.6  2705.5  2867.6  Avg=2868        CPU=112368 ms  Footprint=713240 KB
Run 6   221.2   2647.2  2837.1  2900.9  Avg=2901        CPU=110313 ms  Footprint=698736 KB
Run 7   231.8   2619.8  2928.2  2823.8  Avg=2824        CPU=110404 ms  Footprint=707608 KB
CompTime        avg=110473.62   min=107786.00   max=113492.00   stdDev=2150.7   maxVar=5.29%    confInt=1.30%   samples= 8
Footprint       avg=705618.50   min=698736.00   max=713240.00   stdDev=5599.8   maxVar=2.08%    confInt=0.53%   samples= 8

CompressedShift = 3

Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:name=liberty -Xscmx400M -Xscmaxaot256M -Xmx1G -XXgc:forcedShiftingCompressionAmount=3
Throughput      avg=2819.82     min=2776.80     max=2861.30     stdDev=29.0     maxVar=3.04%    confInt=0.69%   samples= 8
Intermediate results:
Run 0   163.2   2442.4  2795.4  2776.8  Avg=2777        CPU=137886 ms  Footprint=651608 KB
Run 1   144.9   2350.9  2826.5  2847.1  Avg=2847        CPU=137913 ms  Footprint=636708 KB
Run 2   152.9   2429.0  2857.6  2826.9  Avg=2827        CPU=131592 ms  Footprint=637768 KB
Run 3   174.9   2363.9  2790.5  2832.9  Avg=2833        CPU=140504 ms  Footprint=642980 KB
Run 4   161.6   2433.0  2810.8  2803.1  Avg=2803        CPU=132384 ms  Footprint=632412 KB
Run 5   139.9   2409.7  2819.2  2861.3  Avg=2861        CPU=132907 ms  Footprint=649168 KB
Run 6   177.9   2467.1  2801.9  2787.1  Avg=2787        CPU=137302 ms  Footprint=636512 KB
Run 7   178.0   2431.0  2764.5  2823.4  Avg=2823        CPU=133845 ms  Footprint=638280 KB
CompTime        avg=135541.62   min=131592.00   max=140504.00   stdDev=3256.5   maxVar=6.77%    confInt=1.61%   samples= 8
Footprint       avg=640679.50   min=632412.00   max=651608.00   stdDev=6681.6   maxVar=3.04%    confInt=0.70%   samples= 8

AcmeAir in Docker

Shift0:

run0: summary = 2938081 in  600s = 4896.7/s Avg:   1 Min:   0 Max:  891 Err:   0 (0.00%)
run1: summary = 3136727 in  600s = 5227.7/s Avg:   1 Min:   0 Max:  129 Err:   0 (0.00%)
run2: summary = 3147370 in  600s = 5245.4/s Avg:   1 Min:   0 Max:  109 Err:   0 (0.00%)
run3: summary = 3139280 in  600s = 5232.0/s Avg:   1 Min:   0 Max:  117 Err:   0 (0.00%)
run4: summary = 3133830 in  600s = 5222.8/s Avg:   1 Min:   0 Max:  79 Err:   0 (0.00%)
run5: summary = 3136712 in  600s = 5227.7/s Avg:   1 Min:   0 Max:  156 Err:   0 (0.00%)

Average (runs 1-5): 5231.12
Shift3:

run0: summary = 2964754 in  600s = 4941.1/s Avg:   1 Min:   0 Max:  260 Err:   0 (0.00%)
run1: summary = 3137234 in  600s = 5228.3/s Avg:   1 Min:   0 Max:  124 Err:   0 (0.00%)
run2: summary = 3126874 in  600s = 5211.3/s Avg:   1 Min:   0 Max:  110 Err:   0 (0.00%)
run3: summary = 3139452 in  600s = 5232.2/s Avg:   1 Min:   0 Max:  64 Err:   0 (0.00%)
run4: summary = 3134675 in  600s = 5224.3/s Avg:   1 Min:   0 Max:  100 Err:   0 (0.00%)
run5: summary = 3139328 in  600s = 5232.1/s Avg:   1 Min:   0 Max:  113 Err:   0 (0.00%)

Average (runs 1-5): 5225.64

Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:none -Xnoaot -Xmx1G -XXgc:forcedShiftingCompressionAmount=0
Throughput      avg=2997.16     min=2976.00     max=3022.10     stdDev=13.9     maxVar=1.55%    confInt=0.31%   samples= 8
Intermediate results:
Run 0   161.6   2661.4  3026.7  3022.1  Avg=3022        CPU=131266 ms  Footprint=597944 KB
Run 1   175.7   2600.5  3029.5  2990.0  Avg=2990        CPU=127742 ms  Footprint=594136 KB
Run 2   178.5   2622.8  3002.0  2984.6  Avg=2985        CPU=130536 ms  Footprint=602824 KB
Run 3   161.8   2686.5  3003.7  3000.8  Avg=3001        CPU=129596 ms  Footprint=596588 KB
Run 4   131.0   2617.7  2820.5  2976.0  Avg=2976        CPU=143361 ms  Footprint=603160 KB
Run 5   157.4   2657.5  3016.9  2999.4  Avg=2999        CPU=129978 ms  Footprint=604736 KB
Run 6   175.0   2656.1  2978.4  3001.4  Avg=3001        CPU=130472 ms  Footprint=592384 KB
Run 7   192.8   2599.7  3042.6  3003.0  Avg=3003        CPU=130239 ms  Footprint=613256 KB
CompTime        avg=131648.75   min=127742.00   max=143361.00   stdDev=4843.3   maxVar=12.23%   confInt=2.46%   samples= 8
Footprint       avg=600628.50   min=592384.00   max=613256.00   stdDev=6774.0   maxVar=3.52%    confInt=0.76%   samples= 8
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:none -Xnoaot -Xmx1G -XXgc:forcedShiftingCompressionAmount=3
Throughput      avg=2933.98     min=2877.50     max=2974.10     stdDev=29.9     maxVar=3.36%    confInt=0.68%   samples= 8
Intermediate results:
Run 0   173.7   2609.1  2960.8  2918.4  Avg=2918        CPU=133268 ms  Footprint=599836 KB
Run 1   179.3   2641.4  2952.0  2920.2  Avg=2920        CPU=128775 ms  Footprint=597212 KB
Run 2   154.8   2589.6  2955.9  2956.3  Avg=2956        CPU=131750 ms  Footprint=606756 KB
Run 3   191.9   2553.4  2955.2  2945.6  Avg=2946        CPU=129687 ms  Footprint=597068 KB
Run 4   149.1   2640.5  2965.7  2974.1  Avg=2974        CPU=129883 ms  Footprint=600332 KB
Run 5   193.3   2638.6  2941.3  2927.3  Avg=2927        CPU=127431 ms  Footprint=603852 KB
Run 6   168.7   2530.4  2892.4  2877.5  Avg=2878        CPU=145817 ms  Footprint=596084 KB
Run 7   176.7   2569.6  2945.2  2952.4  Avg=2952        CPU=134123 ms  Footprint=596880 KB
CompTime        avg=132591.75   min=127431.00   max=145817.00   stdDev=5798.9   maxVar=14.43%   confInt=2.93%   samples= 8
Footprint       avg=599752.50   min=596084.00   max=606756.00   stdDev=3809.2   maxVar=1.79%    confInt=0.43%   samples= 8
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:name=liberty -Xscmx400M -Xscmaxaot256M -Xmx1G -XXgc:forcedShiftingCompressionAmount=0
Throughput      avg=2964.54     min=2933.90     max=3002.30     stdDev=25.2     maxVar=2.33%    confInt=0.57%   samples= 8
Intermediate results:
Run 0   220.7   2656.6  2940.8  2933.9  Avg=2934        CPU=103509 ms  Footprint=712676 KB
Run 1   228.4   2674.1  2920.4  2950.9  Avg=2951        CPU=105973 ms  Footprint=719892 KB
Run 2   223.8   2727.0  2960.8  2947.2  Avg=2947        CPU=103124 ms  Footprint=704384 KB
Run 3   215.6   2704.4  2978.5  2978.7  Avg=2979        CPU=103663 ms  Footprint=709576 KB
Run 4   235.9   2666.1  2967.8  3002.3  Avg=3002        CPU=103964 ms  Footprint=710316 KB
Run 5   218.4   2676.8  2964.8  2997.3  Avg=2997        CPU=101415 ms  Footprint=704660 KB
Run 6   176.1   2719.4  2953.1  2958.0  Avg=2958        CPU=103691 ms  Footprint=726336 KB
Run 7   214.4   2654.4  2957.3  2948.0  Avg=2948        CPU=106512 ms  Footprint=714952 KB
CompTime        avg=103981.38   min=101415.00   max=106512.00   stdDev=1608.1   maxVar=5.03%    confInt=1.04%   samples= 8
Footprint       avg=712849.00   min=704384.00   max=726336.00   stdDev=7481.4   maxVar=3.12%    confInt=0.70%   samples= 8
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:name=liberty -Xscmx400M -Xscmaxaot256M -Xmx1G -XXgc:forcedShiftingCompressionAmount=3
Throughput      avg=2889.20     min=2842.80     max=2943.50     stdDev=37.0     maxVar=3.54%    confInt=0.86%   samples= 8
Intermediate results:
Run 0   181.7   2536.2  2979.7  2933.4  Avg=2933        CPU=123566 ms  Footprint=693680 KB
Run 1   185.0   2523.2  2887.1  2910.4  Avg=2910        CPU=128682 ms  Footprint=640464 KB
Run 2   178.7   2494.4  2877.1  2943.5  Avg=2944        CPU=128197 ms  Footprint=637768 KB
Run 3   175.7   2553.8  2879.9  2889.7  Avg=2890        CPU=129298 ms  Footprint=639216 KB
Run 4   157.6   2498.3  2862.5  2875.5  Avg=2876        CPU=123893 ms  Footprint=640976 KB
Run 5   187.6   2497.7  2868.2  2842.8  Avg=2843        CPU=125097 ms  Footprint=632568 KB
Run 6   178.9   2420.8  2828.6  2865.9  Avg=2866        CPU=127235 ms  Footprint=637264 KB
Run 7   173.4   2305.5  2862.9  2852.4  Avg=2852        CPU=119157 ms  Footprint=636492 KB
CompTime        avg=125640.62   min=119157.00   max=129298.00   stdDev=3410.0   maxVar=8.51%    confInt=1.82%   samples= 8
Footprint       avg=644803.50   min=632568.00   max=693680.00   stdDev=19923.9  maxVar=9.66%    confInt=2.07%   samples= 8
harryyu1994 commented 4 years ago

I have spent some time coming up with an implementation and here's the whole story:

First, a recap of the original compressed shift design:

I proceeded to implement this and found some limitations with our existing infrastructure in the codebase:

Due to these limitations, I'm proposing an alternative solution.

@vijaysun-omr @mpirvu @dsouzai Let me know if there are any concerns with the alternative solution.

DanHeidinga commented 4 years ago

@dmitripivkine Does the CR shift change the object alignment requirements? Does forcing a 3-bit shift mean small heaps will waste more space and therefore incur higher GC overhead?

mpirvu commented 4 years ago

if the compressed shift value is 4 then we will generate a warning message to the user that the heap may be too large for the portableSharedCache option to work.

I assume we will continue to generate AOT code that is portable from the processor point of view.

dmitripivkine commented 4 years ago

@dmitripivkine Does the CR shift change the object alignment requirements? Does forcing a 3-bit shift mean small heaps will waste more space and therefore incur higher GC overhead?

Only for the 4-bit shift (which requires 16-byte alignment for objects). All other cases are covered by the minimum heap object alignment of 8 bytes.
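
To spell out why only the 4-bit shift costs anything, here is a small illustrative sketch (not GC source): an object must start on a boundary whose low shift bits are all zero so those bits can be dropped from the compressed reference; shifts 0-3 are already satisfied by the default 8-byte object alignment, while shift 4 forces 16-byte alignment and hence extra padding.

#include <stdint.h>

/* Round an allocation address up to the alignment implied by the shift. */
uintptr_t alignObject(uintptr_t addr, unsigned shift) {
    uintptr_t align = ((uintptr_t)1 << shift) < 8 ? 8 : ((uintptr_t)1 << shift);
    return (addr + align - 1) & ~(align - 1); /* e.g. 24 -> 32 only when shift == 4 */
}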

harryyu1994 commented 4 years ago

I assume we will continue to generate AOT code that is portable from the processor point of view.

Yes, we will always use the portable processor feature set when -XX:+PortableSharedCache is specified.

dmitripivkine commented 4 years ago

@DanHeidinga @vijaysun-omr Moving the question here: https://github.com/eclipse/omr/pull/5436 forces any run in a container to use shift 3 (unless it needs to use shift 4). It obviously prevents using the most performant shift 0 for small container applications. Would you please confirm this is the desired behaviour?

harryyu1994 commented 4 years ago

I think in containers the plan is to sacrifice a bit of performance in exchange for maximum portability. I have run some experiments comparing shift0 and shift3 and didn't see a significant throughput drop. I'll leave the decision to Vijay @vijaysun-omr though.

DanHeidinga commented 4 years ago

Moving the question here: eclipse/omr#5436 forces any run in a container to use shift 3 (unless it needs to use shift 4). It obviously prevents using the most performant shift 0 for small container applications. Would you please confirm this is the desired behaviour?

My understanding is that this is the expected behaviour only when -XX:+PortableSharedCache is specified, and in that case we accept the tradeoff for better portability.

harryyu1994 commented 4 years ago

My understanding is that this is the expected behaviour only when -XX:+PortableSharedCache is specified, and in that case we accept the tradeoff for better portability.

Yes, that is correct. But in containers the PortableSharedCache feature is enabled by default: the portable processor feature set will be used by default for AOT compilations unless disabled by -XX:-PortableSharedCache. The question here is whether we also want the shift set to 3 by default in containers.

mpirvu commented 4 years ago

The question here is whether we also want the shift set to 3 by default in containers.

I would say yes, but only if AOT is enabled.
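
A hedged sketch of the gating being discussed (the function and parameter names are illustrative, not OpenJ9 source):

/* Forced compressed-refs shift proposed for containers; -1 means no forcing. */
int forcedCompressedRefsShift(int inContainer, int portableSharedCache,
                              int aotEnabled, int heapOver32GB) {
    if (heapOver32GB)
        return 4;   /* shift 3 cannot address a heap above 32GB */
    if (inContainer && portableSharedCache && aotEnabled)
        return 3;   /* portable AOT default under discussion */
    return -1;      /* no forcing: the JVM picks the smallest shift the heap allows */
}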

vijaysun-omr commented 4 years ago

The shift-by-3 code is only generated for an AOT compilation in containers.

So, in the unaffected category are: 1) JIT compilations inside or outside containers and 2) AOT compilations outside containers.

This is a conscious choice made in the same philosophy as the change to use the "portable processor feature set", i.e. we want to make the code portable at a small performance cost in containers (in the case of the compressed refs shift, the cost was negligible, as @harryyu1994 measured). Since AOT compilations can be (and usually are) recompiled as JIT compilations if they are deemed important for peak throughput (this is neither the first nor the most significant way in which AOT compilations are worse than JIT compilations), the steady-state impact won't even be as much as measured in the AOT experiments with and without the portability changes. During the startup and rampup phases, when AOT code is used heavily, these minor performance differences are unlikely to be a big enough deal to compromise on portability in containers.

harryyu1994 commented 4 years ago

The shift-by-3 code is only generated for an AOT compilation in containers.

So, in the unaffected category are: 1) JIT compilations inside or outside containers and 2) AOT compilations outside containers.

This is a conscious choice made in the same philosophy as the change to use the "portable processor feature set", i.e. we want to make the code portable at a small performance cost in containers (in the case of the compressed refs shift, the cost was negligible, as @harryyu1994 measured). Since AOT compilations can be (and usually are) recompiled as JIT compilations if they are deemed important for peak throughput (this is neither the first nor the most significant way in which AOT compilations are worse than JIT compilations), the steady-state impact won't even be as much as measured in the AOT experiments with and without the portability changes. During the startup and rampup phases, when AOT code is used heavily, these minor performance differences are unlikely to be a big enough deal to compromise on portability in containers.

Correct me if I'm wrong, but I thought the JIT compilations inside containers would also have to use shift3 if we made the AOT compilations shift by 3. So JIT compilations inside containers are affected (though we didn't see a throughput drop in my experiment when comparing AOT+JIT shift0 vs. AOT+JIT shift3).

harryyu1994 commented 4 years ago

The question here is whether we also want the shift set to 3 by default in containers.

I would say yes, but only if AOT is enabled.

It may not be possible to check whether AOT is enabled this early:

enum INIT_STAGE {
    PORT_LIBRARY_GUARANTEED,        /* 0 */
    ALL_DEFAULT_LIBRARIES_LOADED,   /* 1 */
    ALL_LIBRARIES_LOADED,           /* 2 */
    DLL_LOAD_TABLE_FINALIZED,       /* 3 - JIT-specific -X options consumed here */
    VM_THREADING_INITIALIZED,       /* 4 */
    HEAP_STRUCTURES_INITIALIZED,    /* 5 */
    ALL_VM_ARGS_CONSUMED,           /* 6 */
};

The shift is set at ALL_LIBRARIES_LOADED (stage 2), very early in initialization, before the JIT-specific -X options are consumed at DLL_LOAD_TABLE_FINALIZED (stage 3), so whether AOT is enabled isn't known yet.

mpirvu commented 4 years ago

Looking at the code https://github.com/eclipse/openj9/blob/master/runtime/shared/shrclssup.c#L34-L66 I see that vm->sharedCacheAPI->sharedCacheEnabled is set very early, and SCC options are also parsed very early. But then there is this piece of code in the same function: https://github.com/eclipse/openj9/blob/master/runtime/shared/shrclssup.c#L304-L328 which seems to deal with the -Xshareclasses:none option. I wonder why this piece of code can't be grouped with the previous one. That way we would know very early whether the SCC is 'likely' going to be used. @hangshao0

vijaysun-omr commented 4 years ago

@harryyu1994 Discussing with @mpirvu some more, I feel that we need more data points if we are going to slow down JITed code (in addition to AOTed code) inside containers. Could you please run SPECjbb2015 (please ask Piyush if you need help with accessing a setup for it) and maybe SPECjbb2005 (that is much easier to run) and check what the throughput overhead is ?

Additionally, the overhead of the shift would be platform dependent and so if one wanted to take a design decision for all platforms, the effect of the shift ought to be measured on the other platforms first.

mpirvu commented 4 years ago

I would also add quarkus throughput experiments since quarkus is more likely to be run in containers.

hangshao0 commented 4 years ago

But then there is this piece of code in the same function: https://github.com/eclipse/openj9/blob/master/runtime/shared/shrclssup.c#L304-L328 which seems to deal with the -Xshareclasses:none option. I wonder why this piece of code can't be grouped with the previous one. That way we would know very early whether the SCC is 'likely' going to be used.

Looking at the code, what it does is unload the SCC dll if -Xshareclasses:none is present. I guess that is the reason why it is done in the DLL_LOAD_TABLE_FINALIZED stage. Once the SCC dll is unloaded, all SCC-related functionality will be inactive.

mpirvu commented 4 years ago

unload the SCC dll if -Xshareclasses:none is present

This means that we load the SCC dll before checking the command line options. Be that as it may, we could add another check for -Xshareclasses:none when SCC options are parsed.
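
A minimal sketch of that extra check (illustrative only; the real parsing in shrclssup.c goes through OpenJ9's VM-args helpers, and the right-most occurrence of an option wins on the command line):

#include <string.h>

/* Returns 1 if the effective (right-most) -Xshareclasses option is :none. */
int isShareClassesNone(int argc, char **argv) {
    int none = 0;
    for (int i = 0; i < argc; i++) {
        if (strncmp(argv[i], "-Xshareclasses", 14) == 0) {
            none = (strcmp(argv[i], "-Xshareclasses:none") == 0);
        }
    }
    return none; /* 1 => SCC disabled, so AOT cannot be used */
}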

hangshao0 commented 4 years ago

we could add another check for -Xshareclasses:none when SCC options are parsed.

Yes. It looks fine to me if another check for -Xshareclasses:none is added in the block at lines 34-66.

mpirvu commented 4 years ago

Quarkus+CRUD on x86 loses 0.9% in throughput when we force shift3 instead of shift0 for compressedrefs:

Stats for rest-crud-quarkus-openj9:j11 with JAVA_OPTS=-Xms128m -Xmx128m -Xshareclasses:none -XXgc:forcedShiftingCompressionAmount=3
Throughput:     avg=12040.20    min=11931.00    max=12111.10    stdDev=57.7     maxVar=1.51%    confInt=0.28%   samples=10
Footprint:      avg=123.01      min=105.90      max=129.90      stdDev=6.7      maxVar=22.66%   confInt=3.17%   samples=10

Stats for rest-crud-quarkus-openj9:j11 with JAVA_OPTS=-Xms128m -Xmx128m -Xshareclasses:none -XXgc:forcedShiftingCompressionAmount=0
Throughput:     avg=12140.48    min=12065.70    max=12209.50    stdDev=48.3     maxVar=1.19%    confInt=0.23%   samples=10
Footprint:      avg=125.31      min=120.40      max=129.30      stdDev=2.8      maxVar=7.39%    confInt=1.28%   samples=10
harryyu1994 commented 4 years ago

SPECjbb2015GMR multi_2grp_gencon

Shift3

Job ID max_jOPS critical_jOPS hbIR_max hbIR_settled
9962189 12177 4435 13838 13099
9962190 13688 5824 19279 16099
9962191 14006 5749 16099 13449
9962192 11277 4417 13587 13415
means 12787 5106.25 15700.75 14015.5
medians 12932.5 5092 14968.5 13432
confidence_interval 0.15982393515568 0.24493717290884 0.26746374600271 0.1586869314041
min 11277 4417 13587 13099
max 14006 5824 19279 16099
stddev 1284.5183273637 786.11592656554 2639.4603457273 1397.9111798203

Shift0

Job ID max_jOPS critical_jOPS hbIR_max hbIR_settled
9962180 12718 4750 16099 14334
9962181 14972 5971 16099 13449
9962182 13040 4531 16099 13795
9962183 14167 5686 16099 13449
means 13724.25 5234.5 16099 13756.75
medians 13603.5 5218 16099 13622
confidence_interval 0.12035581689665 0.21319070129398 0 0.048339382455907
min 12718 4531 16099 13449
max 14972 5971 16099 14334
stddev 1038.2107605555 701.41214702912   417.97158994362

Added more runs

Shift 3

Job ID max_jOPS critical_jOPS hbIR_max hbIR_settled
9994492 13253 5439 16566 15678
9994493 13445 5534 14939 14322
9994494        
9994495 15133 6707 16099 13449
means 13943.666666667 5893.3333333333 15868 14483
medians 13445 5534 16099 14322
confidence_interval 0.15737812859215 0.2542198958441 0.11199389127022 0.16451397337762
min 13253 5439 14939 13449
max 15133 6707 16566 15678
stddev 1034.4570234347 706.25514747387 837.73683218538 1123.1878738662

Shift 0

Job ID max_jOPS critical_jOPS hbIR_max hbIR_settled
9994483 12508 5022 13449 13024
9994484 13146 5101 14939 14322
9994485 16099 6128 16099 13449
9994486 13362 5131 16099 13795
means 13778.75 5345.5 15146.5 13647.5
medians 13254 5116 15519 13622
confidence_interval 0.18344971565841 0.1558672596321 0.13202131493524 0.064024675274026
min 12508 5022 13449 13024
max 16099 6128 16099 14322
stddev 1588.754097818 523.68852065581 1256.8578545988 549.19972080595

I don't think this is a good benchmark for this purpose, as the fluctuations are too large.

mpirvu commented 4 years ago

That's a 6.8% drop in max_jOPS and a 2.5% drop in critical_jOPS. Fluctuations are huge though (min-max ~20%), so I am not sure we can draw any conclusions from such a small dataset.

mpirvu commented 4 years ago

My DT7 experiments with AOT enabled show a 2.1% regression when moving from shift 0 to shift 3

Results for JDK=/home/mpirvu/sdks/OpenJDK11U-jre_x64_linux_openj9_2020-05-21-10-15 jvmOpts=-Xms1024m -Xmx1024m -XXgc:forcedShiftingCompressionAmount=0
Throughput      avg=3530.25     min=3434.60     max=3587.30     stdDev=44.9     maxVar=4.45%    confInt=0.74%   samples=10
CompTime        avg=137425.30   min=128296.00   max=179999.00   stdDev=15107.7  maxVar=40.30%   confInt=6.37%   samples=10
Footprint       avg=932900.80   min=912844.00   max=948924.00   stdDev=9184.8   maxVar=3.95%    confInt=0.57%   samples=10

Results for JDK=/home/mpirvu/sdks/OpenJDK11U-jre_x64_linux_openj9_2020-05-21-10-15 jvmOpts=-Xms1024m -Xmx1024m -XXgc:forcedShiftingCompressionAmount=3
Throughput      avg=3455.40     min=3392.70     max=3521.00     stdDev=36.2     maxVar=3.78%    confInt=0.61%   samples=10
CompTime        avg=139633.70   min=132116.00   max=182410.00   stdDev=15162.1  maxVar=38.07%   confInt=6.29%   samples=10
Footprint       avg=930844.00   min=922164.00   max=945488.00   stdDev=7221.0   maxVar=2.53%    confInt=0.45%   samples=10
harryyu1994 commented 4 years ago

That's a 6.8% drop in max_jOPS and a 2.5% drop in critical_jOPS. Fluctuations are huge though (min-max ~20%), so I am not sure we can draw any conclusions from such a small dataset.

Not sure why the fluctuations are so large. Originally the heap size was set to 24GB; I had to change it to 2GB to be able to use shift0. Maybe the test does not work well with a smaller heap.

harryyu1994 commented 4 years ago

ILOG_WODM 851-4way-Seg5FastpathRVEJB (on Power)

Shift 0

Job ID Global Throughput Average response time Min TPS Max TPS Pause Time Total Pause Time
9994528 8638.9371504854 1.8573037069837 504.7412381469 582.0975 13.859526383526 14.445926640927
9994529 8762.1322573549 1.8264175610697 534.71616320959 560.30609923475 13.895542351454 14.478946902655
9994530 8740.2146440973 1.835861262807 506.15 586.72603318492 13.904566037736 14.485850314465
9994531 8664.0184392624 1.8513182105413 510.4425 579.65 14.166381979695 14.671502538071
9994532 8577.5554538932 1.8694810666913 509.735 571.67 13.984703208556 14.73531684492
9994533 8691.1423921766 1.8423403917498 525.7875 567.96858007855 14.010517902813 14.684122762148
9994534 8502.5874741253 1.8831568605721 494.7275 555.5325 14.109282694848 14.744678996037
9994535 8658.9934676143 1.8495857046636 515.9275 564.43 13.919508322663 14.490800256082
means 8654.4476598762 1.8519330956348 512.77842516956 571.04758906228 13.981253610162 14.592143156913
medians 8661.5059534384 1.8504519576024 510.08875 569.81929003927 13.95210576561 14.581151397076
confidence_interval 0.0081348395932811 0.0082178422245874 0.02052919348065 0.016151072515423 0.0065275035989177 0.0073217207304309
min 8502.5874741253 1.8264175610697 494.7275 555.5325 13.859526383526 14.445926640927
max 8762.1322573549 1.8831568605721 534.71616320959 586.72603318492 14.166381979695 14.744678996037
stddev 84.198081874972 0.018201070854603 12.589702870929 11.030304909653 0.10914581344745 0.1277750589018

Shift 3

Job ID Global Throughput Average response time Min TPS Max TPS Pause Time Total Pause Time
9994541 8119.8086062203 1.9734693630701 482.52 533.94 13.927745554036 14.523974008208
9994542 8286.7210135417 1.9335406181712 484.4725 541.3 14.028153225806 14.621639784946
9994543 8238.9619674721 1.9421575015578 502.385 524.27 14.281521505376 14.873037634409
9994544 8388.4166737499 1.9096835513286 500.78124804688 547.5725 13.915002663116 14.43093608522
9994545 8408.0668386632 1.9034667560754 515.315 539.43 13.90702393617 14.452569148936
9994546 8298.7935361939 1.9281740577079 509.36622658443 526.865 13.899852393617 14.452348404255
9994547 8441.6219797253 1.8954390538826 523.06 533.69 14.077266311585 14.597009320905
9994548 8412.5731827431 1.9026361479098 513.48871627821 540.7625 13.878481333333 14.53224
means 8324.3704747887 1.9235708812129 503.92358636369 535.97875 13.98938086538 14.56046929836
medians 8343.6051049719 1.9189288045183 505.87561329222 536.685 13.921374108576 14.528107004104
confidence_interval 0.010993300632059 0.011362164570798 0.024010815613053 0.012185469952163 0.0081766615307008 0.0082655166059187
min 8119.8086062203 1.8954390538826 482.52 524.27 13.878481333333 14.43093608522
max 8441.6219797253 1.9734693630701 523.06 547.5725 14.281521505376 14.873037634409
stddev 109.44435177091 0.026138647857235 14.470563630049 7.8109472171159 0.13680071373816 0.14393261774689

Seeing a 4% drop in throughput on Power.

vijaysun-omr commented 4 years ago

@andrewcraik @zl-wang see above overhead(s)

harryyu1994 commented 4 years ago

I have also updated the original post for SPECjbb2015GMR. I don't think we can draw any conclusions from that particular benchmark, as we always seem to get large fluctuations (multiple attempts, and not a small dataset considering that each run takes over 3 hours).

amicic commented 4 years ago

Most of the variability typically comes from the JIT, but this heap size is also fairly small for jbb2015 and may contribute as well (by causing large variations in the number of global GCs, which are relatively expensive).

If these tests are to be repeated, try a heap as big as possible while still being able to run shift0, with 2GB given to Tenure and the rest to the Nursery (for example -Xmx3200M -Xms3200M -Xmn1200M).

zl-wang commented 4 years ago

Maybe due to the shift-3 case's bigger measurement variability, the overhead looked about twice as large as expected. We have prior experience with this overhead ... about 2-2.5%. It might be worth another, more stable measurement of the shift-3 case.

harryyu1994 commented 4 years ago

Most of the variability typically comes from the JIT, but this heap size is also fairly small for jbb2015 and may contribute as well (by causing large variations in the number of global GCs, which are relatively expensive).

If these tests are to be repeated, try a heap as big as possible while still being able to run shift0, with 2GB given to Tenure and the rest to the Nursery (for example -Xmx3200M -Xms3200M -Xmn1200M).

Tried with -Xmx3200M -Xms3200M -Xmn1200M on x86. The shift0 runs were pretty stable, the shift3 runs were not: a 3.2% drop in max_jOPS and a 2% drop in critical_jOPS. Going to give this another try.

Shift 3

Job ID max_jOPS critical_jOPS hbIR_max hbIR_settled
9994645 18938 11502 23095 22360
9994646 18938 11227 23095 21847
9994647 19169 10885 23095 21419
9994648 21247 11684 23095 19279
means 19573 11324.5 23095 21226.25
medians 19053.5 11364.5 23095 21633
confidence_interval 0.09114537985632 0.048897960523052 0 0.1014854988189
min 18938 10885 23095 19279
max 21247 11684 23095 22360
stddev 1121.3001382324 348.04836828617   1353.9639027685

Shift 0

Job ID max_jOPS critical_jOPS hbIR_max hbIR_settled
9994636 20555 12086 23095 22548
9994637 20093 11388 23095 22360
9994638 20202 11982 27674 23095
9994639 20093 11496 23095 22706
means 20235.75 11738 24239.75 22677.25
medians 20147.5 11739 23095 22627
confidence_interval 0.017214402858354 0.047064354097083 0.15027360018152 0.021914250955476
min 20093 11388 23095 22360
max 20555 12086 27674 23095
stddev 218.94805319984 347.22903104435 2289.5 312.35383248276
dmitripivkine commented 4 years ago

Do you use -XXgc:forcedShiftingCompressionAmount=3 to force shift 3 ?

harryyu1994 commented 4 years ago

Do you use -XXgc:forcedShiftingCompressionAmount=3 to force shift 3 ?

Yes

harryyu1994 commented 4 years ago

With a larger dataset, I'm measuring a 3.5% throughput drop on Power.

First 8 runs: 8288.3387340932 (shift 3) vs. 8581.7531089419 (shift 0). Next 8 runs: 8316.9258529547 (shift 3) vs. 8627.474541907 (shift 0). Both show a ~3.5% throughput drop.

Shift 3, first 8 runs:

Job ID Global Throughput Average response time Min TPS Max TPS Pause Time Total Pause Time
9994855 8227.0338779673 1.9450356503132 503.18874202814 526.035 13.904885598923 14.4253243607
9994856 8087.2274532643 1.9783969421294 494.9975 517.92093638361 13.935554945055 14.535127747253
9994857 8317.3519463409 1.9245225413624 500.05874985313 537.415 14.015772543742 14.649139973082
9994858 8286.6920988556 1.9310648048965 504.66 527.8886802783 14.190730201342 14.867684563758
9994859 8207.8741850326 1.9494444556917 506.135 519.855 13.985191117093 14.581004037685
9994860 8392.4408795395 1.907375533393 503.745 545.16 13.938728 14.444909333333
9994861 8334.6407679616 1.920294696344 499.66 533.34116664708 14.107698795181 14.611708165997
9994862 8453.4486637834 1.8928913679089 509.36872657818 532.835 14.051724842767 14.721377358491
means 8288.3387340932 1.9311282490049 502.72671480743 530.05634791362 14.016285755513 14.604534442537
medians 8302.0220225982 1.9277936731294 503.46687101407 530.36184013915 14.000481830417 14.596356101841
confidence_interval 0.011550726630632 0.011514626729118 0.0073576725719721 0.014272083518414 0.0057905694289637 0.0083223576011986
min 8087.2274532643 1.8928913679089 494.9975 517.92093638361 13.904885598923 14.4253243607
max 8453.4486637834 1.9783969421294 509.36872657818 545.16 14.190730201342 14.867684563758
stddev 114.49608734331 0.026593458983634 4.4237061399032 9.0473890683649 0.097066208198194 0.14536101225863

Shift 0, first 8 runs:

Job ID Global Throughput Average response time Min TPS Max TPS Pause Time Total Pause Time
9994868 8429.0328467609 1.9004089367754 506.765 552.32 14.128936925099 14.630515111695
9994869 8687.8546821798 1.8426826705212 522.0325 564.32608918478 13.933602791878 14.440649746193
9994870 8468.1746712228 1.8895803645447 514.3225 542.93614265964 14.165972440945 14.769738845144
9994871 8627.8107263089 1.8540877613996 525.47618630953 544.83613790966 13.899932484076 14.414615286624
9994872 8670.780576214 1.8459311333678 526.85723142769 555.88111029722 14.051856780735 14.553642585551
9994873 8555.9774620284 1.8699833702903 527.8925 541.9025 13.964604139715 14.498144890039
9994874 8579.5265630195 1.8652614005592 516.575 546.735 13.892619607843 14.611988235294
9994875 8634.8673438012 1.854260194939 509.6125 568.4310789223 13.914638569604 14.44150063857
means 8581.7531089419 1.8652744790497 518.69167721715 552.1710073717 13.994020467487 14.545099417389
medians 8603.6686446642 1.8597607977491 519.30375 549.5275 13.949103465797 14.525893737795
confidence_interval 0.0090968857000075 0.0092466689375262 0.013021956977964 0.015143005290046 0.0064300747604867 0.0069793709871253
min 8429.0328467609 1.8426826705212 506.765 541.9025 13.892619607843 14.414615286624
max 8687.8546821798 1.9004089367754 527.8925 568.4310789223 14.165972440945 14.769738845144
stddev 93.364677712505 0.02062727721853 8.0779169549425 9.9999889949771 0.107614892342 0.12140786619901

Shift 3, next 8 runs:

Job ID Global Throughput Average response time Min TPS Max TPS Pause Time Total Pause Time
9994930 8177.5107248928 1.9590636485304 487.47 541.2825 14.013406035665 14.80304526749
9994931 8432.8923393013 1.8986411307841 507.995 551.18 13.983329333333 14.492921333333
9994932 8228.0696491294 1.9454009387451 489.7725 529.6475 13.910288227334 14.588560216509
9994933 8403.3447909506 1.9049858210628 508.8425 543.6725 13.894525827815 14.501691390728
9994934 8385.7767133493 1.9085944995934 508.14622963443 541.58364604088 13.929327516778 14.648150335571
9994935 8330.2474844166 1.9258762781886 486.82378294054 558.22860442849 14.191576974565 14.858611780455
9994936 8267.1334929976 1.9375925939066 488.9325 546.98 14.161099319728 14.669331972789
9994937 8310.4316285999 1.9272640311479 494.9175 544.84113789716 14.038435549525 14.781244233378
means 8316.9258529547 1.9259273677449 496.61250157187 544.67698604582 14.015248598093 14.667944566282
medians 8320.3395565083 1.9265701546682 492.345 544.25681894858 13.998367684499 14.65874115418
confidence_interval 0.0089671497091839 0.0091344216653754 0.016839629548937 0.012702234285658 0.0066472032951229 0.0078402754117856
min 8177.5107248928 1.8986411307841 486.82378294054 529.6475 13.894525827815 14.492921333333
max 8432.8923393013 1.9590636485304 508.8425 558.22860442849 14.191576974565 14.858611780455
stddev 89.19306714941 0.021039470646772 10.001474451658 8.2743329580123 0.11141755278116 0.13753537856553

Shift 0, next 8 runs:

Job ID Global Throughput Average response time Min TPS Max TPS Pause Time Total Pause Time
9994917 8635.7070897714 1.853296592018 524.31 556.5375 14.183319371728 14.806257853403
9994918 8621.7088487345 1.8558855922125 531.7725 545.95 13.98879791395 14.501852672751
9994919 8533.2685029174 1.8753910779895 522.7925 546.08863477841 13.958782664942 14.470852522639
9994920 8574.0599079032 1.8675608874445 513.61871595321 564.5675 13.962389175258 14.472323453608
9994921 8510.2283664363 1.8815516956704 505.67 550.7275 14.622058124174 15.321787318362
9994922 8703.9959200816 1.8385274078447 526.2775 557.65 13.938370656371 14.615827541828
9994923 8781.1981281123 1.8221623975359 535.02 559.17860205349 14.360284634761 14.88064231738
9994924 8659.6295712997 1.8498789700963 506.085 569.3425 14.483114068441 14.988921419518
means 8627.474541907 1.8555318276015 520.69327699415 556.25527960399 14.187139576203 14.757308137436
medians 8628.707969253 1.8545910921153 523.55125 557.09375 14.086058642839 14.711042697615
confidence_interval 0.008675996428163 0.0087775211883997 0.017859539791627 0.012590065519452 0.015918315024142 0.017111525799816
min 8510.2283664363 1.8221623975359 505.67 545.95 13.938370656371 14.470852522639
max 8781.1981281123 1.8815516956704 535.02 569.3425 14.622058124174 15.321787318362
stddev 89.519345731438 0.019478438704893 11.121569557209 8.37560108854 0.27008830852039 0.30200193835895
harryyu1994 commented 4 years ago

SPECjbb2015 on x86. No throughput drop observed this time. (I grabbed the build from a different location this time, a non-source-code version of that build.)

-Xmx3200M -Xms3200M -Xmn1200M

Shift0

Job ID max_jOPS critical_jOPS hbIR_max hbIR_settled
9994899 20202 11928 27674 23095
9994900 19649 11962 23392 22775
9994901 20324 11498 23095 21847
9994902 20479 11678 27674 23095
means 20163.5 11766.5 25458.75 22703
medians 20263 11803 25533 22935
confidence_interval 0.028503988445181 0.029647326634529 0.16003411426044 0.041365280701093
min 19649 11498 23095 21847
max 20479 11962 27674 23095
stddev 361.24460780289 219.26163975184 2560.8224427581 590.26773586229

Job ID max_jOPS critical_jOPS hbIR_max hbIR_settled
9995099 20913 11952 24604 23716
9995100 19631 11039 23095 21062
9995101        
9995102        
means 20272 11495.5 23849.5 22389
medians 20272 11495.5 23849.5 22389
confidence_interval 0.14229072923525 0.17870145527142 0.14236234682501 0.26671743115415
min 19631 11039 23095 21062
max 20913 11952 24604 23716
stddev 906.51089348115 645.58849122332 1067.0241328105 1876.6613972691

Shift3

Job ID max_jOPS critical_jOPS hbIR_max hbIR_settled
9994908 20785 12163 23095 20517
9994909 20093 11586 23095 22360
9994910 20785 11861 23095 21847
9994911 20324 12112 23095 20517
means 20496.75 11930.5 23095 21310.25
medians 20554.5 11986.5 23095 21182
confidence_interval 0.026852923897151 0.035325331114737 0 0.070149805133889
min 20093 11586 23095 20517
max 20785 12163 23095 22360
stddev 345.94448013133 264.89557691035   939.60395025422

Job ID max_jOPS critical_jOPS hbIR_max hbIR_settled
9995108 20755 12015 27674 23095
9995109 21309 12721 27674 23095
9995110 21016 12134 23095 21847
9995111 20755 12249 27674 23095
means 20958.75 12279.75 26529.25 22783
medians 20885.5 12191.5 27674 23095
confidence_interval 0.020035367673363 0.040072637108334 0.13730484276789 0.043575648509854
min 20755 12015 23095 21847
max 21309 12721 27674 23095
stddev 263.93228298183 309.29099027723 2289.5 624
harryyu1994 commented 4 years ago

Judging from the various experiments we tried, the overhead of shift3 on x86 isn't very significant.