dsouzai opened this issue 4 years ago
fyi @dmitripivkine @amicic
Can one of you comment on what would be affected in the GC (and rest of the runtime) by forcing the CR shift to always be 3? How would this affect the heap size? Occupancy? More wasted space? etc
also fyi @fjeremic @andrewcraik @gita-omr @knn-k to give a codegen perspective.
For a large part, the GC would not be affected (roots and meta-structures like the remembered set do not use CR). The biggest performance impact, I guess, would come from jitted code, which is more for the JIT folks to comment on, e.g. how an unnecessary shift for a <4GB heap would affect performance.
I don't see obvious footprint implications.
I don't remember if this was discussed already, but what I struggle with more is jitted code (non)portability in a unified CR/non-CR VM.
jitted code (non)portability in a unified CR/non-CR VM.
This was brought up in the discussion; running in CR / non-CR results in a different SCC. I suppose theoretically the same SCC could be used to store both CR and non-CR versions of a compiled method, but that's better discussed in another issue (which I can open if we feel it's a discussion worth having).
I don't remember if this was discussed already, but what I struggle with more is jitted code (non)portability in a unified CR/non-CR VM.
The CR-ness is encoded in the cache name today so that they are forced to be separate caches. I expect the unified CR/non-CR VM will still use separate caches as the initial approach.
fyi @dmitripivkine @amicic
Can one of you comment on what would be affected in the GC (and rest of the runtime) by forcing the CR shift to always be 3? How would this affect the heap size? Occupancy? More wasted space? etc
From the codegen perspective, proposed solution 1 is going to be less performant and harder to implement correctly; solution 2 will give the most performance but the least flexibility.
@fjeremic @andrewcraik Do you guys have any issues/concerns with moving forward with solution 2 here (fixing the shift value to 3 for portable AOT)?
Summary of the solution: when -XX:+PortableSharedCache is specified, the compressedrefs shift will be fixed to 3 (3 for <= 32GB and 4 for > 32GB, with no compressedrefs above 64GB; I could be wrong on these numbers).

FYI @mpirvu @vijaysun-omr @ymanton @dsouzai
the compressedrefs shift will be fixed to 3 (3 for <= 32GB and 4 for > 32GB).
Worth noting that there is a point when even shift 4 won't work, e.g. if the heap is so big that we have to run without compressedrefs; I don't know if the JVM will still explicitly require the user to pass in -Xnocompressedrefs once we have one build that can do both.
If we can't use compressedrefs, what should the portable AOT story be? Should we generate a SCC with the default CPU features & nocompressedrefs (thereby mandating that all future layers must have compressedrefs disabled)?
If we can't use compressedrefs, what should the portable AOT story be? Should we generate a SCC with the default CPU features & nocompressedrefs (thereby mandating that all future layers must have compressedrefs disabled)?
It sounds reasonable to me as we are giving the same treatment to shift 3/shift 4/nocompressedrefs. This shouldn't be that bad if most of the use cases fall under shift 3.
Note that compressedrefs and non-compressedrefs runs don't share any cache; there are different cache files for these at the moment.
Is it possible to estimate the impact of applying a constant shift of 3 in compiled code? In general I am supportive of the proposed direction you are taking, but I wanted to get a sense for how much slower we would be making portable AOT code because of the compressed refs shift. My expectation/hope would be that the overhead isn't much more than 5%, but this is what I'm asking to be measured.
Is it possible to estimate the impact of applying a constant shift of 3 in compiled code? In general I am supportive of the proposed direction you are taking, but I wanted to get a sense for how much slower we would be making portable AOT code because of the compressed refs shift. My expectation/hope would be that the overhead isn't much more than 5%, but this is what I'm asking to be measured.
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:name=liberty -Xscmx400M -Xscmaxaot256M -Xmx1G -XXgc:forcedShiftingCompressionAmount=0
Throughput avg=2850.45 min=2759.30 max=2900.90 stdDev=45.5 maxVar=5.13% confInt=1.07% samples= 8
Intermediate results:
Run 0 187.3 2575.9 2890.0 2861.1 Avg=2861 CPU=113492 ms Footprint=712364 KB
Run 1 200.6 2573.6 2887.4 2867.5 Avg=2868 CPU=108388 ms Footprint=700740 KB
Run 2 221.0 2579.6 2901.8 2759.3 Avg=2759 CPU=112528 ms Footprint=699352 KB
Run 3 222.1 2628.1 2830.4 2892.9 Avg=2893 CPU=107786 ms Footprint=706204 KB
Run 4 180.4 2628.9 2903.8 2830.5 Avg=2830 CPU=108510 ms Footprint=706704 KB
Run 5 226.5 2598.6 2705.5 2867.6 Avg=2868 CPU=112368 ms Footprint=713240 KB
Run 6 221.2 2647.2 2837.1 2900.9 Avg=2901 CPU=110313 ms Footprint=698736 KB
Run 7 231.8 2619.8 2928.2 2823.8 Avg=2824 CPU=110404 ms Footprint=707608 KB
CompTime avg=110473.62 min=107786.00 max=113492.00 stdDev=2150.7 maxVar=5.29% confInt=1.30% samples= 8
Footprint avg=705618.50 min=698736.00 max=713240.00 stdDev=5599.8 maxVar=2.08% confInt=0.53% samples= 8
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:name=liberty -Xscmx400M -Xscmaxaot256M -Xmx1G -XXgc:forcedShiftingCompressionAmount=3
Throughput avg=2819.82 min=2776.80 max=2861.30 stdDev=29.0 maxVar=3.04% confInt=0.69% samples= 8
Intermediate results:
Run 0 163.2 2442.4 2795.4 2776.8 Avg=2777 CPU=137886 ms Footprint=651608 KB
Run 1 144.9 2350.9 2826.5 2847.1 Avg=2847 CPU=137913 ms Footprint=636708 KB
Run 2 152.9 2429.0 2857.6 2826.9 Avg=2827 CPU=131592 ms Footprint=637768 KB
Run 3 174.9 2363.9 2790.5 2832.9 Avg=2833 CPU=140504 ms Footprint=642980 KB
Run 4 161.6 2433.0 2810.8 2803.1 Avg=2803 CPU=132384 ms Footprint=632412 KB
Run 5 139.9 2409.7 2819.2 2861.3 Avg=2861 CPU=132907 ms Footprint=649168 KB
Run 6 177.9 2467.1 2801.9 2787.1 Avg=2787 CPU=137302 ms Footprint=636512 KB
Run 7 178.0 2431.0 2764.5 2823.4 Avg=2823 CPU=133845 ms Footprint=638280 KB
CompTime avg=135541.62 min=131592.00 max=140504.00 stdDev=3256.5 maxVar=6.77% confInt=1.61% samples= 8
Footprint avg=640679.50 min=632412.00 max=651608.00 stdDev=6681.6 maxVar=3.04% confInt=0.70% samples= 8
Shift0:
run0: summary = 2938081 in 600s = 4896.7/s Avg: 1 Min: 0 Max: 891 Err: 0 (0.00%)
run1: summary = 3136727 in 600s = 5227.7/s Avg: 1 Min: 0 Max: 129 Err: 0 (0.00%)
run2: summary = 3147370 in 600s = 5245.4/s Avg: 1 Min: 0 Max: 109 Err: 0 (0.00%)
run3: summary = 3139280 in 600s = 5232.0/s Avg: 1 Min: 0 Max: 117 Err: 0 (0.00%)
run4: summary = 3133830 in 600s = 5222.8/s Avg: 1 Min: 0 Max: 79 Err: 0 (0.00%)
run5: summary = 3136712 in 600s = 5227.7/s Avg: 1 Min: 0 Max: 156 Err: 0 (0.00%)
5231.12
Shift3:
run0: summary = 2964754 in 600s = 4941.1/s Avg: 1 Min: 0 Max: 260 Err: 0 (0.00%)
run1: summary = 3137234 in 600s = 5228.3/s Avg: 1 Min: 0 Max: 124 Err: 0 (0.00%)
run2: summary = 3126874 in 600s = 5211.3/s Avg: 1 Min: 0 Max: 110 Err: 0 (0.00%)
run3: summary = 3139452 in 600s = 5232.2/s Avg: 1 Min: 0 Max: 64 Err: 0 (0.00%)
run4: summary = 3134675 in 600s = 5224.3/s Avg: 1 Min: 0 Max: 100 Err: 0 (0.00%)
run5: summary = 3139328 in 600s = 5232.1/s Avg: 1 Min: 0 Max: 113 Err: 0 (0.00%)
5225.64
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:none -Xnoaot -Xmx1G -XXgc:forcedShiftingCompressionAmount=0
Throughput avg=2997.16 min=2976.00 max=3022.10 stdDev=13.9 maxVar=1.55% confInt=0.31% samples= 8
Intermediate results:
Run 0 161.6 2661.4 3026.7 3022.1 Avg=3022 CPU=131266 ms Footprint=597944 KB
Run 1 175.7 2600.5 3029.5 2990.0 Avg=2990 CPU=127742 ms Footprint=594136 KB
Run 2 178.5 2622.8 3002.0 2984.6 Avg=2985 CPU=130536 ms Footprint=602824 KB
Run 3 161.8 2686.5 3003.7 3000.8 Avg=3001 CPU=129596 ms Footprint=596588 KB
Run 4 131.0 2617.7 2820.5 2976.0 Avg=2976 CPU=143361 ms Footprint=603160 KB
Run 5 157.4 2657.5 3016.9 2999.4 Avg=2999 CPU=129978 ms Footprint=604736 KB
Run 6 175.0 2656.1 2978.4 3001.4 Avg=3001 CPU=130472 ms Footprint=592384 KB
Run 7 192.8 2599.7 3042.6 3003.0 Avg=3003 CPU=130239 ms Footprint=613256 KB
CompTime avg=131648.75 min=127742.00 max=143361.00 stdDev=4843.3 maxVar=12.23% confInt=2.46% samples= 8
Footprint avg=600628.50 min=592384.00 max=613256.00 stdDev=6774.0 maxVar=3.52% confInt=0.76% samples= 8
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:none -Xnoaot -Xmx1G -XXgc:forcedShiftingCompressionAmount=3
Throughput avg=2933.98 min=2877.50 max=2974.10 stdDev=29.9 maxVar=3.36% confInt=0.68% samples= 8
Intermediate results:
Run 0 173.7 2609.1 2960.8 2918.4 Avg=2918 CPU=133268 ms Footprint=599836 KB
Run 1 179.3 2641.4 2952.0 2920.2 Avg=2920 CPU=128775 ms Footprint=597212 KB
Run 2 154.8 2589.6 2955.9 2956.3 Avg=2956 CPU=131750 ms Footprint=606756 KB
Run 3 191.9 2553.4 2955.2 2945.6 Avg=2946 CPU=129687 ms Footprint=597068 KB
Run 4 149.1 2640.5 2965.7 2974.1 Avg=2974 CPU=129883 ms Footprint=600332 KB
Run 5 193.3 2638.6 2941.3 2927.3 Avg=2927 CPU=127431 ms Footprint=603852 KB
Run 6 168.7 2530.4 2892.4 2877.5 Avg=2878 CPU=145817 ms Footprint=596084 KB
Run 7 176.7 2569.6 2945.2 2952.4 Avg=2952 CPU=134123 ms Footprint=596880 KB
CompTime avg=132591.75 min=127431.00 max=145817.00 stdDev=5798.9 maxVar=14.43% confInt=2.93% samples= 8
Footprint avg=599752.50 min=596084.00 max=606756.00 stdDev=3809.2 maxVar=1.79% confInt=0.43% samples= 8
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:name=liberty -Xscmx400M -Xscmaxaot256M -Xmx1G -XXgc:forcedShiftingCompressionAmount=0
Throughput avg=2964.54 min=2933.90 max=3002.30 stdDev=25.2 maxVar=2.33% confInt=0.57% samples= 8
Intermediate results:
Run 0 220.7 2656.6 2940.8 2933.9 Avg=2934 CPU=103509 ms Footprint=712676 KB
Run 1 228.4 2674.1 2920.4 2950.9 Avg=2951 CPU=105973 ms Footprint=719892 KB
Run 2 223.8 2727.0 2960.8 2947.2 Avg=2947 CPU=103124 ms Footprint=704384 KB
Run 3 215.6 2704.4 2978.5 2978.7 Avg=2979 CPU=103663 ms Footprint=709576 KB
Run 4 235.9 2666.1 2967.8 3002.3 Avg=3002 CPU=103964 ms Footprint=710316 KB
Run 5 218.4 2676.8 2964.8 2997.3 Avg=2997 CPU=101415 ms Footprint=704660 KB
Run 6 176.1 2719.4 2953.1 2958.0 Avg=2958 CPU=103691 ms Footprint=726336 KB
Run 7 214.4 2654.4 2957.3 2948.0 Avg=2948 CPU=106512 ms Footprint=714952 KB
CompTime avg=103981.38 min=101415.00 max=106512.00 stdDev=1608.1 maxVar=5.03% confInt=1.04% samples= 8
Footprint avg=712849.00 min=704384.00 max=726336.00 stdDev=7481.4 maxVar=3.12% confInt=0.70% samples= 8
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:name=liberty -Xscmx400M -Xscmaxaot256M -Xmx1G -XXgc:forcedShiftingCompressionAmount=3
Throughput avg=2889.20 min=2842.80 max=2943.50 stdDev=37.0 maxVar=3.54% confInt=0.86% samples= 8
Intermediate results:
Run 0 181.7 2536.2 2979.7 2933.4 Avg=2933 CPU=123566 ms Footprint=693680 KB
Run 1 185.0 2523.2 2887.1 2910.4 Avg=2910 CPU=128682 ms Footprint=640464 KB
Run 2 178.7 2494.4 2877.1 2943.5 Avg=2944 CPU=128197 ms Footprint=637768 KB
Run 3 175.7 2553.8 2879.9 2889.7 Avg=2890 CPU=129298 ms Footprint=639216 KB
Run 4 157.6 2498.3 2862.5 2875.5 Avg=2876 CPU=123893 ms Footprint=640976 KB
Run 5 187.6 2497.7 2868.2 2842.8 Avg=2843 CPU=125097 ms Footprint=632568 KB
Run 6 178.9 2420.8 2828.6 2865.9 Avg=2866 CPU=127235 ms Footprint=637264 KB
Run 7 173.4 2305.5 2862.9 2852.4 Avg=2852 CPU=119157 ms Footprint=636492 KB
CompTime avg=125640.62 min=119157.00 max=129298.00 stdDev=3410.0 maxVar=8.51% confInt=1.82% samples= 8
Footprint avg=644803.50 min=632568.00 max=693680.00 stdDev=19923.9 maxVar=9.66% confInt=2.07% samples= 8
I have spent some time coming up with an implementation and here's the whole story:

First, a recap of the original compressed shift design: when -XX:+PortableSharedCache is specified during the cold run, if the compressed shift value is <= 3 then 3 will be used and persisted to the shared class cache; if the compressed shift value is 4 then 4 will be used (the persisted shift value is then picked up from the cache when -XX:+PortableSharedCache is specified during the warm run).

I proceeded to implement this, and have found some limitations with our existing infrastructure in the codebase:

- Although the CR shift is set in initializeRunTimeObjectAlignmentAndCRShift(), that point is in fact earlier than the earliest point the VM is able to load the SCC. As a result, with the current infrastructure it may not be possible to pick up the CR shift value from the SCC and then set it on the current JVM.
- -XX:+PortableSharedCache is yet to be parsed and processed when initializeRunTimeObjectAlignmentAndCRShift() is called, but this looks possible to work around without too much effort.

Due to these limitations, I'm proposing an alternative solution: when -XX:+PortableSharedCache is specified, if the compressed shift value is <= 3 we will fix the compressed shift to 3, and if the compressed shift value is 4 then we will generate a warning message to the user that the heap may be too large for the portableSharedCache option to work.

@vijaysun-omr @mpirvu @dsouzai Let me know if there are any concerns with the alternative solution.
@dmitripivkine Does the CR shift change the object alignment requirements? Does forcing a 3-bit shift mean small heaps will waste more space and therefore incur higher GC overhead?
if the compressed shift value is 4 then we will generate a warning message to the user that the heap may be too large for the portableSharedCache option to work.
I assume we will continue to generate AOT which is portable from the processor point of view.
@dmitripivkine Does the CR shift change the object alignment requirements? Does forcing a 3-bit shift mean small heaps will waste more space and therefore incur higher GC overhead?
Only for the 4-bit shift (which requires 16-byte alignment for objects). All other cases are covered by the minimum heap object alignment of 8 bytes.
I assume we will continue to generate AOT which is portable from the processor point of view.
Yes, we will always use the portable processor feature set when -XX:+PortableSharedCache is specified.
@DanHeidinga @vijaysun-omr Moving the question here: https://github.com/eclipse/omr/pull/5436 forces any run in a container to use shift 3 (except when it needs to use shift 4). It obviously prevents the use of the most performant shift 0 for small container applications. Would you please confirm this is the desired behaviour?
I think in containers the plan is to sacrifice a bit of performance in exchange for maximum portability. I have run some experiments comparing shift0 and shift3 and didn't see a significant throughput drop. I'll leave the decision to Vijay @vijaysun-omr though.
Moving the question here: eclipse/omr#5436 forces any run in a container to use shift 3 (except when it needs to use shift 4). It obviously prevents the use of the most performant shift 0 for small container applications. Would you please confirm this is the desired behaviour?
My understanding is that this is the expected behaviour only when -XX:+PortableSharedCache
is specified and in that case we accept the tradeoff for better portability.
My understanding is that this is the expected behaviour only when
-XX:+PortableSharedCache
is specified and in that case we accept the tradeoff for better portability.
Yes, that is correct. But in containers the PortableSharedCache feature is enabled by default: the portable processor feature set will be used by default for AOT compilations unless disabled by -XX:-PortableSharedCache. The question here is whether we want to also have the shift set to 3 by default in containers.
The question here is whether we want to also have the shift set to 3 by default in containers.
I would say yes, but only if AOT is enabled.
The shift by 3 code is only generated for an AOT compilation in containers.
So, in the unaffected category are: 1) JIT compilations inside or outside containers and 2) AOT compilations outside containers.
This is a conscious choice made in the same philosophy as the change to use the "portable processor feature set", i.e. we want to make the code portable at a small performance cost in containers (in the case of the compressed refs shift, the cost was negligible as @harryyu1994 measured). Since AOT compilations can be (and usually are) recompiled as JIT compilations if they are deemed important for peak throughput (and since this is neither the first nor the most significant way that AOT compilations are worse than JIT compilations), the steady-state impact won't even be as much as measured in the AOT experiments with and without portability changes. During the startup and rampup phases, when AOT code is used heavily, these minor performance differences are unlikely to be a big enough deal to compromise on portability in containers.
The shift by 3 code is only generated for an AOT compilation in containers.
So, in the unaffected category are: 1) JIT compilations inside or outside containers and 2) AOT compilations outside containers.
This is a conscious choice made in the same philosophy as the change to use the "portable processor feature set", i.e. we want to make the code portable at a small performance cost in containers (in the case of the compressed refs shift, the cost was negligible as @harryyu1994 measured). Since AOT compilations can be (and usually are) recompiled as JIT compilations if they are deemed important for peak throughput (and since this is neither the first nor the most significant way that AOT compilations are worse than JIT compilations), the steady-state impact won't even be as much as measured in the AOT experiments with and without portability changes. During the startup and rampup phases, when AOT code is used heavily, these minor performance differences are unlikely to be a big enough deal to compromise on portability in containers.
Correct me if I'm wrong, but I thought the JIT compilations inside containers would also have to use shift3 if we made the AOT compilations shift by 3. So JIT compilations inside containers are affected (though we didn't see a throughput drop in my experiment when comparing AOT+JIT shift0 vs. AOT+JIT shift3).
The question here is whether we want to also have the shift set to 3 by default in containers.
I would say yes, but only if AOT is enabled.
It may not be possible to check whether AOT is enabled this early.
```c
enum INIT_STAGE {
	PORT_LIBRARY_GUARANTEED,      /* 0 */
	ALL_DEFAULT_LIBRARIES_LOADED, /* 1 */
	ALL_LIBRARIES_LOADED,         /* 2 */
	DLL_LOAD_TABLE_FINALIZED,     /* 3 - consume JIT-specific -X options */
	VM_THREADING_INITIALIZED,     /* 4 */
	HEAP_STRUCTURES_INITIALIZED,  /* 5 */
	ALL_VM_ARGS_CONSUMED,         /* 6 */
	/* ... */
```
The shift is set at ALL_LIBRARIES_LOADED, very early in the initialization.
Looking at the code https://github.com/eclipse/openj9/blob/master/runtime/shared/shrclssup.c#L34-L66 I see that vm->sharedCacheAPI->sharedCacheEnabled is set very early and SCC options are also parsed very early. But then there is this piece of code in the same function: https://github.com/eclipse/openj9/blob/master/runtime/shared/shrclssup.c#L304-L328 which deals with the -Xshareclasses:none option. I wonder why this piece of code can't be grouped with the previous one; that way we would know very early whether the SCC is 'likely' going to be used. @hangshao0
@harryyu1994 Discussing with @mpirvu some more, I feel that we need more data points if we are going to slow down JITed code (in addition to AOTed code) inside containers. Could you please run SPECjbb2015 (please ask Piyush if you need help with accessing a setup for it) and maybe SPECjbb2005 (that is much easier to run) and check what the throughput overhead is ?
Additionally, the overhead of the shift would be platform dependent and so if one wanted to take a design decision for all platforms, the effect of the shift ought to be measured on the other platforms first.
I would also add quarkus throughput experiments since quarkus is more likely to be run in containers.
But then there is this piece of code in the same function: https://github.com/eclipse/openj9/blob/master/runtime/shared/shrclssup.c#L304-L328 which deals with the -Xshareclasses:none option. I wonder why this piece of code can't be grouped with the previous one. That way we would know very early whether the SCC is 'likely' going to be used.
Looking at the code, what it does is unload the SCC dll if -Xshareclasses:none is present. I guess that is the reason why it is done in stage DLL_LOAD_TABLE_FINALIZED: once the SCC dll is unloaded, all SCC-related functionality becomes inactive.
unload the SCC dll if -Xshareclasses:none is present
This means that we load the SCC dll before checking the command-line options. Be that as it may, we could add another check for -Xshareclasses:none when SCC options are parsed.
we could add another check for -Xshareclasses:none when SCC options are parsed.
Yes. It looks fine to me if another check for -Xshareclasses:none is added in the block at L34 to L66.
Quarkus+CRUD on x86 loses 0.9% in throughput when we force shift3 instead of shift0 for compressedrefs:
Stats for rest-crud-quarkus-openj9:j11 with JAVA_OPTS=-Xms128m -Xmx128m -Xshareclasses:none -XXgc:forcedShiftingCompressionAmount=3
Throughput: avg=12040.20 min=11931.00 max=12111.10 stdDev=57.7 maxVar=1.51% confInt=0.28% samples=10
Footprint: avg=123.01 min=105.90 max=129.90 stdDev=6.7 maxVar=22.66% confInt=3.17% samples=10
Stats for rest-crud-quarkus-openj9:j11 with JAVA_OPTS=-Xms128m -Xmx128m -Xshareclasses:none -XXgc:forcedShiftingCompressionAmount=0
Throughput: avg=12140.48 min=12065.70 max=12209.50 stdDev=48.3 maxVar=1.19% confInt=0.23% samples=10
Footprint: avg=125.31 min=120.40 max=129.30 stdDev=2.8 maxVar=7.39% confInt=1.28% samples=10
-Xms2g -Xmx2g -Xmn1g -Xgcpolicy:gencon -Xlp -Xcompressedrefs
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
---|---|---|---|---|
9962189 | 12177 | 4435 | 13838 | 13099 |
9962190 | 13688 | 5824 | 19279 | 16099 |
9962191 | 14006 | 5749 | 16099 | 13449 |
9962192 | 11277 | 4417 | 13587 | 13415 |
means | 12787 | 5106.25 | 15700.75 | 14015.5 |
medians | 12932.5 | 5092 | 14968.5 | 13432 |
confidence_interval | 0.15982393515568 | 0.24493717290884 | 0.26746374600271 | 0.1586869314041 |
min | 11277 | 4417 | 13587 | 13099 |
max | 14006 | 5824 | 19279 | 16099 |
stddev | 1284.5183273637 | 786.11592656554 | 2639.4603457273 | 1397.9111798203 |
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
---|---|---|---|---|
9962180 | 12718 | 4750 | 16099 | 14334 |
9962181 | 14972 | 5971 | 16099 | 13449 |
9962182 | 13040 | 4531 | 16099 | 13795 |
9962183 | 14167 | 5686 | 16099 | 13449 |
means | 13724.25 | 5234.5 | 16099 | 13756.75 |
medians | 13603.5 | 5218 | 16099 | 13622 |
confidence_interval | 0.12035581689665 | 0.21319070129398 | 0 | 0.048339382455907 |
min | 12718 | 4531 | 16099 | 13449 |
max | 14972 | 5971 | 16099 | 14334 |
stddev | 1038.2107605555 | 701.41214702912 | 0 | 417.97158994362 |
Added more runs
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
---|---|---|---|---|
9994492 | 13253 | 5439 | 16566 | 15678 |
9994493 | 13445 | 5534 | 14939 | 14322 |
9994494 | ||||
9994495 | 15133 | 6707 | 16099 | 13449 |
means | 13943.666666667 | 5893.3333333333 | 15868 | 14483 |
medians | 13445 | 5534 | 16099 | 14322 |
confidence_interval | 0.15737812859215 | 0.2542198958441 | 0.11199389127022 | 0.16451397337762 |
min | 13253 | 5439 | 14939 | 13449 |
max | 15133 | 6707 | 16566 | 15678 |
stddev | 1034.4570234347 | 706.25514747387 | 837.73683218538 | 1123.1878738662 |
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
---|---|---|---|---|
9994483 | 12508 | 5022 | 13449 | 13024 |
9994484 | 13146 | 5101 | 14939 | 14322 |
9994485 | 16099 | 6128 | 16099 | 13449 |
9994486 | 13362 | 5131 | 16099 | 13795 |
means | 13778.75 | 5345.5 | 15146.5 | 13647.5 |
medians | 13254 | 5116 | 15519 | 13622 |
confidence_interval | 0.18344971565841 | 0.1558672596321 | 0.13202131493524 | 0.064024675274026 |
min | 12508 | 5022 | 13449 | 13024 |
max | 16099 | 6128 | 16099 | 14322 |
stddev | 1588.754097818 | 523.68852065581 | 1256.8578545988 | 549.19972080595 |
I don't think this is a good benchmark for this purpose, as the fluctuations are too large.
That's a 6.8% drop in max_jOPS and a 2.5% drop in critical_jOPS. The fluctuations are huge though (min-max ~20%), so I am not sure we can draw any conclusions from such a small dataset.
My DT7 experiments with AOT enabled show a 2.1% regression when moving from shift 0 to shift 3
Results for JDK=/home/mpirvu/sdks/OpenJDK11U-jre_x64_linux_openj9_2020-05-21-10-15 jvmOpts=-Xms1024m -Xmx1024m -XXgc:forcedShiftingCompressionAmount=0
Throughput avg=3530.25 min=3434.60 max=3587.30 stdDev=44.9 maxVar=4.45% confInt=0.74% samples=10
CompTime avg=137425.30 min=128296.00 max=179999.00 stdDev=15107.7 maxVar=40.30% confInt=6.37% samples=10
Footprint avg=932900.80 min=912844.00 max=948924.00 stdDev=9184.8 maxVar=3.95% confInt=0.57% samples=10
Results for JDK=/home/mpirvu/sdks/OpenJDK11U-jre_x64_linux_openj9_2020-05-21-10-15 jvmOpts=-Xms1024m -Xmx1024m -XXgc:forcedShiftingCompressionAmount=3
Throughput avg=3455.40 min=3392.70 max=3521.00 stdDev=36.2 maxVar=3.78% confInt=0.61% samples=10
CompTime avg=139633.70 min=132116.00 max=182410.00 stdDev=15162.1 maxVar=38.07% confInt=6.29% samples=10
Footprint avg=930844.00 min=922164.00 max=945488.00 stdDev=7221.0 maxVar=2.53% confInt=0.45% samples=10
That's a 6.8% drop in max_jOPS and a 2.5% drop in critical_jOPS. The fluctuations are huge though (min-max ~20%), so I am not sure we can draw any conclusions from such a small dataset.
Not sure why the fluctuations are so large. Originally the heap size was set to 24GB; I had to change it to 2GB to be able to use shift0. Maybe the test does not work well with a smaller heap.
Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time |
---|---|---|---|---|---|---|
9994528 | 8638.9371504854 | 1.8573037069837 | 504.7412381469 | 582.0975 | 13.859526383526 | 14.445926640927 |
9994529 | 8762.1322573549 | 1.8264175610697 | 534.71616320959 | 560.30609923475 | 13.895542351454 | 14.478946902655 |
9994530 | 8740.2146440973 | 1.835861262807 | 506.15 | 586.72603318492 | 13.904566037736 | 14.485850314465 |
9994531 | 8664.0184392624 | 1.8513182105413 | 510.4425 | 579.65 | 14.166381979695 | 14.671502538071 |
9994532 | 8577.5554538932 | 1.8694810666913 | 509.735 | 571.67 | 13.984703208556 | 14.73531684492 |
9994533 | 8691.1423921766 | 1.8423403917498 | 525.7875 | 567.96858007855 | 14.010517902813 | 14.684122762148 |
9994534 | 8502.5874741253 | 1.8831568605721 | 494.7275 | 555.5325 | 14.109282694848 | 14.744678996037 |
9994535 | 8658.9934676143 | 1.8495857046636 | 515.9275 | 564.43 | 13.919508322663 | 14.490800256082 |
means | 8654.4476598762 | 1.8519330956348 | 512.77842516956 | 571.04758906228 | 13.981253610162 | 14.592143156913 |
medians | 8661.5059534384 | 1.8504519576024 | 510.08875 | 569.81929003927 | 13.95210576561 | 14.581151397076 |
confidence_interval | 0.0081348395932811 | 0.0082178422245874 | 0.02052919348065 | 0.016151072515423 | 0.0065275035989177 | 0.0073217207304309 |
min | 8502.5874741253 | 1.8264175610697 | 494.7275 | 555.5325 | 13.859526383526 | 14.445926640927 |
max | 8762.1322573549 | 1.8831568605721 | 534.71616320959 | 586.72603318492 | 14.166381979695 | 14.744678996037 |
stddev | 84.198081874972 | 0.018201070854603 | 12.589702870929 | 11.030304909653 | 0.10914581344745 | 0.1277750589018 |
Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time |
---|---|---|---|---|---|---|
9994541 | 8119.8086062203 | 1.9734693630701 | 482.52 | 533.94 | 13.927745554036 | 14.523974008208 |
9994542 | 8286.7210135417 | 1.9335406181712 | 484.4725 | 541.3 | 14.028153225806 | 14.621639784946 |
9994543 | 8238.9619674721 | 1.9421575015578 | 502.385 | 524.27 | 14.281521505376 | 14.873037634409 |
9994544 | 8388.4166737499 | 1.9096835513286 | 500.78124804688 | 547.5725 | 13.915002663116 | 14.43093608522 |
9994545 | 8408.0668386632 | 1.9034667560754 | 515.315 | 539.43 | 13.90702393617 | 14.452569148936 |
9994546 | 8298.7935361939 | 1.9281740577079 | 509.36622658443 | 526.865 | 13.899852393617 | 14.452348404255 |
9994547 | 8441.6219797253 | 1.8954390538826 | 523.06 | 533.69 | 14.077266311585 | 14.597009320905 |
9994548 | 8412.5731827431 | 1.9026361479098 | 513.48871627821 | 540.7625 | 13.878481333333 | 14.53224 |
means | 8324.3704747887 | 1.9235708812129 | 503.92358636369 | 535.97875 | 13.98938086538 | 14.56046929836 |
medians | 8343.6051049719 | 1.9189288045183 | 505.87561329222 | 536.685 | 13.921374108576 | 14.528107004104 |
confidence_interval | 0.010993300632059 | 0.011362164570798 | 0.024010815613053 | 0.012185469952163 | 0.0081766615307008 | 0.0082655166059187 |
min | 8119.8086062203 | 1.8954390538826 | 482.52 | 524.27 | 13.878481333333 | 14.43093608522 |
max | 8441.6219797253 | 1.9734693630701 | 523.06 | 547.5725 | 14.281521505376 | 14.873037634409 |
stddev | 109.44435177091 | 0.026138647857235 | 14.470563630049 | 7.8109472171159 | 0.13680071373816 | 0.14393261774689 |
Seeing a 4% drop in throughput on Power.
@andrewcraik @zl-wang see above overhead(s)
I have also updated the original post for SPECjbb2015GMR. I don't think we can draw any conclusions from that particular benchmark as we always seem to have large fluctuations (multiple attempts, and not a small dataset considering each run takes over 3 hours).
Most of the variability typically comes from the JIT, but this heap size is also fairly small for jbb2015 and may contribute as well (by causing large variations in the number of global GCs, which are relatively expensive).
If these tests are to be repeated, try a heap as big as possible while still being able to run shift0, with 2GB given to Tenure and the rest to Nursery (for example -Xmx3200M -Xms3200M -Xmn1200M).
Maybe due to the shift-3 case's bigger measurement variability, the overhead looked about twice as large as expected. We had prior experience with this overhead ... about 2-2.5%. It might be worth another, more stable measurement of the shift-3 case.
Most of the variability typically comes from the JIT, but this heap size is also fairly small for jbb2015 and may contribute as well (by causing large variations in the number of global GCs, which are relatively expensive).
If these tests are to be repeated, try a heap as big as possible while still being able to run shift0, with 2GB given to Tenure and the rest to Nursery (for example -Xmx3200M -Xms3200M -Xmn1200M).
Tried with -Xmx3200M -Xms3200M -Xmn1200M on x86.
The shift0 runs were pretty stable, the shift3 runs were not.
3.2% drop in max_jOPS and 2% drop in critical_jOPS.
Going to give this another try..
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
---|---|---|---|---|
9994645 | 18938 | 11502 | 23095 | 22360 |
9994646 | 18938 | 11227 | 23095 | 21847 |
9994647 | 19169 | 10885 | 23095 | 21419 |
9994648 | 21247 | 11684 | 23095 | 19279 |
means | 19573 | 11324.5 | 23095 | 21226.25 |
medians | 19053.5 | 11364.5 | 23095 | 21633 |
confidence_interval | 0.09114537985632 | 0.048897960523052 | 0 | 0.1014854988189 |
min | 18938 | 10885 | 23095 | 19279 |
max | 21247 | 11684 | 23095 | 22360 |
stddev | 1121.3001382324 | 348.04836828617 | 0 | 1353.9639027685 |
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
---|---|---|---|---|
9994636 | 20555 | 12086 | 23095 | 22548 |
9994637 | 20093 | 11388 | 23095 | 22360 |
9994638 | 20202 | 11982 | 27674 | 23095 |
9994639 | 20093 | 11496 | 23095 | 22706 |
means | 20235.75 | 11738 | 24239.75 | 22677.25 |
medians | 20147.5 | 11739 | 23095 | 22627 |
confidence_interval | 0.017214402858354 | 0.047064354097083 | 0.15027360018152 | 0.021914250955476 |
min | 20093 | 11388 | 23095 | 22360 |
max | 20555 | 12086 | 27674 | 23095 |
stddev | 218.94805319984 | 347.22903104435 | 2289.5 | 312.35383248276 |
Do you use -XXgc:forcedShiftingCompressionAmount=3 to force shift 3?
Do you use -XXgc:forcedShiftingCompressionAmount=3 to force shift 3?
Yes
With a larger dataset, I'm measuring a 3.5% throughput drop on Power.
First 8 runs: 8288.3387340932 vs. 8581.7531089419. Next 8 runs: 8316.9258529547 vs. 8627.474541907. Both show a ~3.5% throughput drop.
Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time |
---|---|---|---|---|---|---|
9994855 | 8227.0338779673 | 1.9450356503132 | 503.18874202814 | 526.035 | 13.904885598923 | 14.4253243607 |
9994856 | 8087.2274532643 | 1.9783969421294 | 494.9975 | 517.92093638361 | 13.935554945055 | 14.535127747253 |
9994857 | 8317.3519463409 | 1.9245225413624 | 500.05874985313 | 537.415 | 14.015772543742 | 14.649139973082 |
9994858 | 8286.6920988556 | 1.9310648048965 | 504.66 | 527.8886802783 | 14.190730201342 | 14.867684563758 |
9994859 | 8207.8741850326 | 1.9494444556917 | 506.135 | 519.855 | 13.985191117093 | 14.581004037685 |
9994860 | 8392.4408795395 | 1.907375533393 | 503.745 | 545.16 | 13.938728 | 14.444909333333 |
9994861 | 8334.6407679616 | 1.920294696344 | 499.66 | 533.34116664708 | 14.107698795181 | 14.611708165997 |
9994862 | 8453.4486637834 | 1.8928913679089 | 509.36872657818 | 532.835 | 14.051724842767 | 14.721377358491 |
means | 8288.3387340932 | 1.9311282490049 | 502.72671480743 | 530.05634791362 | 14.016285755513 | 14.604534442537 |
medians | 8302.0220225982 | 1.9277936731294 | 503.46687101407 | 530.36184013915 | 14.000481830417 | 14.596356101841 |
confidence_interval | 0.011550726630632 | 0.011514626729118 | 0.0073576725719721 | 0.014272083518414 | 0.0057905694289637 | 0.0083223576011986 |
min | 8087.2274532643 | 1.8928913679089 | 494.9975 | 517.92093638361 | 13.904885598923 | 14.4253243607 |
max | 8453.4486637834 | 1.9783969421294 | 509.36872657818 | 545.16 | 14.190730201342 | 14.867684563758 |
stddev | 114.49608734331 | 0.026593458983634 | 4.4237061399032 | 9.0473890683649 | 0.097066208198194 | 0.14536101225863 |
Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time |
---|---|---|---|---|---|---|
9994868 | 8429.0328467609 | 1.9004089367754 | 506.765 | 552.32 | 14.128936925099 | 14.630515111695 |
9994869 | 8687.8546821798 | 1.8426826705212 | 522.0325 | 564.32608918478 | 13.933602791878 | 14.440649746193 |
9994870 | 8468.1746712228 | 1.8895803645447 | 514.3225 | 542.93614265964 | 14.165972440945 | 14.769738845144 |
9994871 | 8627.8107263089 | 1.8540877613996 | 525.47618630953 | 544.83613790966 | 13.899932484076 | 14.414615286624 |
9994872 | 8670.780576214 | 1.8459311333678 | 526.85723142769 | 555.88111029722 | 14.051856780735 | 14.553642585551 |
9994873 | 8555.9774620284 | 1.8699833702903 | 527.8925 | 541.9025 | 13.964604139715 | 14.498144890039 |
9994874 | 8579.5265630195 | 1.8652614005592 | 516.575 | 546.735 | 13.892619607843 | 14.611988235294 |
9994875 | 8634.8673438012 | 1.854260194939 | 509.6125 | 568.4310789223 | 13.914638569604 | 14.44150063857 |
means | 8581.7531089419 | 1.8652744790497 | 518.69167721715 | 552.1710073717 | 13.994020467487 | 14.545099417389 |
medians | 8603.6686446642 | 1.8597607977491 | 519.30375 | 549.5275 | 13.949103465797 | 14.525893737795 |
confidence_interval | 0.0090968857000075 | 0.0092466689375262 | 0.013021956977964 | 0.015143005290046 | 0.0064300747604867 | 0.0069793709871253 |
min | 8429.0328467609 | 1.8426826705212 | 506.765 | 541.9025 | 13.892619607843 | 14.414615286624 |
max | 8687.8546821798 | 1.9004089367754 | 527.8925 | 568.4310789223 | 14.165972440945 | 14.769738845144 |
stddev | 93.364677712505 | 0.02062727721853 | 8.0779169549425 | 9.9999889949771 | 0.107614892342 | 0.12140786619901 |
Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time |
---|---|---|---|---|---|---|
9994930 | 8177.5107248928 | 1.9590636485304 | 487.47 | 541.2825 | 14.013406035665 | 14.80304526749 |
9994931 | 8432.8923393013 | 1.8986411307841 | 507.995 | 551.18 | 13.983329333333 | 14.492921333333 |
9994932 | 8228.0696491294 | 1.9454009387451 | 489.7725 | 529.6475 | 13.910288227334 | 14.588560216509 |
9994933 | 8403.3447909506 | 1.9049858210628 | 508.8425 | 543.6725 | 13.894525827815 | 14.501691390728 |
9994934 | 8385.7767133493 | 1.9085944995934 | 508.14622963443 | 541.58364604088 | 13.929327516778 | 14.648150335571 |
9994935 | 8330.2474844166 | 1.9258762781886 | 486.82378294054 | 558.22860442849 | 14.191576974565 | 14.858611780455 |
9994936 | 8267.1334929976 | 1.9375925939066 | 488.9325 | 546.98 | 14.161099319728 | 14.669331972789 |
9994937 | 8310.4316285999 | 1.9272640311479 | 494.9175 | 544.84113789716 | 14.038435549525 | 14.781244233378 |
means | 8316.9258529547 | 1.9259273677449 | 496.61250157187 | 544.67698604582 | 14.015248598093 | 14.667944566282 |
medians | 8320.3395565083 | 1.9265701546682 | 492.345 | 544.25681894858 | 13.998367684499 | 14.65874115418 |
confidence_interval | 0.0089671497091839 | 0.0091344216653754 | 0.016839629548937 | 0.012702234285658 | 0.0066472032951229 | 0.0078402754117856 |
min | 8177.5107248928 | 1.8986411307841 | 486.82378294054 | 529.6475 | 13.894525827815 | 14.492921333333 |
max | 8432.8923393013 | 1.9590636485304 | 508.8425 | 558.22860442849 | 14.191576974565 | 14.858611780455 |
stddev | 89.19306714941 | 0.021039470646772 | 10.001474451658 | 8.2743329580123 | 0.11141755278116 | 0.13753537856553 |
Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time |
---|---|---|---|---|---|---|
9994917 | 8635.7070897714 | 1.853296592018 | 524.31 | 556.5375 | 14.183319371728 | 14.806257853403 |
9994918 | 8621.7088487345 | 1.8558855922125 | 531.7725 | 545.95 | 13.98879791395 | 14.501852672751 |
9994919 | 8533.2685029174 | 1.8753910779895 | 522.7925 | 546.08863477841 | 13.958782664942 | 14.470852522639 |
9994920 | 8574.0599079032 | 1.8675608874445 | 513.61871595321 | 564.5675 | 13.962389175258 | 14.472323453608 |
9994921 | 8510.2283664363 | 1.8815516956704 | 505.67 | 550.7275 | 14.622058124174 | 15.321787318362 |
9994922 | 8703.9959200816 | 1.8385274078447 | 526.2775 | 557.65 | 13.938370656371 | 14.615827541828 |
9994923 | 8781.1981281123 | 1.8221623975359 | 535.02 | 559.17860205349 | 14.360284634761 | 14.88064231738 |
9994924 | 8659.6295712997 | 1.8498789700963 | 506.085 | 569.3425 | 14.483114068441 | 14.988921419518 |
means | 8627.474541907 | 1.8555318276015 | 520.69327699415 | 556.25527960399 | 14.187139576203 | 14.757308137436 |
medians | 8628.707969253 | 1.8545910921153 | 523.55125 | 557.09375 | 14.086058642839 | 14.711042697615 |
confidence_interval | 0.008675996428163 | 0.0087775211883997 | 0.017859539791627 | 0.012590065519452 | 0.015918315024142 | 0.017111525799816 |
min | 8510.2283664363 | 1.8221623975359 | 505.67 | 545.95 | 13.938370656371 | 14.470852522639 |
max | 8781.1981281123 | 1.8815516956704 | 535.02 | 569.3425 | 14.622058124174 | 15.321787318362 |
stddev | 89.519345731438 | 0.019478438704893 | 11.121569557209 | 8.37560108854 | 0.27008830852039 | 0.30200193835895 |
SPECjbb2015 on x86. No throughput drop observed this time. (I grabbed the build from a different location this time; the non-source-code version of that build.)
-Xmx3200M -Xms3200M -Xmn1200M
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
---|---|---|---|---|
9994899 | 20202 | 11928 | 27674 | 23095 |
9994900 | 19649 | 11962 | 23392 | 22775 |
9994901 | 20324 | 11498 | 23095 | 21847 |
9994902 | 20479 | 11678 | 27674 | 23095 |
means | 20163.5 | 11766.5 | 25458.75 | 22703 |
medians | 20263 | 11803 | 25533 | 22935 |
confidence_interval | 0.028503988445181 | 0.029647326634529 | 0.16003411426044 | 0.041365280701093 |
min | 19649 | 11498 | 23095 | 21847 |
max | 20479 | 11962 | 27674 | 23095 |
stddev | 361.24460780289 | 219.26163975184 | 2560.8224427581 | 590.26773586229 |
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
---|---|---|---|---|
9995099 | 20913 | 11952 | 24604 | 23716 |
9995100 | 19631 | 11039 | 23095 | 21062 |
9995101 | ||||
9995102 | ||||
means | 20272 | 11495.5 | 23849.5 | 22389 |
medians | 20272 | 11495.5 | 23849.5 | 22389 |
confidence_interval | 0.14229072923525 | 0.17870145527142 | 0.14236234682501 | 0.26671743115415 |
min | 19631 | 11039 | 23095 | 21062 |
max | 20913 | 11952 | 24604 | 23716 |
stddev | 906.51089348115 | 645.58849122332 | 1067.0241328105 | 1876.6613972691 |
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
---|---|---|---|---|
9994908 | 20785 | 12163 | 23095 | 20517 |
9994909 | 20093 | 11586 | 23095 | 22360 |
9994910 | 20785 | 11861 | 23095 | 21847 |
9994911 | 20324 | 12112 | 23095 | 20517 |
means | 20496.75 | 11930.5 | 23095 | 21310.25 |
medians | 20554.5 | 11986.5 | 23095 | 21182 |
confidence_interval | 0.026852923897151 | 0.035325331114737 | 0 | 0.070149805133889 |
min | 20093 | 11586 | 23095 | 20517 |
max | 20785 | 12163 | 23095 | 22360 |
stddev | 345.94448013133 | 264.89557691035 | 0 | 939.60395025422 |
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
---|---|---|---|---|
9995108 | 20755 | 12015 | 27674 | 23095 |
9995109 | 21309 | 12721 | 27674 | 23095 |
9995110 | 21016 | 12134 | 23095 | 21847 |
9995111 | 20755 | 12249 | 27674 | 23095 |
means | 20958.75 | 12279.75 | 26529.25 | 22783 |
medians | 20885.5 | 12191.5 | 27674 | 23095 |
confidence_interval | 0.020035367673363 | 0.040072637108334 | 0.13730484276789 | 0.043575648509854 |
min | 20755 | 12015 | 23095 | 21847 |
max | 21309 | 12721 | 27674 | 23095 |
stddev | 263.93228298183 | 309.29099027723 | 2289.5 | 624 |
Judging from the various experiments we tried, the overhead of shift3 on x86 isn't very significant.
There are two approaches that were brought up in the Portable SCC discussion regarding how to deal with the compressed refs shift potentially changing with the heap size.