dsouzai opened this issue 4 years ago
fyi @dmitripivkine @amicic
Can one of you comment on what would be affected in the GC (and rest of the runtime) by forcing the CR shift to always be 3? How would this affect the heap size? Occupancy? More wasted space? etc
also fyi @fjeremic @andrewcraik @gita-omr @knn-k to give a codegen perspective.
For a large part, the GC would not be affected (roots and meta-structures like the remembered set do not use CR). The biggest performance impact, I guess, would come from jitted code, which is more for the JIT folks to comment on, e.g. how an unnecessary shift for a <4GB heap would affect performance.
I don't see obvious footprint implications.
I don't remember if this was discussed already, but what I struggle with more is jitted code (non)portability in a unified CR/non-CR VM.
jitted code (non)portability in a unified CR/non-CR VM.
This was brought up in the discussion; running in CR / non-CR results in a different SCC. I suppose theoretically the same SCC could be used to store both CR and non-CR versions of a compiled method, but that's better discussed in another issue (which I can open if we feel it's a discussion worth having).
I don't remember if this was discussed already, but what I struggle with more is jitted code (non)portability in a unified CR/non-CR VM.
The CR-ness is encoded in the cache name today so that they are forced to be separate caches. I expect the unified CR/non-CR VM will still use separate caches as the initial approach.
fyi @dmitripivkine @amicic
Can one of you comment on what would be affected in the GC (and rest of the runtime) by forcing the CR shift to always be 3? How would this affect the heap size? Occupancy? More wasted space? etc
From the codegen perspective, proposed solution 1 is going to be less performant and harder to implement correctly; solution 2 will give the most performance but the least flexibility.
@fjeremic @andrewcraik Do you guys have any issues/concerns with moving forward with solution 2 here (fixing the shift value to 3 for portable AOT)?
Summary of the solution: when -XX:+PortableSharedCache is specified, the compressedrefs shift will be fixed to 3 (3 for <= 32GB and 4 for > 32GB, with no compressedrefs above 64GB; I could be wrong on these numbers).

FYI @mpirvu @vijaysun-omr @ymanton @dsouzai
the compressedrefs shift will be fixed to 3 (3 for <= 32GB and 4 for > 32GB).
Worth noting that there is a point when even shift 4 won't work, e.g. if the heap is so big that we have to run without compressedrefs; I don't know if the JVM will still explicitly require the user to pass in -Xnocompressedrefs once we have one build that can do both.
If we can't use compressedrefs, what should the portable AOT story be? Should we generate a SCC with the default CPU features & nocompressedrefs (thereby mandating that all future layers must have compressedrefs disabled)?
If we can't use compressedrefs, what should the portable AOT story be? Should we generate a SCC with the default CPU features & nocompressedrefs (thereby mandating that all future layers must have compressedrefs disabled)?
It sounds reasonable to me as we are giving the same treatment to shift 3/shift 4/nocompressedrefs. This shouldn't be that bad if most of the use cases fall under shift 3.
Note that compressedrefs and non-compressedrefs runs don't share any cache; there are different cache files for these at the moment.
Is it possible to estimate the impact of applying a constant shift of 3 in compiled code? In general I am supportive of the proposed direction you are taking, but I wanted to get a sense for how much slower we would be making portable AOT code because of the compressed refs shift. My expectation/hope would be that the overhead isn't much more than 5%, but this is what I'm asking to be measured.
Is it possible to estimate the impact of applying a constant shift of 3 in compiled code? In general I am supportive of the proposed direction you are taking, but I wanted to get a sense for how much slower we would be making portable AOT code because of the compressed refs shift. My expectation/hope would be that the overhead isn't much more than 5%, but this is what I'm asking to be measured.
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:name=liberty -Xscmx400M -Xscmaxaot256M -Xmx1G -XXgc:forcedShiftingCompressionAmount=0
Throughput avg=2850.45 min=2759.30 max=2900.90 stdDev=45.5 maxVar=5.13% confInt=1.07% samples= 8
Intermediate results:
Run 0 187.3 2575.9 2890.0 2861.1 Avg=2861 CPU=113492 ms Footprint=712364 KB
Run 1 200.6 2573.6 2887.4 2867.5 Avg=2868 CPU=108388 ms Footprint=700740 KB
Run 2 221.0 2579.6 2901.8 2759.3 Avg=2759 CPU=112528 ms Footprint=699352 KB
Run 3 222.1 2628.1 2830.4 2892.9 Avg=2893 CPU=107786 ms Footprint=706204 KB
Run 4 180.4 2628.9 2903.8 2830.5 Avg=2830 CPU=108510 ms Footprint=706704 KB
Run 5 226.5 2598.6 2705.5 2867.6 Avg=2868 CPU=112368 ms Footprint=713240 KB
Run 6 221.2 2647.2 2837.1 2900.9 Avg=2901 CPU=110313 ms Footprint=698736 KB
Run 7 231.8 2619.8 2928.2 2823.8 Avg=2824 CPU=110404 ms Footprint=707608 KB
CompTime avg=110473.62 min=107786.00 max=113492.00 stdDev=2150.7 maxVar=5.29% confInt=1.30% samples= 8
Footprint avg=705618.50 min=698736.00 max=713240.00 stdDev=5599.8 maxVar=2.08% confInt=0.53% samples= 8
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:name=liberty -Xscmx400M -Xscmaxaot256M -Xmx1G -XXgc:forcedShiftingCompressionAmount=3
Throughput avg=2819.82 min=2776.80 max=2861.30 stdDev=29.0 maxVar=3.04% confInt=0.69% samples= 8
Intermediate results:
Run 0 163.2 2442.4 2795.4 2776.8 Avg=2777 CPU=137886 ms Footprint=651608 KB
Run 1 144.9 2350.9 2826.5 2847.1 Avg=2847 CPU=137913 ms Footprint=636708 KB
Run 2 152.9 2429.0 2857.6 2826.9 Avg=2827 CPU=131592 ms Footprint=637768 KB
Run 3 174.9 2363.9 2790.5 2832.9 Avg=2833 CPU=140504 ms Footprint=642980 KB
Run 4 161.6 2433.0 2810.8 2803.1 Avg=2803 CPU=132384 ms Footprint=632412 KB
Run 5 139.9 2409.7 2819.2 2861.3 Avg=2861 CPU=132907 ms Footprint=649168 KB
Run 6 177.9 2467.1 2801.9 2787.1 Avg=2787 CPU=137302 ms Footprint=636512 KB
Run 7 178.0 2431.0 2764.5 2823.4 Avg=2823 CPU=133845 ms Footprint=638280 KB
CompTime avg=135541.62 min=131592.00 max=140504.00 stdDev=3256.5 maxVar=6.77% confInt=1.61% samples= 8
Footprint avg=640679.50 min=632412.00 max=651608.00 stdDev=6681.6 maxVar=3.04% confInt=0.70% samples= 8
Shift0:
run0: summary = 2938081 in 600s = 4896.7/s Avg: 1 Min: 0 Max: 891 Err: 0 (0.00%)
run1: summary = 3136727 in 600s = 5227.7/s Avg: 1 Min: 0 Max: 129 Err: 0 (0.00%)
run2: summary = 3147370 in 600s = 5245.4/s Avg: 1 Min: 0 Max: 109 Err: 0 (0.00%)
run3: summary = 3139280 in 600s = 5232.0/s Avg: 1 Min: 0 Max: 117 Err: 0 (0.00%)
run4: summary = 3133830 in 600s = 5222.8/s Avg: 1 Min: 0 Max: 79 Err: 0 (0.00%)
run5: summary = 3136712 in 600s = 5227.7/s Avg: 1 Min: 0 Max: 156 Err: 0 (0.00%)
5231.12
Shift3:
run0: summary = 2964754 in 600s = 4941.1/s Avg: 1 Min: 0 Max: 260 Err: 0 (0.00%)
run1: summary = 3137234 in 600s = 5228.3/s Avg: 1 Min: 0 Max: 124 Err: 0 (0.00%)
run2: summary = 3126874 in 600s = 5211.3/s Avg: 1 Min: 0 Max: 110 Err: 0 (0.00%)
run3: summary = 3139452 in 600s = 5232.2/s Avg: 1 Min: 0 Max: 64 Err: 0 (0.00%)
run4: summary = 3134675 in 600s = 5224.3/s Avg: 1 Min: 0 Max: 100 Err: 0 (0.00%)
run5: summary = 3139328 in 600s = 5232.1/s Avg: 1 Min: 0 Max: 113 Err: 0 (0.00%)
5225.64
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:none -Xnoaot -Xmx1G -XXgc:forcedShiftingCompressionAmount=0
Throughput avg=2997.16 min=2976.00 max=3022.10 stdDev=13.9 maxVar=1.55% confInt=0.31% samples= 8
Intermediate results:
Run 0 161.6 2661.4 3026.7 3022.1 Avg=3022 CPU=131266 ms Footprint=597944 KB
Run 1 175.7 2600.5 3029.5 2990.0 Avg=2990 CPU=127742 ms Footprint=594136 KB
Run 2 178.5 2622.8 3002.0 2984.6 Avg=2985 CPU=130536 ms Footprint=602824 KB
Run 3 161.8 2686.5 3003.7 3000.8 Avg=3001 CPU=129596 ms Footprint=596588 KB
Run 4 131.0 2617.7 2820.5 2976.0 Avg=2976 CPU=143361 ms Footprint=603160 KB
Run 5 157.4 2657.5 3016.9 2999.4 Avg=2999 CPU=129978 ms Footprint=604736 KB
Run 6 175.0 2656.1 2978.4 3001.4 Avg=3001 CPU=130472 ms Footprint=592384 KB
Run 7 192.8 2599.7 3042.6 3003.0 Avg=3003 CPU=130239 ms Footprint=613256 KB
CompTime avg=131648.75 min=127742.00 max=143361.00 stdDev=4843.3 maxVar=12.23% confInt=2.46% samples= 8
Footprint avg=600628.50 min=592384.00 max=613256.00 stdDev=6774.0 maxVar=3.52% confInt=0.76% samples= 8
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:none -Xnoaot -Xmx1G -XXgc:forcedShiftingCompressionAmount=3
Throughput avg=2933.98 min=2877.50 max=2974.10 stdDev=29.9 maxVar=3.36% confInt=0.68% samples= 8
Intermediate results:
Run 0 173.7 2609.1 2960.8 2918.4 Avg=2918 CPU=133268 ms Footprint=599836 KB
Run 1 179.3 2641.4 2952.0 2920.2 Avg=2920 CPU=128775 ms Footprint=597212 KB
Run 2 154.8 2589.6 2955.9 2956.3 Avg=2956 CPU=131750 ms Footprint=606756 KB
Run 3 191.9 2553.4 2955.2 2945.6 Avg=2946 CPU=129687 ms Footprint=597068 KB
Run 4 149.1 2640.5 2965.7 2974.1 Avg=2974 CPU=129883 ms Footprint=600332 KB
Run 5 193.3 2638.6 2941.3 2927.3 Avg=2927 CPU=127431 ms Footprint=603852 KB
Run 6 168.7 2530.4 2892.4 2877.5 Avg=2878 CPU=145817 ms Footprint=596084 KB
Run 7 176.7 2569.6 2945.2 2952.4 Avg=2952 CPU=134123 ms Footprint=596880 KB
CompTime avg=132591.75 min=127431.00 max=145817.00 stdDev=5798.9 maxVar=14.43% confInt=2.93% samples= 8
Footprint avg=599752.50 min=596084.00 max=606756.00 stdDev=3809.2 maxVar=1.79% confInt=0.43% samples= 8
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:name=liberty -Xscmx400M -Xscmaxaot256M -Xmx1G -XXgc:forcedShiftingCompressionAmount=0
Throughput avg=2964.54 min=2933.90 max=3002.30 stdDev=25.2 maxVar=2.33% confInt=0.57% samples= 8
Intermediate results:
Run 0 220.7 2656.6 2940.8 2933.9 Avg=2934 CPU=103509 ms Footprint=712676 KB
Run 1 228.4 2674.1 2920.4 2950.9 Avg=2951 CPU=105973 ms Footprint=719892 KB
Run 2 223.8 2727.0 2960.8 2947.2 Avg=2947 CPU=103124 ms Footprint=704384 KB
Run 3 215.6 2704.4 2978.5 2978.7 Avg=2979 CPU=103663 ms Footprint=709576 KB
Run 4 235.9 2666.1 2967.8 3002.3 Avg=3002 CPU=103964 ms Footprint=710316 KB
Run 5 218.4 2676.8 2964.8 2997.3 Avg=2997 CPU=101415 ms Footprint=704660 KB
Run 6 176.1 2719.4 2953.1 2958.0 Avg=2958 CPU=103691 ms Footprint=726336 KB
Run 7 214.4 2654.4 2957.3 2948.0 Avg=2948 CPU=106512 ms Footprint=714952 KB
CompTime avg=103981.38 min=101415.00 max=106512.00 stdDev=1608.1 maxVar=5.03% confInt=1.04% samples= 8
Footprint avg=712849.00 min=704384.00 max=726336.00 stdDev=7481.4 maxVar=3.12% confInt=0.70% samples= 8
Results for JDK=/home/harryayu2/compressedShift/j2sdk-image jvmOpts=-Xshareclasses:name=liberty -Xscmx400M -Xscmaxaot256M -Xmx1G -XXgc:forcedShiftingCompressionAmount=3
Throughput avg=2889.20 min=2842.80 max=2943.50 stdDev=37.0 maxVar=3.54% confInt=0.86% samples= 8
Intermediate results:
Run 0 181.7 2536.2 2979.7 2933.4 Avg=2933 CPU=123566 ms Footprint=693680 KB
Run 1 185.0 2523.2 2887.1 2910.4 Avg=2910 CPU=128682 ms Footprint=640464 KB
Run 2 178.7 2494.4 2877.1 2943.5 Avg=2944 CPU=128197 ms Footprint=637768 KB
Run 3 175.7 2553.8 2879.9 2889.7 Avg=2890 CPU=129298 ms Footprint=639216 KB
Run 4 157.6 2498.3 2862.5 2875.5 Avg=2876 CPU=123893 ms Footprint=640976 KB
Run 5 187.6 2497.7 2868.2 2842.8 Avg=2843 CPU=125097 ms Footprint=632568 KB
Run 6 178.9 2420.8 2828.6 2865.9 Avg=2866 CPU=127235 ms Footprint=637264 KB
Run 7 173.4 2305.5 2862.9 2852.4 Avg=2852 CPU=119157 ms Footprint=636492 KB
CompTime avg=125640.62 min=119157.00 max=129298.00 stdDev=3410.0 maxVar=8.51% confInt=1.82% samples= 8
Footprint avg=644803.50 min=632568.00 max=693680.00 stdDev=19923.9 maxVar=9.66% confInt=2.07% samples= 8
I have spent some time coming up with an implementation and here's the whole story:

First, a recap of the original compressed shift design: when -XX:+PortableSharedCache is specified during the cold run, if the compressed shift value is <= 3 then 3 will be used and persisted to the shared class cache; if the compressed shift value is 4 then 4 will be used (the persisted shift value is then picked up from the cache when -XX:+PortableSharedCache is specified during the warm run).

I proceeded to implement this, and have found some limitations with our existing infrastructure in the codebase:

- Although the CR shift is set in initializeRunTimeObjectAlignmentAndCRShift(), that point is in fact earlier than the earliest point the VM is able to load the SCC. As a result, with the current infrastructure it may not be possible to pick up the CR shift value from the SCC and then set it on the current JVM.
- -XX:+PortableSharedCache is yet to be parsed and processed when initializeRunTimeObjectAlignmentAndCRShift() is called, but this looks possible to work around without too much effort.

Due to these limitations, I'm proposing an alternative solution: when -XX:+PortableSharedCache is specified, if the compressed shift value is <= 3 we will fix the compressed shift to 3, and if the compressed shift value is 4 then we will generate a warning message to the user that the heap may be too large for the portableSharedCache option to work.

@vijaysun-omr @mpirvu @dsouzai Let me know if there are any concerns with the alternative solution.
@dmitripivkine Does the CR shift change the object alignment requirements? Does forcing a 3-bit shift mean small heaps will waste more space and therefore incur higher GC overhead?
if the compressed shift value is 4 then we will generate a warning message to the user that the heap may be too large for the portableSharedCache option to work.
I assume we will continue to generate AOT which is portable from the processor point of view.
@dmitripivkine Does the CR shift change the object alignment requirements? Does forcing a 3-bit shift mean small heaps will waste more space and therefore incur higher GC overhead?
Only for the 4-bit shift (which requires 16-byte alignment for objects). All other cases are covered by the minimum heap object alignment of 8 bytes.
I assume we will continue to generate AOT which is portable from the processor point of view.
Yes, we will always use the portable processor feature set when -XX:+PortableSharedCache is specified.
@DanHeidinga @vijaysun-omr Moving the question here: https://github.com/eclipse/omr/pull/5436 forces any run in a container to use shift 3 (except when it needs to use shift 4). It obviously prevents the use of the most performant shift 0 for small container applications. Would you please confirm this is the desired behaviour?
I think in containers the plan is to sacrifice a bit of performance in exchange for maximum portability. I have run some experiments comparing shift0 and shift3 and didn't see a significant throughput drop. I'll leave the decision to Vijay @vijaysun-omr though.
Moving the question here: eclipse/omr#5436 forces any run in a container to use shift 3 (except when it needs to use shift 4). It obviously prevents the use of the most performant shift 0 for small container applications. Would you please confirm this is the desired behaviour?
My understanding is that this is the expected behaviour only when -XX:+PortableSharedCache
is specified and in that case we accept the tradeoff for better portability.
My understanding is that this is the expected behaviour only when
-XX:+PortableSharedCache
is specified and in that case we accept the tradeoff for better portability.
Yes, that is correct. But in containers the PortableSharedCache feature is enabled by default: the portable processor feature set will be used by default for AOT compilations unless disabled by -XX:-PortableSharedCache. The question here is whether we want to also have the shift set to 3 by default in containers.
The question here is whether we want to also have the shift set to 3 by default in containers.
I would say yes, but only if AOT is enabled.
The shift by 3 code is only generated for an AOT compilation in containers.
So, in the unaffected category are: 1) JIT compilations inside or outside containers and 2) AOT compilations outside containers.
This is a conscious choice made in the same philosophy as the change to use the "portable processor feature set", i.e. we want to make the code portable at a small performance cost in containers (in the case of the compressed refs shift, the cost was negligible as @harryyu1994 measured). Since AOT compilations can be (and usually are) recompiled as JIT compilations if they are deemed important for peak throughput (and since this is neither the first nor the most significant way that AOT compilations are worse than JIT compilations), the steady-state impact won't even be as much as measured in the AOT experiments with and without portability changes. During the startup and rampup phases, when AOT code is used heavily, these minor performance differences are unlikely to be a big enough deal to compromise on portability in containers.
The shift by 3 code is only generated for an AOT compilation in containers.
So, in the unaffected category are: 1) JIT compilations inside or outside containers and 2) AOT compilations outside containers.
This is a conscious choice made in the same philosophy as the change to use the "portable processor feature set", i.e. we want to make the code portable at a small performance cost in containers (in the case of the compressed refs shift, the cost was negligible as @harryyu1994 measured). Since AOT compilations can be (and usually are) recompiled as JIT compilations if they are deemed important for peak throughput (and since this is neither the first nor the most significant way that AOT compilations are worse than JIT compilations), the steady-state impact won't even be as much as measured in the AOT experiments with and without portability changes. During the startup and rampup phases, when AOT code is used heavily, these minor performance differences are unlikely to be a big enough deal to compromise on portability in containers.
Correct me if I'm wrong, but I thought the JIT compilations inside containers would also have to use shift3 if we made the AOT compilations shift by 3. So JIT compilations inside containers are affected (though we didn't see a throughput drop in my experiment when comparing AOT+JIT shift0 vs. AOT+JIT shift3).
The question here is whether we want to also have the shift set to 3 by default in containers.
I would say yes, but only if AOT is enabled.
It may not be possible to check whether AOT is enabled this early.
```c
enum INIT_STAGE {
	PORT_LIBRARY_GUARANTEED,      /* 0 */
	ALL_DEFAULT_LIBRARIES_LOADED, /* 1 */
	ALL_LIBRARIES_LOADED,         /* 2 */
	DLL_LOAD_TABLE_FINALIZED,     /* 3 - consume JIT-specific -X options */
	VM_THREADING_INITIALIZED,     /* 4 */
	HEAP_STRUCTURES_INITIALIZED,  /* 5 */
	ALL_VM_ARGS_CONSUMED,         /* 6 */
	/* ... */
```
The shift is set at ALL_LIBRARIES_LOADED, very early in the initialization.
Looking at the code https://github.com/eclipse/openj9/blob/master/runtime/shared/shrclssup.c#L34-L66 I see that vm->sharedCacheAPI->sharedCacheEnabled is set very early and SCC options are also parsed very early. But then there is this piece of code in the same function: https://github.com/eclipse/openj9/blob/master/runtime/shared/shrclssup.c#L304-L328 which deals with the -Xshareclasses:none option. I wonder why this piece of code can't be grouped with the previous one; that way we would know very early whether the SCC is 'likely' going to be used. @hangshao0
@harryyu1994 Discussing with @mpirvu some more, I feel that we need more data points if we are going to slow down JITed code (in addition to AOTed code) inside containers. Could you please run SPECjbb2015 (please ask Piyush if you need help with accessing a setup for it) and maybe SPECjbb2005 (that is much easier to run) and check what the throughput overhead is ?
Additionally, the overhead of the shift would be platform dependent and so if one wanted to take a design decision for all platforms, the effect of the shift ought to be measured on the other platforms first.
I would also add quarkus throughput experiments since quarkus is more likely to be run in containers.
But then there is this piece of code in the same function: https://github.com/eclipse/openj9/blob/master/runtime/shared/shrclssup.c#L304-L328 which deals with the -Xshareclasses:none option. I wonder why this piece of code can't be grouped with the previous one. That way we would know very early whether the SCC is 'likely' going to be used.
Looking at the code, what it does is unload the SCC dll if -Xshareclasses:none is present. I guess that is the reason why it is done in stage DLL_LOAD_TABLE_FINALIZED: once the SCC dll is unloaded, all SCC-related functionality becomes inactive.
unload the SCC dll if -Xshareclasses:none is present
This means that we load the SCC dll before checking the command-line options. Be that as it may, we could add another check for -Xshareclasses:none when SCC options are parsed.
we could add another check for -Xshareclasses:none when SCC options are parsed.
Yes. It looks fine to me if another check for -Xshareclasses:none is added in the block at L34 to L66.
Quarkus+CRUD on x86 loses 0.9% in throughput when we force shift3 instead of shift0 for compressedrefs:
Stats for rest-crud-quarkus-openj9:j11 with JAVA_OPTS=-Xms128m -Xmx128m -Xshareclasses:none -XXgc:forcedShiftingCompressionAmount=3
Throughput: avg=12040.20 min=11931.00 max=12111.10 stdDev=57.7 maxVar=1.51% confInt=0.28% samples=10
Footprint: avg=123.01 min=105.90 max=129.90 stdDev=6.7 maxVar=22.66% confInt=3.17% samples=10
Stats for rest-crud-quarkus-openj9:j11 with JAVA_OPTS=-Xms128m -Xmx128m -Xshareclasses:none -XXgc:forcedShiftingCompressionAmount=0
Throughput: avg=12140.48 min=12065.70 max=12209.50 stdDev=48.3 maxVar=1.19% confInt=0.23% samples=10
Footprint: avg=125.31 min=120.40 max=129.30 stdDev=2.8 maxVar=7.39% confInt=1.28% samples=10
-Xms2g -Xmx2g -Xmn1g -Xgcpolicy:gencon -Xlp -Xcompressedrefs
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
---|---|---|---|---|
9962189 | 12177 | 4435 | 13838 | 13099 |
9962190 | 13688 | 5824 | 19279 | 16099 |
9962191 | 14006 | 5749 | 16099 | 13449 |
9962192 | 11277 | 4417 | 13587 | 13415 |
means | 12787 | 5106.25 | 15700.75 | 14015.5 |
medians | 12932.5 | 5092 | 14968.5 | 13432 |
confidence_interval | 0.15982393515568 | 0.24493717290884 | 0.26746374600271 | 0.1586869314041 |
min | 11277 | 4417 | 13587 | 13099 |
max | 14006 | 5824 | 19279 | 16099 |
stddev | 1284.5183273637 | 786.11592656554 | 2639.4603457273 | 1397.9111798203 |
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
---|---|---|---|---|
9962180 | 12718 | 4750 | 16099 | 14334 |
9962181 | 14972 | 5971 | 16099 | 13449 |
9962182 | 13040 | 4531 | 16099 | 13795 |
9962183 | 14167 | 5686 | 16099 | 13449 |
means | 13724.25 | 5234.5 | 16099 | 13756.75 |
medians | 13603.5 | 5218 | 16099 | 13622 |
confidence_interval | 0.12035581689665 | 0.21319070129398 | 0 | 0.048339382455907 |
min | 12718 | 4531 | 16099 | 13449 |
max | 14972 | 5971 | 16099 | 14334 |
stddev | 1038.2107605555 | 701.41214702912 | 0 | 417.97158994362 |
Added more runs
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
---|---|---|---|---|
9994492 | 13253 | 5439 | 16566 | 15678 |
9994493 | 13445 | 5534 | 14939 | 14322 |
9994494 | ||||
9994495 | 15133 | 6707 | 16099 | 13449 |
means | 13943.666666667 | 5893.3333333333 | 15868 | 14483 |
medians | 13445 | 5534 | 16099 | 14322 |
confidence_interval | 0.15737812859215 | 0.2542198958441 | 0.11199389127022 | 0.16451397337762 |
min | 13253 | 5439 | 14939 | 13449 |
max | 15133 | 6707 | 16566 | 15678 |
stddev | 1034.4570234347 | 706.25514747387 | 837.73683218538 | 1123.1878738662 |
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
---|---|---|---|---|
9994483 | 12508 | 5022 | 13449 | 13024 |
9994484 | 13146 | 5101 | 14939 | 14322 |
9994485 | 16099 | 6128 | 16099 | 13449 |
9994486 | 13362 | 5131 | 16099 | 13795 |
means | 13778.75 | 5345.5 | 15146.5 | 13647.5 |
medians | 13254 | 5116 | 15519 | 13622 |
confidence_interval | 0.18344971565841 | 0.1558672596321 | 0.13202131493524 | 0.064024675274026 |
min | 12508 | 5022 | 13449 | 13024 |
max | 16099 | 6128 | 16099 | 14322 |
stddev | 1588.754097818 | 523.68852065581 | 1256.8578545988 | 549.19972080595 |
I don't think this is a good benchmark for this purpose, as the fluctuations are too large.
That's a 6.8% drop in max_jOPS and a 2.5% drop in critical_jOPS. The fluctuations are huge though (min-max ~20%), so I am not sure we can draw any conclusions from such a small dataset.
My DT7 experiments with AOT enabled show a 2.1% regression when moving from shift 0 to shift 3
Results for JDK=/home/mpirvu/sdks/OpenJDK11U-jre_x64_linux_openj9_2020-05-21-10-15 jvmOpts=-Xms1024m -Xmx1024m -XXgc:forcedShiftingCompressionAmount=0
Throughput avg=3530.25 min=3434.60 max=3587.30 stdDev=44.9 maxVar=4.45% confInt=0.74% samples=10
CompTime avg=137425.30 min=128296.00 max=179999.00 stdDev=15107.7 maxVar=40.30% confInt=6.37% samples=10
Footprint avg=932900.80 min=912844.00 max=948924.00 stdDev=9184.8 maxVar=3.95% confInt=0.57% samples=10
Results for JDK=/home/mpirvu/sdks/OpenJDK11U-jre_x64_linux_openj9_2020-05-21-10-15 jvmOpts=-Xms1024m -Xmx1024m -XXgc:forcedShiftingCompressionAmount=3
Throughput avg=3455.40 min=3392.70 max=3521.00 stdDev=36.2 maxVar=3.78% confInt=0.61% samples=10
CompTime avg=139633.70 min=132116.00 max=182410.00 stdDev=15162.1 maxVar=38.07% confInt=6.29% samples=10
Footprint avg=930844.00 min=922164.00 max=945488.00 stdDev=7221.0 maxVar=2.53% confInt=0.45% samples=10
That's a 6.8% drop in max_jOPS and a 2.5% drop in critical_jOPS. The fluctuations are huge though (min-max ~20%), so I am not sure we can draw any conclusions from such a small dataset.
Not sure why the fluctuations are so large. Originally the heap size was set to 24GB; I had to change it to 2GB to be able to use shift0. Maybe the test does not work well with a smaller heap.
Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time |
---|---|---|---|---|---|---|
9994528 | 8638.9371504854 | 1.8573037069837 | 504.7412381469 | 582.0975 | 13.859526383526 | 14.445926640927 |
9994529 | 8762.1322573549 | 1.8264175610697 | 534.71616320959 | 560.30609923475 | 13.895542351454 | 14.478946902655 |
9994530 | 8740.2146440973 | 1.835861262807 | 506.15 | 586.72603318492 | 13.904566037736 | 14.485850314465 |
9994531 | 8664.0184392624 | 1.8513182105413 | 510.4425 | 579.65 | 14.166381979695 | 14.671502538071 |
9994532 | 8577.5554538932 | 1.8694810666913 | 509.735 | 571.67 | 13.984703208556 | 14.73531684492 |
9994533 | 8691.1423921766 | 1.8423403917498 | 525.7875 | 567.96858007855 | 14.010517902813 | 14.684122762148 |
9994534 | 8502.5874741253 | 1.8831568605721 | 494.7275 | 555.5325 | 14.109282694848 | 14.744678996037 |
9994535 | 8658.9934676143 | 1.8495857046636 | 515.9275 | 564.43 | 13.919508322663 | 14.490800256082 |
means | 8654.4476598762 | 1.8519330956348 | 512.77842516956 | 571.04758906228 | 13.981253610162 | 14.592143156913 |
medians | 8661.5059534384 | 1.8504519576024 | 510.08875 | 569.81929003927 | 13.95210576561 | 14.581151397076 |
confidence_interval | 0.0081348395932811 | 0.0082178422245874 | 0.02052919348065 | 0.016151072515423 | 0.0065275035989177 | 0.0073217207304309 |
min | 8502.5874741253 | 1.8264175610697 | 494.7275 | 555.5325 | 13.859526383526 | 14.445926640927 |
max | 8762.1322573549 | 1.8831568605721 | 534.71616320959 | 586.72603318492 | 14.166381979695 | 14.744678996037 |
stddev | 84.198081874972 | 0.018201070854603 | 12.589702870929 | 11.030304909653 | 0.10914581344745 | 0.1277750589018 |
Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time |
---|---|---|---|---|---|---|
9994541 | 8119.8086062203 | 1.9734693630701 | 482.52 | 533.94 | 13.927745554036 | 14.523974008208 |
9994542 | 8286.7210135417 | 1.9335406181712 | 484.4725 | 541.3 | 14.028153225806 | 14.621639784946 |
9994543 | 8238.9619674721 | 1.9421575015578 | 502.385 | 524.27 | 14.281521505376 | 14.873037634409 |
9994544 | 8388.4166737499 | 1.9096835513286 | 500.78124804688 | 547.5725 | 13.915002663116 | 14.43093608522 |
9994545 | 8408.0668386632 | 1.9034667560754 | 515.315 | 539.43 | 13.90702393617 | 14.452569148936 |
9994546 | 8298.7935361939 | 1.9281740577079 | 509.36622658443 | 526.865 | 13.899852393617 | 14.452348404255 |
9994547 | 8441.6219797253 | 1.8954390538826 | 523.06 | 533.69 | 14.077266311585 | 14.597009320905 |
9994548 | 8412.5731827431 | 1.9026361479098 | 513.48871627821 | 540.7625 | 13.878481333333 | 14.53224 |
means | 8324.3704747887 | 1.9235708812129 | 503.92358636369 | 535.97875 | 13.98938086538 | 14.56046929836 |
medians | 8343.6051049719 | 1.9189288045183 | 505.87561329222 | 536.685 | 13.921374108576 | 14.528107004104 |
confidence_interval | 0.010993300632059 | 0.011362164570798 | 0.024010815613053 | 0.012185469952163 | 0.0081766615307008 | 0.0082655166059187 |
min | 8119.8086062203 | 1.8954390538826 | 482.52 | 524.27 | 13.878481333333 | 14.43093608522 |
max | 8441.6219797253 | 1.9734693630701 | 523.06 | 547.5725 | 14.281521505376 | 14.873037634409 |
stddev | 109.44435177091 | 0.026138647857235 | 14.470563630049 | 7.8109472171159 | 0.13680071373816 | 0.14393261774689 |
Seeing a 4% drop in throughput on Power.
@andrewcraik @zl-wang see above overhead(s)
I have also updated the original post for SPECjbb2015GMR. I don't think we can draw any conclusions from that particular benchmark as we always seem to have large fluctuations (multiple attempts, and not a small dataset considering each run takes over 3 hours).
Most of the variability typically comes from the JIT, but this heap size is also fairly small for jbb2015 and may contribute as well (by causing large variations in the number of global GCs, which are relatively expensive).
If these tests are to be repeated, try a heap as big as possible while still being able to run shift0, with 2GB given to Tenure and the rest to Nursery (for example -Xmx3200M -Xms3200M -Xmn1200M).
Maybe due to the shift-3 case's bigger measurement variability, the overhead looked about twice as large as expected. We had prior experience with this overhead ... about 2-2.5%. It might be worth another, more stable measurement of the shift-3 case.
Most of the variability typically comes from the JIT, but this heap size is also fairly small for jbb2015 and may contribute as well (by causing large variations in the number of global GCs, which are relatively expensive).
If these tests are to be repeated, try a heap as big as possible while still being able to run shift0, with 2GB given to Tenure and the rest to Nursery (for example -Xmx3200M -Xms3200M -Xmn1200M).
Tried with -Xmx3200M -Xms3200M -Xmn1200M on x86.
The shift0 runs were pretty stable, the shift3 runs were not.
3.2% drop in max_jOPS and 2% drop in critical_jOPS.
Going to give this another try..
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
---|---|---|---|---|
9994645 | 18938 | 11502 | 23095 | 22360 |
9994646 | 18938 | 11227 | 23095 | 21847 |
9994647 | 19169 | 10885 | 23095 | 21419 |
9994648 | 21247 | 11684 | 23095 | 19279 |
means | 19573 | 11324.5 | 23095 | 21226.25 |
medians | 19053.5 | 11364.5 | 23095 | 21633 |
confidence_interval | 0.09114537985632 | 0.048897960523052 | 0 | 0.1014854988189 |
min | 18938 | 10885 | 23095 | 19279 |
max | 21247 | 11684 | 23095 | 22360 |
stddev | 1121.3001382324 | 348.04836828617 | 0 | 1353.9639027685 |
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
---|---|---|---|---|
9994636 | 20555 | 12086 | 23095 | 22548 |
9994637 | 20093 | 11388 | 23095 | 22360 |
9994638 | 20202 | 11982 | 27674 | 23095 |
9994639 | 20093 | 11496 | 23095 | 22706 |
means | 20235.75 | 11738 | 24239.75 | 22677.25 |
medians | 20147.5 | 11739 | 23095 | 22627 |
confidence_interval | 0.017214402858354 | 0.047064354097083 | 0.15027360018152 | 0.021914250955476 |
min | 20093 | 11388 | 23095 | 22360 |
max | 20555 | 12086 | 27674 | 23095 |
stddev | 218.94805319984 | 347.22903104435 | 2289.5 | 312.35383248276 |
Do you use -XXgc:forcedShiftingCompressionAmount=3 to force shift 3?
Do you use -XXgc:forcedShiftingCompressionAmount=3 to force shift 3?
Yes
With a larger dataset, I'm measuring a 3.5% throughput drop on Power.
First 8 runs: 8288.3387340932 vs. 8581.7531089419. Next 8 runs: 8316.9258529547 vs. 8627.474541907. Both show a ~3.5% throughput drop.
Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time |
---|---|---|---|---|---|---|
9994855 | 8227.0338779673 | 1.9450356503132 | 503.18874202814 | 526.035 | 13.904885598923 | 14.4253243607 |
9994856 | 8087.2274532643 | 1.9783969421294 | 494.9975 | 517.92093638361 | 13.935554945055 | 14.535127747253 |
9994857 | 8317.3519463409 | 1.9245225413624 | 500.05874985313 | 537.415 | 14.015772543742 | 14.649139973082 |
9994858 | 8286.6920988556 | 1.9310648048965 | 504.66 | 527.8886802783 | 14.190730201342 | 14.867684563758 |
9994859 | 8207.8741850326 | 1.9494444556917 | 506.135 | 519.855 | 13.985191117093 | 14.581004037685 |
9994860 | 8392.4408795395 | 1.907375533393 | 503.745 | 545.16 | 13.938728 | 14.444909333333 |
9994861 | 8334.6407679616 | 1.920294696344 | 499.66 | 533.34116664708 | 14.107698795181 | 14.611708165997 |
9994862 | 8453.4486637834 | 1.8928913679089 | 509.36872657818 | 532.835 | 14.051724842767 | 14.721377358491 |
means | 8288.3387340932 | 1.9311282490049 | 502.72671480743 | 530.05634791362 | 14.016285755513 | 14.604534442537 |
medians | 8302.0220225982 | 1.9277936731294 | 503.46687101407 | 530.36184013915 | 14.000481830417 | 14.596356101841 |
confidence_interval | 0.011550726630632 | 0.011514626729118 | 0.0073576725719721 | 0.014272083518414 | 0.0057905694289637 | 0.0083223576011986 |
min | 8087.2274532643 | 1.8928913679089 | 494.9975 | 517.92093638361 | 13.904885598923 | 14.4253243607 |
max | 8453.4486637834 | 1.9783969421294 | 509.36872657818 | 545.16 | 14.190730201342 | 14.867684563758 |
stddev | 114.49608734331 | 0.026593458983634 | 4.4237061399032 | 9.0473890683649 | 0.097066208198194 | 0.14536101225863 |
Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time |
---|---|---|---|---|---|---|
9994868 | 8429.0328467609 | 1.9004089367754 | 506.765 | 552.32 | 14.128936925099 | 14.630515111695 |
9994869 | 8687.8546821798 | 1.8426826705212 | 522.0325 | 564.32608918478 | 13.933602791878 | 14.440649746193 |
9994870 | 8468.1746712228 | 1.8895803645447 | 514.3225 | 542.93614265964 | 14.165972440945 | 14.769738845144 |
9994871 | 8627.8107263089 | 1.8540877613996 | 525.47618630953 | 544.83613790966 | 13.899932484076 | 14.414615286624 |
9994872 | 8670.780576214 | 1.8459311333678 | 526.85723142769 | 555.88111029722 | 14.051856780735 | 14.553642585551 |
9994873 | 8555.9774620284 | 1.8699833702903 | 527.8925 | 541.9025 | 13.964604139715 | 14.498144890039 |
9994874 | 8579.5265630195 | 1.8652614005592 | 516.575 | 546.735 | 13.892619607843 | 14.611988235294 |
9994875 | 8634.8673438012 | 1.854260194939 | 509.6125 | 568.4310789223 | 13.914638569604 | 14.44150063857 |
means | 8581.7531089419 | 1.8652744790497 | 518.69167721715 | 552.1710073717 | 13.994020467487 | 14.545099417389 |
medians | 8603.6686446642 | 1.8597607977491 | 519.30375 | 549.5275 | 13.949103465797 | 14.525893737795 |
confidence_interval | 0.0090968857000075 | 0.0092466689375262 | 0.013021956977964 | 0.015143005290046 | 0.0064300747604867 | 0.0069793709871253 |
min | 8429.0328467609 | 1.8426826705212 | 506.765 | 541.9025 | 13.892619607843 | 14.414615286624 |
max | 8687.8546821798 | 1.9004089367754 | 527.8925 | 568.4310789223 | 14.165972440945 | 14.769738845144 |
stddev | 93.364677712505 | 0.02062727721853 | 8.0779169549425 | 9.9999889949771 | 0.107614892342 | 0.12140786619901 |
Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time |
---|---|---|---|---|---|---|
9994930 | 8177.5107248928 | 1.9590636485304 | 487.47 | 541.2825 | 14.013406035665 | 14.80304526749 |
9994931 | 8432.8923393013 | 1.8986411307841 | 507.995 | 551.18 | 13.983329333333 | 14.492921333333 |
9994932 | 8228.0696491294 | 1.9454009387451 | 489.7725 | 529.6475 | 13.910288227334 | 14.588560216509 |
9994933 | 8403.3447909506 | 1.9049858210628 | 508.8425 | 543.6725 | 13.894525827815 | 14.501691390728 |
9994934 | 8385.7767133493 | 1.9085944995934 | 508.14622963443 | 541.58364604088 | 13.929327516778 | 14.648150335571 |
9994935 | 8330.2474844166 | 1.9258762781886 | 486.82378294054 | 558.22860442849 | 14.191576974565 | 14.858611780455 |
9994936 | 8267.1334929976 | 1.9375925939066 | 488.9325 | 546.98 | 14.161099319728 | 14.669331972789 |
9994937 | 8310.4316285999 | 1.9272640311479 | 494.9175 | 544.84113789716 | 14.038435549525 | 14.781244233378 |
means | 8316.9258529547 | 1.9259273677449 | 496.61250157187 | 544.67698604582 | 14.015248598093 | 14.667944566282 |
medians | 8320.3395565083 | 1.9265701546682 | 492.345 | 544.25681894858 | 13.998367684499 | 14.65874115418 |
confidence_interval | 0.0089671497091839 | 0.0091344216653754 | 0.016839629548937 | 0.012702234285658 | 0.0066472032951229 | 0.0078402754117856 |
min | 8177.5107248928 | 1.8986411307841 | 486.82378294054 | 529.6475 | 13.894525827815 | 14.492921333333 |
max | 8432.8923393013 | 1.9590636485304 | 508.8425 | 558.22860442849 | 14.191576974565 | 14.858611780455 |
stddev | 89.19306714941 | 0.021039470646772 | 10.001474451658 | 8.2743329580123 | 0.11141755278116 | 0.13753537856553 |
Job ID | Global Throughput | Average response time | Min TPS | Max TPS | Pause Time | Total Pause Time |
---|---|---|---|---|---|---|
9994917 | 8635.7070897714 | 1.853296592018 | 524.31 | 556.5375 | 14.183319371728 | 14.806257853403 |
9994918 | 8621.7088487345 | 1.8558855922125 | 531.7725 | 545.95 | 13.98879791395 | 14.501852672751 |
9994919 | 8533.2685029174 | 1.8753910779895 | 522.7925 | 546.08863477841 | 13.958782664942 | 14.470852522639 |
9994920 | 8574.0599079032 | 1.8675608874445 | 513.61871595321 | 564.5675 | 13.962389175258 | 14.472323453608 |
9994921 | 8510.2283664363 | 1.8815516956704 | 505.67 | 550.7275 | 14.622058124174 | 15.321787318362 |
9994922 | 8703.9959200816 | 1.8385274078447 | 526.2775 | 557.65 | 13.938370656371 | 14.615827541828 |
9994923 | 8781.1981281123 | 1.8221623975359 | 535.02 | 559.17860205349 | 14.360284634761 | 14.88064231738 |
9994924 | 8659.6295712997 | 1.8498789700963 | 506.085 | 569.3425 | 14.483114068441 | 14.988921419518 |
means | 8627.474541907 | 1.8555318276015 | 520.69327699415 | 556.25527960399 | 14.187139576203 | 14.757308137436 |
medians | 8628.707969253 | 1.8545910921153 | 523.55125 | 557.09375 | 14.086058642839 | 14.711042697615 |
confidence_interval | 0.008675996428163 | 0.0087775211883997 | 0.017859539791627 | 0.012590065519452 | 0.015918315024142 | 0.017111525799816 |
min | 8510.2283664363 | 1.8221623975359 | 505.67 | 545.95 | 13.938370656371 | 14.470852522639 |
max | 8781.1981281123 | 1.8815516956704 | 535.02 | 569.3425 | 14.622058124174 | 15.321787318362 |
stddev | 89.519345731438 | 0.019478438704893 | 11.121569557209 | 8.37560108854 | 0.27008830852039 | 0.30200193835895 |
SPECjbb2015 on x86. No throughput drop observed this time. (I grabbed the build from a different location this time; the non-source-code version of that build.)
-Xmx3200M -Xms3200M -Xmn1200M
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
---|---|---|---|---|
9994899 | 20202 | 11928 | 27674 | 23095 |
9994900 | 19649 | 11962 | 23392 | 22775 |
9994901 | 20324 | 11498 | 23095 | 21847 |
9994902 | 20479 | 11678 | 27674 | 23095 |
means | 20163.5 | 11766.5 | 25458.75 | 22703 |
medians | 20263 | 11803 | 25533 | 22935 |
confidence_interval | 0.028503988445181 | 0.029647326634529 | 0.16003411426044 | 0.041365280701093 |
min | 19649 | 11498 | 23095 | 21847 |
max | 20479 | 11962 | 27674 | 23095 |
stddev | 361.24460780289 | 219.26163975184 | 2560.8224427581 | 590.26773586229 |
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
---|---|---|---|---|
9995099 | 20913 | 11952 | 24604 | 23716 |
9995100 | 19631 | 11039 | 23095 | 21062 |
9995101 | ||||
9995102 | ||||
means | 20272 | 11495.5 | 23849.5 | 22389 |
medians | 20272 | 11495.5 | 23849.5 | 22389 |
confidence_interval | 0.14229072923525 | 0.17870145527142 | 0.14236234682501 | 0.26671743115415 |
min | 19631 | 11039 | 23095 | 21062 |
max | 20913 | 11952 | 24604 | 23716 |
stddev | 906.51089348115 | 645.58849122332 | 1067.0241328105 | 1876.6613972691 |
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
---|---|---|---|---|
9994908 | 20785 | 12163 | 23095 | 20517 |
9994909 | 20093 | 11586 | 23095 | 22360 |
9994910 | 20785 | 11861 | 23095 | 21847 |
9994911 | 20324 | 12112 | 23095 | 20517 |
means | 20496.75 | 11930.5 | 23095 | 21310.25 |
medians | 20554.5 | 11986.5 | 23095 | 21182 |
confidence_interval | 0.026852923897151 | 0.035325331114737 | 0 | 0.070149805133889 |
min | 20093 | 11586 | 23095 | 20517 |
max | 20785 | 12163 | 23095 | 22360 |
stddev | 345.94448013133 | 264.89557691035 | 0 | 939.60395025422 |
Job ID | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
---|---|---|---|---|
9995108 | 20755 | 12015 | 27674 | 23095 |
9995109 | 21309 | 12721 | 27674 | 23095 |
9995110 | 21016 | 12134 | 23095 | 21847 |
9995111 | 20755 | 12249 | 27674 | 23095 |
means | 20958.75 | 12279.75 | 26529.25 | 22783 |
medians | 20885.5 | 12191.5 | 27674 | 23095 |
confidence_interval | 0.020035367673363 | 0.040072637108334 | 0.13730484276789 | 0.043575648509854 |
min | 20755 | 12015 | 23095 | 21847 |
max | 21309 | 12721 | 27674 | 23095 |
stddev | 263.93228298183 | 309.29099027723 | 2289.5 | 624 |
Judging from the various experiments we tried, the overhead of shift3 on x86 isn't very significant.
There are two approaches that were brought up in the Portable SCC discussion regarding how to deal with the compressed refs shift potentially changing with the heap size.