eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.

7% AcmeAirEE8 throughput regression in 1P mode #19971

Open mpirvu opened 1 month ago

mpirvu commented 1 month ago

I spotted a 7% throughput regression when running AcmeAirEE8 in containers limited to 1 CPU. I narrowed the regression down to the following two nightly builds:

OpenJ9-JDK17-x86-64_linux-20240725-231127.tar.gz

openjdk version "17.0.12-internal" 2024-07-16
OpenJDK Runtime Environment (build 17.0.12-internal+0-adhoc..BuildJDK17x86-64linuxNightly)
Eclipse OpenJ9 VM (build master-9142f7eb060, JRE 17 Linux amd64-64-Bit Compressed References 20240725_796 (JIT enabled, AOT enabled)
OpenJ9   - 9142f7eb060
OMR      - d18121d17c5
JCL      - 2998493e77e based on jdk-17.0.12+7)

and

OpenJ9-JDK17-x86-64_linux-20240726-231341.tar.gz

openjdk version "17.0.12-internal" 2024-07-16
OpenJDK Runtime Environment (build 17.0.12-internal+0-adhoc..BuildJDK17x86-64linuxNightly)
Eclipse OpenJ9 VM (build master-487b005e140, JRE 17 Linux amd64-64-Bit Compressed References 20240726_797 (JIT enabled, AOT enabled)
OpenJ9   - 487b005e140
OMR      - d18121d17c5
JCL      - 2998493e77e based on jdk-17.0.12+7)

Looking at OpenJ9 changes between the two builds: https://github.com/eclipse-openj9/openj9/compare/9142f7eb060...487b005e140

The most likely culprit is https://github.com/eclipse-openj9/openj9/pull/19919 "Treat single CPU as if -Xgc:noConcurrentMark were set". If I run the benchmark in 2 CPU mode, there is no regression between the two builds indicated above, which increases the confidence that the 7% regression is due to PR #19919.

Attn: @zl-wang
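
For context, here is a minimal sketch, with hypothetical names rather than the actual OpenJ9 sources, of the behaviour the PR title describes: on a single target CPU the GC is configured as if -Xgc:noConcurrentMark had been passed.

    #include <cstdio>

    // Hypothetical stand-in for the GC configuration; the real OpenJ9
    // structures and option handling are different.
    struct GcConfig {
        bool concurrentMark = true; // default gencon behaviour
    };

    // Sketch of the heuristic named in the PR title: a single target CPU is
    // treated as if -Xgc:noConcurrentMark had been specified.
    void applySingleCpuHeuristic(GcConfig &cfg, unsigned targetCpuCount) {
        if (targetCpuCount == 1) {
            cfg.concurrentMark = false;
        }
    }

    int main() {
        GcConfig cfg;
        applySingleCpuHeuristic(cfg, 1); // e.g. a container started with a single CPU
        std::printf("concurrentMark=%s\n", cfg.concurrentMark ? "on" : "off");
        return 0;
    }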

zl-wang commented 1 month ago

@vijaysun-omr This looks strange now; #19919 was intuitively not expected to regress performance. My proposal for an alternative implementation to try next: only disable creating the background marking thread, but do not disable allocation taxation (right now we disable both when TARGET_CPU==1).
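
A rough sketch of that alternative, again with hypothetical names: per this issue, a single CPU currently disables both the background marking thread and allocation taxation, while the proposal would drop only the thread so mutators still pay the marking tax.

    #include <cstdio>

    // Hypothetical configuration flags; illustrative only.
    struct ConcurrentGcConfig {
        bool backgroundMarkThread = true; // dedicated concurrent marking thread
        bool allocationTaxation   = true; // mutators do marking work at allocation points
    };

    // Behaviour described in this issue: both are disabled when TARGET_CPU == 1.
    void currentSingleCpuBehaviour(ConcurrentGcConfig &cfg) {
        cfg.backgroundMarkThread = false;
        cfg.allocationTaxation   = false;
    }

    // Proposed alternative: only the background thread is dropped, so concurrent
    // marking still progresses through the allocation tax paid by mutator threads.
    void proposedSingleCpuBehaviour(ConcurrentGcConfig &cfg) {
        cfg.backgroundMarkThread = false;
        // allocationTaxation intentionally left enabled
    }

    int main() {
        ConcurrentGcConfig cfg;
        proposedSingleCpuBehaviour(cfg);
        std::printf("backgroundMarkThread=%d allocationTaxation=%d\n",
                    cfg.backgroundMarkThread, cfg.allocationTaxation);
        return 0;
    }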

vijaysun-omr commented 1 month ago

Thanks @zl-wang. Before we discuss that next change, maybe we should understand why we got a 7% (not small) regression to begin with. It suggests that GC was a significant factor in the original setup, and I'm not sure we can believe it mattered to that extent.

zl-wang commented 1 month ago

@mpirvu Could you gather some profiles to demystify the situation? Did we spend significantly more time in GC when noConcurrentMark is set? It would also be a valid comparison to run with CPU==2 with and without -Xgc:noConcurrentMark.

mpirvu commented 1 month ago

Interestingly, I cannot reproduce this regression outside containers. A profile taken in a container doesn't show significant GC activity:

2024/07/25

  65.73%        399724  [JIT] tid 68854
  26.29%        215196  [kernel.kallsyms]
   2.34%         13507  libj9gc29.so
   1.54%          9561  libc.so.6
   1.52%          8791  libj9jit29.so
   1.41%          8501  libj9vm29.so
   0.27%          2323  [vdso]
   0.20%          1186  libj9trc29.so
   0.19%          1224  libj9thr29.so

2024/07/26

  67.42%        416290  [JIT] tid 65998
  25.29%        209342  [kernel.kallsyms]
   1.93%         11221  libj9gc29.so
   1.49%          8607  libj9jit29.so
   1.45%          9297  libc.so.6
   1.32%          8194  libj9vm29.so
   0.26%          2288  [vdso]
   0.22%          1289  libj9trc29.so
   0.13%          1066  libnio.so
   0.13%           873  libj9thr29.so

vijaysun-omr commented 1 month ago

I wonder if the CPU utilization was comparable across the two runs. I'm trying to come up with a theory for how throughput could regress as you described, given the profile above.

mpirvu commented 1 month ago

I am baffled by the fact that I don't see a regression outside containers. I ran a bigger batch (20 runs) in containers and the regression came out at 6%:

Results for image: acmeairee8:24.0.0.3-J17-20240725 and opts:
Thr stats:        Avg= 4802.4  StdDev=   39.5  Min= 4708.5  Max= 4860.8  Max/Min=    1.0 CI95=    0.4%  samples=20
RSS stats:        Avg=  228.0  StdDev=    2.8  Min=  224.5  Max=  237.8  Max/Min=    1.1 CI95=    0.6%  samples=20
Peak RSS stats:   Avg=  322.2  StdDev=   10.2  Min=  302.4  Max=  338.6  Max/Min=    1.1 CI95=    1.5%  samples=20
CompCPU stats:    Avg=   26.3  StdDev=    0.8  Min=   23.9  Max=   27.4  Max/Min=    1.1 CI95=    1.5%  samples=20
StartTime stats:  Avg= 2383.3  StdDev=   59.5  Min= 2254.0  Max= 2479.0  Max/Min=    1.1 CI95=    1.2%  samples=20

Results for image: acmeairee8:24.0.0.3-J17-20240726 and opts:
Thr stats:        Avg= 4513.3  StdDev=   58.1  Min= 4400.3  Max= 4584.7  Max/Min=    1.0 CI95=    0.6%  samples=20
RSS stats:        Avg=  221.6  StdDev=    1.2  Min=  219.4  Max=  223.3  Max/Min=    1.0 CI95=    0.3%  samples=20
Peak RSS stats:   Avg=  309.3  StdDev=   10.6  Min=  287.8  Max=  334.5  Max/Min=    1.2 CI95=    1.6%  samples=20
CompCPU stats:    Avg=   29.2  StdDev=    0.7  Min=   27.7  Max=   30.5  Max/Min=    1.1 CI95=    1.1%  samples=20
StartTime stats:  Avg= 4608.6  StdDev=   45.0  Min= 4537.0  Max= 4701.0  Max/Min=    1.0 CI95=    0.5%  samples=20

Now that I pay attention to the other stats, StartTime is completely off. I'll check whether there was some mistake in how the container image was built.

mpirvu commented 1 month ago

Verbose logs show that containers in 1P mode cannot use AOT:

#INFO:  AOT header validation failed: incompatible gc write barrier type

When I build the container I don't restrict the number of CPUs, but at runtime I do. In order to have "portableAOT" we cannot perform the change from #19919 in containers.
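
A hedged sketch of why the load fails in this setup, using a hypothetical enum and check rather than the real AOT/SCC validation code: the image is built without a CPU limit, so the AOT code is compiled against the default gencon barrier, while the 1P runtime now selects the reduced barrier, and a strict equality test rejects the stored code.

    #include <cstdio>

    // Hypothetical write-barrier kinds, echoing the names discussed later in
    // this issue; the real values live in the OpenJ9 GC/AOT headers.
    enum class WriteBarrierType {
        CardmarkAndOldcheck, // default gencon: remembered set + card table
        Oldcheck             // no concurrent mark: remembered set only
    };

    // Sketch of a strict AOT header validation: the barrier the stored code was
    // compiled against must match the barrier of the running VM exactly.
    bool aotHeaderCompatible(WriteBarrierType compiledWith, WriteBarrierType runtime) {
        return compiledWith == runtime;
    }

    int main() {
        // Image built without a CPU limit -> AOT compiled for the default barrier.
        WriteBarrierType compiledWith = WriteBarrierType::CardmarkAndOldcheck;
        // Container started with a single CPU -> heuristic selects the reduced barrier.
        WriteBarrierType runtime = WriteBarrierType::Oldcheck;

        if (!aotHeaderCompatible(compiledWith, runtime)) {
            std::puts("AOT header validation failed: incompatible gc write barrier type");
        }
        return 0;
    }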

vijaysun-omr commented 1 month ago

To be completely clear, we cannot perform the change in containers in AOT compiles. We can do it in JIT compiles, unless one is running in InstantOn mode before the checkpoint is generated (where we need to generate portable JIT compiles). We can perform the change in containers in JIT compiles after restore, even in InstantOn mode, if there is only 1P.
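
Restated as a small sketch with hypothetical flags (not the real compilation-control logic): the single-CPU treatment must stay off for AOT compiles and for portable JIT compiles generated before an InstantOn checkpoint, but it can apply to ordinary JIT compiles, including post-restore JIT compiles when only 1 CPU is available.

    #include <cstdio>

    // Hypothetical compilation context; the real OpenJ9 compilation-control
    // logic is considerably more involved.
    struct CompileContext {
        bool isAotCompile;         // compiling relocatable code destined for the SCC
        bool portableJitRequested; // e.g. InstantOn mode before the checkpoint
        unsigned targetCpuCount;   // CPUs visible to the running (or restored) VM
    };

    // Sketch of the rule stated above: apply the single-CPU GC treatment only
    // to non-portable JIT compiles on a single-CPU target.
    bool canUseSingleCpuTreatment(const CompileContext &ctx) {
        if (ctx.isAotCompile) {
            return false; // AOT code must stay portable across CPU counts
        }
        if (ctx.portableJitRequested) {
            return false; // pre-checkpoint InstantOn compiles must also stay portable
        }
        return ctx.targetCpuCount == 1;
    }

    int main() {
        CompileContext postRestoreJit{false, false, 1}; // JIT compile after restore on 1P
        std::printf("single-CPU treatment allowed: %d\n", canUseSingleCpuTreatment(postRestoreJit));
        return 0;
    }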

vijaysun-omr commented 1 month ago

I'm guessing the prior issue around AOT incompatibility is not something that would affect steady state throughput, unless the regression was measured at a time when we were still ramping up (so that having AOT code vs. not having it in a warm run affects the measured throughput).

zl-wang commented 1 month ago

More relevantly: could we skip checking that GC characteristic, since a write barrier generated with allocation taxation enabled should still be compatible in a runtime environment where it is disabled (I think disabling the background marking thread is irrelevant to running inside a container)?

mpirvu commented 1 month ago

Unless the regression was measured at a time when we were still ramping up

The JVM has reached steady state when throughput is measured. I have runs with -Xshareclasses:none in progress.

mpirvu commented 1 month ago

There is no regression if I run in containers in 1P mode but use -Xshareclasses:none. The problem is only seen when AOT cannot be loaded because of the GC barrier type.

zl-wang commented 1 month ago

There is no regression if I run in containers in 1P mode but use -Xshareclasses:none. The problem is only seen when AOT cannot be loaded because of the GC barrier type.

That is good to know. My thinking is/was: the existing AOT barrier type (produced on a multi-CPU config) is still compatible with a single-CPU container environment, so we should skip checking that particular compatibility when we are about to load from the SCC. Let's ask @dmitripivkine to confirm anyway ...
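
A sketch of the relaxed check being proposed, with hypothetical names rather than the real AOT load path: instead of requiring exact equality, AOT code compiled with the full gencon barrier would be accepted in a runtime that only requires the old-check subset (functionally safe, though, as noted below, at the cost of redundant card-table work).

    #include <cstdio>

    // Hypothetical barrier kinds as discussed in this issue.
    enum class WriteBarrierType {
        CardmarkAndOldcheck, // remembered set + card table maintenance
        Oldcheck             // remembered set maintenance only
    };

    // Proposed relaxation: code compiled with the stronger barrier performs a
    // superset of the work the weaker configuration needs, so loading it into
    // an Oldcheck runtime would be functionally safe, if not optimal.
    bool barrierCompatibleForAotLoad(WriteBarrierType compiledWith, WriteBarrierType runtime) {
        if (compiledWith == runtime) {
            return true;
        }
        return compiledWith == WriteBarrierType::CardmarkAndOldcheck
            && runtime == WriteBarrierType::Oldcheck;
    }

    int main() {
        bool ok = barrierCompatibleForAotLoad(WriteBarrierType::CardmarkAndOldcheck,
                                              WriteBarrierType::Oldcheck);
        std::printf("AOT load allowed: %s\n", ok ? "yes" : "no");
        return 0;
    }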

vijaysun-omr commented 1 month ago

Something still doesn't add up for me. Why would the ability (or not) to load AOT code affect steady state throughput to such an extent?

mpirvu commented 1 month ago

I don't understand the connection between noAOT and lower peak throughput.

vijaysun-omr commented 1 month ago

Is it possible that we use profile info (and/or some other SCC hint) when AOT is enabled, but do not do so when AOT is disabled? And that this somehow affects JIT compilations that run at steady state?

vijaysun-omr commented 1 month ago

I feel the change in question ought to be backed out while we understand and fix the startup time and (oddly) steady state throughput regressions, and the interactions with AOT/JIT code portability in various modes. I don't think Marius will have the bandwidth to investigate this in depth, given some other investigations he is part of currently.

zl-wang commented 1 month ago

reverted: https://github.com/eclipse-openj9/openj9/pull/20006

dmitripivkine commented 1 month ago

There is no regression if I run in containers in 1P mode but use -Xshareclasses:none. The problem is only seen when AOT cannot be loaded because of the GC barrier type.

That is good to know. My thinking is/was: the existing AOT barrier type (produced on a multi-CPU config) is still compatible with a single-CPU container environment, so we should skip checking that particular compatibility when we are about to load from the SCC. Let's ask @dmitripivkine to confirm anyway ...

Sorry for the late response. Technically there are different barrier types: in general for -Xgcpolicy:gencon it is j9gc_modron_wrtbar_cardmark_and_oldcheck (a generational check and a concurrent mark check), and for -Xgcpolicy:gencon -Xconcurrentlevel0 it is j9gc_modron_wrtbar_oldcheck (generational check only).

j9gc_modron_wrtbar_oldcheck (Remembered Set support) is obviously a subset of j9gc_modron_wrtbar_cardmark_and_oldcheck (Remembered Set and Card Table support). So I think we can add special handling for the single-CPU no-Concurrent-Mark mode and make it compatible with the general Gencon barrier. However, isn't part of the purpose of this optimization to gain performance through barrier simplification (not the only part, but a significant one)? Using AOT code generated with the general barrier means spending extra time in the barrier uselessly maintaining the Card Table, while also losing the benefits of concurrent GC (even on a single CPU). The alternative of allowing concurrent GC in a single-CPU environment when AOT is used does not look good either: I believe the main target of this optimization is container environments configured with a single CPU, where AOT is used by default, so the optimization could end up pointless.
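
To make the subset relationship concrete, here is a simplified sketch of what each barrier conceptually does on a reference store; this is illustrative only, since the real barriers are generated and inlined by the VM and JIT and are conditional on GC state.

    #include <cstdio>

    // Simplified object model for illustration only.
    struct Object {
        bool isOld; // tenured (old generation) object
    };

    // Hypothetical GC bookkeeping hooks; in the real VM these update the
    // remembered set and the card table.
    void rememberObject(Object *parent) { std::printf("remembered-set add %p\n", (void *)parent); }
    void dirtyCard(Object *parent)      { std::printf("card dirtied for %p\n", (void *)parent); }

    // j9gc_modron_wrtbar_oldcheck, conceptually: generational check only.
    void writeBarrierOldcheck(Object *parent, Object *child) {
        if (parent->isOld && !child->isOld) {
            rememberObject(parent); // an old -> new reference must be remembered
        }
    }

    // j9gc_modron_wrtbar_cardmark_and_oldcheck, conceptually: the same
    // generational check plus card marking, which only benefits concurrent mark.
    void writeBarrierCardmarkAndOldcheck(Object *parent, Object *child) {
        writeBarrierOldcheck(parent, child); // superset of the oldcheck barrier
        if (parent->isOld) {
            dirtyCard(parent); // extra card-table work (further conditional on GC state in the real VM)
        }
    }

    int main() {
        Object tenured{true};
        Object young{false};
        writeBarrierCardmarkAndOldcheck(&tenured, &young);
        return 0;
    }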