mpirvu opened 1 month ago
@vijaysun-omr this looks strange now. #19919 was intuitively expected not to regress performance. My proposal for the alternative implementation to try next: only disable creating the background marking thread, but don't disable the allocation taxation (right now we disable both when TARGET_CPU==1).
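A minimal sketch of what that alternative could look like, with illustrative names only (this is not the actual OpenJ9 GC initialization code):

```cpp
#include <cstdint>

// Hypothetical sketch: names and types are illustrative, not OpenJ9's own.
struct ConcurrentGCConfig {
    bool allocationTaxationEnabled;   // mutators pay incremental marking work at allocation points
    bool backgroundMarkThreadEnabled; // dedicated concurrent mark background thread
};

// Proposed alternative to #19919: on a single-CPU target, still skip creating
// the background marking thread (it would only steal cycles from the mutator),
// but keep allocation taxation so concurrent marking can make progress.
ConcurrentGCConfig configureConcurrentMark(uint32_t targetCPUCount)
{
    const bool singleCPU = (targetCPUCount == 1);
    ConcurrentGCConfig cfg;
    cfg.backgroundMarkThreadEnabled = !singleCPU; // still disabled at 1 CPU
    cfg.allocationTaxationEnabled = true;         // no longer disabled at 1 CPU
    return cfg;
}
```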
Thanks @zl-wang. Before we discuss that next change, maybe we should understand why we even got a 7% (not small) regression to begin with. It suggests that GC was a significant factor in the original setup, and I'm unclear whether we can believe it was a factor to that extent.
@mpirvu could you gather some profiles to demystify the situation? Did we spend significantly more time in GC when noConcurrentMark is set? It is likely also a valid scenario to compare CPU==2 with and without -Xgc:noConcurrentMark.
Interestingly, I cannot reproduce this regression outside containers. Profiles taken inside containers don't show noticeable GC activity:
```
2024/07/25
 65.73%  399724  [JIT] tid 68854
 26.29%  215196  [kernel.kallsyms]
  2.34%   13507  libj9gc29.so
  1.54%    9561  libc.so.6
  1.52%    8791  libj9jit29.so
  1.41%    8501  libj9vm29.so
  0.27%    2323  [vdso]
  0.20%    1186  libj9trc29.so
  0.19%    1224  libj9thr29.so

2024/07/26
 67.42%  416290  [JIT] tid 65998
 25.29%  209342  [kernel.kallsyms]
  1.93%   11221  libj9gc29.so
  1.49%    8607  libj9jit29.so
  1.45%    9297  libc.so.6
  1.32%    8194  libj9vm29.so
  0.26%    2288  [vdso]
  0.22%    1289  libj9trc29.so
  0.13%    1066  libnio.so
  0.13%     873  libj9thr29.so
```
I wonder whether the CPU utilization was comparable across the two runs. I'm trying to come up with a theory for how throughput could regress as you mentioned, given the profiles above.
I am baffled by the fact that I don't see a regression outside containers. I ran a bigger batch (20 runs) in containers and the regression came out at 6%:
```
Results for image: acmeairee8:24.0.0.3-J17-20240725 and opts:
Thr stats:       Avg= 4802.4  StdDev= 39.5  Min= 4708.5  Max= 4860.8  Max/Min= 1.0  CI95= 0.4%  samples=20
RSS stats:       Avg=  228.0  StdDev=  2.8  Min=  224.5  Max=  237.8  Max/Min= 1.1  CI95= 0.6%  samples=20
Peak RSS stats:  Avg=  322.2  StdDev= 10.2  Min=  302.4  Max=  338.6  Max/Min= 1.1  CI95= 1.5%  samples=20
CompCPU stats:   Avg=   26.3  StdDev=  0.8  Min=   23.9  Max=   27.4  Max/Min= 1.1  CI95= 1.5%  samples=20
StartTime stats: Avg= 2383.3  StdDev= 59.5  Min= 2254.0  Max= 2479.0  Max/Min= 1.1  CI95= 1.2%  samples=20

Results for image: acmeairee8:24.0.0.3-J17-20240726 and opts:
Thr stats:       Avg= 4513.3  StdDev= 58.1  Min= 4400.3  Max= 4584.7  Max/Min= 1.0  CI95= 0.6%  samples=20
RSS stats:       Avg=  221.6  StdDev=  1.2  Min=  219.4  Max=  223.3  Max/Min= 1.0  CI95= 0.3%  samples=20
Peak RSS stats:  Avg=  309.3  StdDev= 10.6  Min=  287.8  Max=  334.5  Max/Min= 1.2  CI95= 1.6%  samples=20
CompCPU stats:   Avg=   29.2  StdDev=  0.7  Min=   27.7  Max=   30.5  Max/Min= 1.1  CI95= 1.1%  samples=20
StartTime stats: Avg= 4608.6  StdDev= 45.0  Min= 4537.0  Max= 4701.0  Max/Min= 1.0  CI95= 0.5%  samples=20
```
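For reference, the throughput delta between the two batches works out to (4802.4 − 4513.3) / 4802.4 ≈ 6.0%, and CompCPU is up about 11% (26.3 → 29.2).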
Now that I pay attention to the other stats, StartTime is completely off. I'll look into whether there was some mistake in how the container image was built.
Verbose logs show that containers in 1P mode cannot use AOT:

```
#INFO: AOT header validation failed: incompatible gc write barrier type
```
When I build the container image I don't restrict the number of CPUs, but at runtime I do. In order to have "portable AOT", we cannot perform the change from #19919 in containers.
To be completely clear, we cannot perform the change in containers in AOT compiles. We can do it in JIT compiles, unless one is running in InstantOn mode before the checkpoint is generated (where we need to generate portable JIT compiles). We can perform the change in containers in JIT compiles after restore, even in InstantOn mode, if there is only 1P.
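A small sketch of that decision rule as I read it, with hypothetical names and fields (not actual OpenJ9 compiler-control code):

```cpp
#include <cstdint>

// Illustrative sketch only; the names do not correspond to OpenJ9 internals.
enum class CompileKind { AOT, JIT };

struct RuntimeState {
    bool instantOnMode;    // InstantOn (checkpoint/restore) in use
    bool beforeCheckpoint; // still pre-checkpoint, so JIT code must stay portable
    uint32_t targetCPUCount;
};

// May this compilation apply the single-CPU noConcurrentMark barrier change?
bool mayApplySingleCPUBarrierChange(CompileKind kind, const RuntimeState &rt)
{
    if (rt.targetCPUCount != 1)
        return false; // the change only applies when running on 1 CPU
    if (kind == CompileKind::AOT)
        return false; // AOT code must remain portable across CPU configs
    if (rt.instantOnMode && rt.beforeCheckpoint)
        return false; // portable JIT code is needed before the checkpoint
    return true;      // plain JIT, or InstantOn after restore with 1 CPU
}
```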
I'm guessing the prior issue around AOT incompatibility is not something that would affect steady-state throughput, unless the regression was measured while we were still ramping up (in which case having AOT code versus not having it in a warm run would affect the measured throughput).
More relevantly: could we skip checking that GC characteristic, since a write barrier generated with allocation taxation enabled should still be compatible with a running environment where it is disabled? (I think disabling the background marking thread is irrelevant to running inside a container.)
> unless the regression was measured while we were still ramping up
The JVM has reached steady state when throughput is measured. I have runs with -Xshareclasses:none in progress.
There is no regression if I run in containers in 1P mode but use -Xshareclasses:none. The problem is only seen when AOT cannot be loaded because of the GC barrier type.
> There is no regression if I run in containers in 1P mode but use -Xshareclasses:none. The problem is only seen when AOT cannot be loaded because of the GC barrier type.
That is good to know. My thinking is/was: the existing AOT barrier type (produced on a multi-CPU config) is still compatible with a single-CPU container environment, so we should skip checking that particular compatibility when we are about to load from the SCC. Let's ask @dmitripivkine to confirm anyway...
Something still doesn't add up for me. Why would the ability to load AOT code or not affect steady-state throughput to such an extent?
I don't understand the connection between noAOT and lower peak throughput.
Is it possible that we use profile info (and/or some other SCC hint) when AOT is enabled, but do not do so when AOT is disabled? And that this somehow affects the JIT compilations that run at steady state?
I feel the change in question ought to be backed out while we understand and fix the startup time, the (oddly) regressed steady-state throughput, and the interactions with AOT/JIT code portability in the various modes. I don't think Marius will have the bandwidth to investigate this in depth, given the other investigations he is part of currently.
> There is no regression if I run in containers in 1P mode but use -Xshareclasses:none. The problem is only seen when AOT cannot be loaded because of the GC barrier type.

> That is good to know. My thinking is/was: the existing AOT barrier type (produced on a multi-CPU config) is still compatible with a single-CPU container environment, so we should skip checking that particular compatibility when we are about to load from the SCC. Let's ask @dmitripivkine to confirm anyway...
Sorry for the late response.
Technically there are different barrier types: for -Xgcpolicy:gencon the general one is j9gc_modron_wrtbar_cardmark_and_oldcheck (a generational check plus a concurrent mark check), while for -Xgcpolicy:gencon -Xconcurrentlevel0 it is j9gc_modron_wrtbar_oldcheck (generational check only). Obviously j9gc_modron_wrtbar_oldcheck (Remembered Set support) is a subset of j9gc_modron_wrtbar_cardmark_and_oldcheck (Remembered Set and Card Table support). So I think we can have special handling for the single-CPU noConcurrentMark mode and make it compatible with the general Gencon barrier. However, isn't the purpose of this optimization to gain performance through barrier simplification (yes, not only that, but it is a significant part)? Using AOT code generated with the general barrier means spending extra time in the barrier uselessly maintaining the Card Table, while losing the benefits of concurrent GC (even on a single CPU) at the same time. The alternative of allowing concurrent GC in a single-CPU environment whenever AOT is used does not look good either: I believe the main target for this optimization is container environments configured with a single CPU, where AOT is used by default, so the optimization could end up pointless.
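For illustration, a hedged sketch of the one-directional compatibility relaxation this would imply; the enum mirrors the barrier types named above, but the function and its placement in AOT header validation are hypothetical:

```cpp
// Illustrative sketch only: not the actual OpenJ9 AOT header validation code.
enum class WriteBarrierType {
    OldCheck,            // j9gc_modron_wrtbar_oldcheck: Remembered Set only
    CardmarkAndOldCheck  // j9gc_modron_wrtbar_cardmark_and_oldcheck: RS + Card Table
};

// One-directional relaxation: code compiled with the stronger
// cardmark+oldcheck barrier stays correct in a runtime that needs only the
// oldcheck barrier; it merely maintains the Card Table needlessly. The
// reverse is NOT safe, because oldcheck-only code would skip card marking
// that a concurrent-mark runtime relies on.
bool aotBarrierCompatible(WriteBarrierType compiledWith, WriteBarrierType runtimeNeeds)
{
    if (compiledWith == runtimeNeeds)
        return true;
    return compiledWith == WriteBarrierType::CardmarkAndOldCheck
        && runtimeNeeds == WriteBarrierType::OldCheck;
}
```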
I spotted a 7% throughput regression when running AcmeAirEE8 in containers limited to 1 CPU. I narrowed the regression down to the following two nightly builds:
OpenJ9-JDK17-x86-64_linux-20240725-231127.tar.gz
and
OpenJ9-JDK17-x86-64_linux-20240726-231341.tar.gz
Looking at the OpenJ9 changes between the two builds:
https://github.com/eclipse-openj9/openj9/compare/9142f7eb060...487b005e140
the most likely culprit is https://github.com/eclipse-openj9/openj9/pull/19919 "Treat single CPU as if -Xgc:noConcurrentMark were set". If I run the benchmark in 2-CPU mode, there is no regression between the two builds indicated above, which increases the confidence that the 7% regression is due to PR #19919. Attn: @zl-wang