Pinging @elastic/es-search (Team:Search)
Initial thoughts are that the bounds checks are not being elided, resulting in less inlining and less optimal code.

We need to reproduce this situation in an independent test case / benchmark, preferably using JMH.
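Something like the following might serve as a starting point - a minimal JMH sketch (class and field names are hypothetical, not from this issue) that reads longs from a `MemorySegment` at random offsets, loosely approximating the DirectReader access pattern:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
public class MemorySegmentGetLongBenchmark {

    static final int SEGMENT_SIZE = 1 << 20;

    MemorySegment segment;
    long[] offsets;

    @Setup
    public void setup() {
        // Heap-independent segment, similar in spirit to a mapped index file.
        segment = Arena.ofAuto().allocate(SEGMENT_SIZE);
        // Random offsets so the bounds checks cannot be trivially hoisted.
        offsets = new long[1024];
        Random rnd = new Random(42);
        for (int i = 0; i < offsets.length; i++) {
            offsets[i] = rnd.nextInt(SEGMENT_SIZE - Long.BYTES);
        }
    }

    @Benchmark
    public long readLongs() {
        long sum = 0;
        for (long off : offsets) {
            // The VarHandle-backed access path under investigation.
            sum += segment.get(ValueLayout.JAVA_LONG_UNALIGNED, off);
        }
        return sum;
    }
}
```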
~Bisect shows that the perf regression occurs between jdk-23+30 and jdk-23+31. Next is to check the set of commits between those tags.~
OK, so there is a lot of variability in this benchmark, 0-60% per run, both normally and with JDK 22; this previously confused my testing.
JDK 23 brings a larger amount of variability, ~160-180%. I'm now confident that the regression occurs between jdk-23+24 and jdk-23+25.
```
$ git log --pretty=oneline jdk-23+24...jdk-23+25
...
```
^^^ this is probably a red herring. There were some changes in +25 which resulted in the removal of some varhandle caching, which was later restored in +26.
The compiler output mostly matches what we see in the profiles. MethodHandle::asType
is not being inlined, which results in the method handle not being a constant - this performs more poorly than if it were a constant.
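As a side note, here is a small illustration (a sketch, not code from this issue) of why the handle's constant-ness matters: C2 can only fold a MethodHandle invocation and inline its target when the handle is a compile-time constant, e.g. held in a `static final` field; a handle loaded from a mutable field stays an opaque call, much like the `invokeBasic ... receiver not constant` line in the trace below:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class ConstantMhExample {
    // Constant: C2 treats this as a constant and can inline the target.
    static final MethodHandle CONSTANT_MH;
    // Non-constant: re-loaded on every call, so the target cannot be inlined.
    static MethodHandle mutableMh;

    static {
        try {
            MethodHandle mh = MethodHandles.lookup().findStatic(
                Math.class, "max",
                MethodType.methodType(long.class, long.class, long.class));
            CONSTANT_MH = mh;
            mutableMh = mh;
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static long viaConstant(long a, long b) throws Throwable {
        return (long) CONSTANT_MH.invokeExact(a, b); // fully inlinable
    }

    static long viaMutable(long a, long b) throws Throwable {
        return (long) mutableMh.invokeExact(a, b); // opaque call through the handle
    }
}
```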
JDK 23:
```
@ 16   org.apache.lucene.util.packed.DirectReader$DirectPackedReader40::get (34 bytes)   inline (hot)
  \-> TypeProfile (11846/11846 counts) = org/apache/lucene/util/packed/DirectReader$DirectPackedReader40
  @ 14   org.apache.lucene.store.MemorySegmentIndexInput$SingleSegmentImpl::readLong (31 bytes)   inline (hot)
    \-> TypeProfile (11494/11494 counts) = org/apache/lucene/store/MemorySegmentIndexInput$SingleSegmentImpl
    @ 8   jdk.internal.foreign.AbstractMemorySegmentImpl::get (12 bytes)   force inline by annotation
      \-> TypeProfile (11022/11022 counts) = jdk/internal/foreign/MappedMemorySegmentImpl
      @ 1   jdk.internal.foreign.layout.ValueLayouts$AbstractValueLayout::varHandle (43 bytes)   force inline by annotation
      @ 8   java.lang.invoke.VarHandleGuards::guard_LJ_J (84 bytes)   force inline by annotation
        @ 3   java.lang.invoke.IndirectVarHandle::checkAccessModeThenIsDirect (8 bytes)   force inline by annotation
          @ 2   java.lang.invoke.VarHandle::checkAccessModeThenIsDirect (29 bytes)   force inline by annotation
        @ 59   java.lang.invoke.VarHandle::getMethodHandle (41 bytes)   force inline by annotation
        @ 71   java.lang.invoke.MethodHandle::asType (32 bytes)   failed to inline: already compiled into a big method
        @ 75   java.lang.invoke.IndirectVarHandle::asDirect (5 bytes)   accessor
        @ 80   java.lang.invoke.MethodHandle::invokeBasic(LLJ)J (0 bytes)   failed to inline: receiver not constant
```
JDK 22:

```
@ 16   org.apache.lucene.util.packed.DirectReader$DirectPackedReader40::get (34 bytes)   inline (hot)
  \-> TypeProfile (6998/6998 counts) = org/apache/lucene/util/packed/DirectReader$DirectPackedReader40
  @ 14   org.apache.lucene.store.MemorySegmentIndexInput$SingleSegmentImpl::readLong (31 bytes)   inline (hot)
    \-> TypeProfile (7475/7475 counts) = org/apache/lucene/store/MemorySegmentIndexInput$SingleSegmentImpl
    @ 8   jdk.internal.foreign.AbstractMemorySegmentImpl::get (12 bytes)   force inline by annotation
      \-> TypeProfile (10270/10270 counts) = jdk/internal/foreign/MappedMemorySegmentImpl
      @ 1   jdk.internal.foreign.layout.ValueLayouts$AbstractValueLayout::varHandle (26 bytes)   force inline by annotation
      @ 8   java.lang.invoke.VarHandleGuards::guard_LJ_J (84 bytes)   force inline by annotation
        @ 3   java.lang.invoke.VarHandle::checkAccessModeThenIsDirect (29 bytes)   force inline by annotation
        @ 46   java.lang.invoke.VarForm::getMemberName (38 bytes)   force inline by annotation
        @ 49   java.lang.invoke.VarHandleSegmentAsLongs::get (52 bytes)   force inline by annotation
          @ 14   java.lang.invoke.VarHandleSegmentAsLongs::checkAddress (21 bytes)   force inline by annotation
            @ 1   java.util.Objects::requireNonNull (14 bytes)   force inline by annotation
            @ 15   jdk.internal.foreign.AbstractMemorySegmentImpl::checkAccess (30 bytes)   force inline by annotation
              @ 26   jdk.internal.foreign.AbstractMemorySegmentImpl::checkBounds (54 bytes)   force inline by annotation
                @ 16   jdk.internal.util.Preconditions::checkIndex (22 bytes)   (intrinsic)
          @ 24   jdk.internal.foreign.AbstractMemorySegmentImpl::sessionImpl (5 bytes)   accessor
          @ 29   jdk.internal.foreign.NativeMemorySegmentImpl::unsafeGetBase (2 bytes)   inline (hot)
          @ 40   java.lang.invoke.VarHandleSegmentAsLongs::offsetPlain (39 bytes)   force inline by annotation
            @ 1   jdk.internal.foreign.NativeMemorySegmentImpl::unsafeGetOffset (5 bytes)   accessor
            @ 13   jdk.internal.foreign.NativeMemorySegmentImpl::maxAlignMask (2 bytes)   inline (hot)
          @ 48   jdk.internal.misc.ScopedMemoryAccess::getLongUnaligned (18 bytes)   force inline by annotation
            @ 6   jdk.internal.misc.ScopedMemoryAccess::getLongUnalignedInternal (36 bytes)   force inline by annotation
              @ 5   jdk.internal.foreign.MemorySessionImpl::checkValidStateRaw (33 bytes)   force inline by annotation
              @ 15   jdk.internal.misc.Unsafe::getLongUnaligned (12 bytes)   inline (hot)
                @ 5   jdk.internal.misc.Unsafe::getLongUnaligned (173 bytes)   (intrinsic)
                @ 8   jdk.internal.misc.Unsafe::convEndian (16 bytes)   inline (hot)
              @ 21   java.lang.ref.Reference::reachabilityFence (1 bytes)   force inline by annotation
```
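For anyone reproducing this: inlining traces in this format are HotSpot's standard diagnostic output, which can be enabled with:

```
-XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining
```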
Further discussion on https://mail.openjdk.org/pipermail/panama-dev/2024-September/020643.html
Using a few JIT compiler directives recovers much of the performance lost here in the memory segment access. Specifics to follow once we settle on the actual compiler directives to include.
Below is the output of the nested Rally track with JDK 23 along with the following compiler directives:

```
-XX:CompileCommand=dontinline,java/lang/invoke/MethodHandle.setAsTypeCache
-XX:CompileCommand=dontinline,java/lang/invoke/MethodHandle.asTypeUncached
```
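If we settle on this workaround, one way to apply it to an Elasticsearch node would be a drop-in JVM options file (file name hypothetical; Elasticsearch picks up `config/jvm.options.d/*.options`):

```
# config/jvm.options.d/mh-inline-workaround.options (hypothetical)
# Keep the cold asType cache-miss paths out of the compiled caller, so that
# MethodHandle::asType stays small enough to inline (see the
# "failed to inline: already compiled into a big method" line above).
-XX:CompileCommand=dontinline,java/lang/invoke/MethodHandle.setAsTypeCache
-XX:CompileCommand=dontinline,java/lang/invoke/MethodHandle.asTypeUncached
```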
In summary, all scenarios except the 100th percentile show improvements anywhere from 2-10%, though there can be some variance in this benchmark.
| Metric | Task | Baseline | Contender | Diff | Unit | Diff % |
|--------------------------------------------------------------:|-----------------------------------------------------------:|--------------:|--------------:|----------:|-------:|----------:|
| Cumulative indexing time of primary shards | | 5.0136 | 4.9355 | -0.0781 | min | -1.56% |
| Min cumulative indexing time across primary shard | | 5.0136 | 4.9355 | -0.0781 | min | -1.56% |
| Median cumulative indexing time across primary shard | | 5.0136 | 4.9355 | -0.0781 | min | -1.56% |
| Max cumulative indexing time across primary shard | | 5.0136 | 4.9355 | -0.0781 | min | -1.56% |
| Cumulative indexing throttle time of primary shards | | 0 | 0 | 0 | min | 0.00% |
| Min cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Median cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Max cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Cumulative merge time of primary shards | | 0.424417 | 0.39735 | -0.02707 | min | -6.38% |
| Cumulative merge count of primary shards | | 4 | 4 | 0 | | 0.00% |
| Min cumulative merge time across primary shard | | 0.424417 | 0.39735 | -0.02707 | min | -6.38% |
| Median cumulative merge time across primary shard | | 0.424417 | 0.39735 | -0.02707 | min | -6.38% |
| Max cumulative merge time across primary shard | | 0.424417 | 0.39735 | -0.02707 | min | -6.38% |
| Cumulative merge throttle time of primary shards | | 0.2246 | 0.200933 | -0.02367 | min | -10.54% |
| Min cumulative merge throttle time across primary shard | | 0.2246 | 0.200933 | -0.02367 | min | -10.54% |
| Median cumulative merge throttle time across primary shard | | 0.2246 | 0.200933 | -0.02367 | min | -10.54% |
| Max cumulative merge throttle time across primary shard | | 0.2246 | 0.200933 | -0.02367 | min | -10.54% |
| Cumulative refresh time of primary shards | | 0.2824 | 0.268 | -0.0144 | min | -5.10% |
| Cumulative refresh count of primary shards | | 20 | 20 | 0 | | 0.00% |
| Min cumulative refresh time across primary shard | | 0.2824 | 0.268 | -0.0144 | min | -5.10% |
| Median cumulative refresh time across primary shard | | 0.2824 | 0.268 | -0.0144 | min | -5.10% |
| Max cumulative refresh time across primary shard | | 0.2824 | 0.268 | -0.0144 | min | -5.10% |
| Cumulative flush time of primary shards | | 0.142317 | 0.14775 | 0.00543 | min | +3.82% |
| Cumulative flush count of primary shards | | 2 | 2 | 0 | | 0.00% |
| Min cumulative flush time across primary shard | | 0.142317 | 0.14775 | 0.00543 | min | +3.82% |
| Median cumulative flush time across primary shard | | 0.142317 | 0.14775 | 0.00543 | min | +3.82% |
| Max cumulative flush time across primary shard | | 0.142317 | 0.14775 | 0.00543 | min | +3.82% |
| Total Young Gen GC time | | 2.131 | 2.147 | 0.016 | s | +0.75% |
| Total Young Gen GC count | | 47 | 49 | 2 | | +4.26% |
| Total Old Gen GC time | | 0 | 0 | 0 | s | 0.00% |
| Total Old Gen GC count | | 0 | 0 | 0 | | 0.00% |
| Store size | | 2.84918 | 2.84424 | -0.00495 | GB | -0.17% |
| Translog size | | 5.12227e-08 | 5.12227e-08 | 0 | GB | 0.00% |
| Heap used for segments | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for doc values | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for terms | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for norms | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for points | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for stored fields | | 0 | 0 | 0 | MB | 0.00% |
| Segment count | | 25 | 25 | 0 | | 0.00% |
| Total Ingest Pipeline count | | 0 | 0 | 0 | | 0.00% |
| Total Ingest Pipeline time | | 0 | 0 | 0 | ms | 0.00% |
| Total Ingest Pipeline failed | | 0 | 0 | 0 | | 0.00% |
| error rate | index-append | 0 | 0 | 0 | % | 0.00% |
| Min Throughput | randomized-nested-queries | 19.9367 | 19.9357 | -0.00098 | ops/s | -0.00% |
| Mean Throughput | randomized-nested-queries | 19.9646 | 19.9643 | -0.00034 | ops/s | -0.00% |
| Median Throughput | randomized-nested-queries | 19.9678 | 19.9675 | -0.00025 | ops/s | -0.00% |
| Max Throughput | randomized-nested-queries | 19.9784 | 19.978 | -0.00039 | ops/s | -0.00% |
| 50th percentile latency | randomized-nested-queries | 9.20467 | 8.99144 | -0.21323 | ms | -2.32% |
| 90th percentile latency | randomized-nested-queries | 13.5691 | 12.7468 | -0.82222 | ms | -6.06% |
| 99th percentile latency | randomized-nested-queries | 16.4336 | 15.1094 | -1.3242 | ms | -8.06% |
| 99.9th percentile latency | randomized-nested-queries | 30.5139 | 16.4714 | -14.0425 | ms | -46.02% |
| 100th percentile latency | randomized-nested-queries | 41.8193 | 18.1637 | -23.6556 | ms | -56.57% |
| 50th percentile service time | randomized-nested-queries | 8.29052 | 8.075 | -0.21552 | ms | -2.60% |
| 90th percentile service time | randomized-nested-queries | 12.5849 | 11.6764 | -0.90847 | ms | -7.22% |
| 99th percentile service time | randomized-nested-queries | 15.3765 | 14.0032 | -1.37333 | ms | -8.93% |
| 99.9th percentile service time | randomized-nested-queries | 29.3269 | 15.3982 | -13.9288 | ms | -47.49% |
| 100th percentile service time | randomized-nested-queries | 40.6813 | 16.7806 | -23.9006 | ms | -58.75% |
| error rate | randomized-nested-queries | 0 | 0 | 0 | % | 0.00% |
| Min Throughput | randomized-term-queries | 25.0137 | 25.0137 | 4e-05 | ops/s | 0.00% |
| Mean Throughput | randomized-term-queries | 25.0161 | 25.0159 | -0.00019 | ops/s | -0.00% |
| Median Throughput | randomized-term-queries | 25.0159 | 25.0157 | -0.00014 | ops/s | -0.00% |
| Max Throughput | randomized-term-queries | 25.0188 | 25.0187 | -9e-05 | ops/s | -0.00% |
| 50th percentile latency | randomized-term-queries | 2.65855 | 2.74416 | 0.0856 | ms | +3.22% |
| 90th percentile latency | randomized-term-queries | 3.05292 | 3.19991 | 0.14699 | ms | +4.81% |
| 99th percentile latency | randomized-term-queries | 3.41249 | 3.91761 | 0.50512 | ms | +14.80% |
| 100th percentile latency | randomized-term-queries | 3.85166 | 59.0052 | 55.1536 | ms | +1431.94% |
| 50th percentile service time | randomized-term-queries | 1.75532 | 1.87757 | 0.12226 | ms | +6.96% |
| 90th percentile service time | randomized-term-queries | 1.97695 | 2.02421 | 0.04726 | ms | +2.39% |
| 99th percentile service time | randomized-term-queries | 2.24446 | 2.32631 | 0.08185 | ms | +3.65% |
| 100th percentile service time | randomized-term-queries | 2.75224 | 57.9596 | 55.2074 | ms | +2005.91% |
| error rate | randomized-term-queries | 0 | 0 | 0 | % | 0.00% |
| Min Throughput | randomized-sorted-term-queries | 15.9873 | 15.99 | 0.00271 | ops/s | +0.02% |
| Mean Throughput | randomized-sorted-term-queries | 15.9892 | 15.9915 | 0.00226 | ops/s | +0.01% |
| Median Throughput | randomized-sorted-term-queries | 15.9894 | 15.9915 | 0.00206 | ops/s | +0.01% |
| Max Throughput | randomized-sorted-term-queries | 15.9908 | 15.9927 | 0.00199 | ops/s | +0.01% |
| 50th percentile latency | randomized-sorted-term-queries | 44.0784 | 44.2931 | 0.21477 | ms | +0.49% |
| 90th percentile latency | randomized-sorted-term-queries | 77.8522 | 78.971 | 1.11877 | ms | +1.44% |
| 99th percentile latency | randomized-sorted-term-queries | 104.355 | 111.916 | 7.56151 | ms | +7.25% |
| 100th percentile latency | randomized-sorted-term-queries | 136.932 | 129.952 | -6.97986 | ms | -5.10% |
| 50th percentile service time | randomized-sorted-term-queries | 43.0512 | 42.9996 | -0.05158 | ms | -0.12% |
| 90th percentile service time | randomized-sorted-term-queries | 77.1244 | 78.3077 | 1.18329 | ms | +1.53% |
| 99th percentile service time | randomized-sorted-term-queries | 103.37 | 110.79 | 7.42029 | ms | +7.18% |
| 100th percentile service time | randomized-sorted-term-queries | 135.008 | 129.092 | -5.91573 | ms | -4.38% |
| error rate | randomized-sorted-term-queries | 0 | 0 | 0 | % | 0.00% |
| Min Throughput | match-all | 5.00344 | 5.00339 | -6e-05 | ops/s | -0.00% |
| Mean Throughput | match-all | 5.00404 | 5.00398 | -7e-05 | ops/s | -0.00% |
| Median Throughput | match-all | 5.00401 | 5.00394 | -7e-05 | ops/s | -0.00% |
| Max Throughput | match-all | 5.00479 | 5.00473 | -6e-05 | ops/s | -0.00% |
| 50th percentile latency | match-all | 2.96392 | 2.87872 | -0.08521 | ms | -2.87% |
| 90th percentile latency | match-all | 3.36409 | 3.29008 | -0.07401 | ms | -2.20% |
| 99th percentile latency | match-all | 3.79417 | 3.55432 | -0.23985 | ms | -6.32% |
| 100th percentile latency | match-all | 5.74625 | 5.81256 | 0.06631 | ms | +1.15% |
| 50th percentile service time | match-all | 1.78269 | 1.69125 | -0.09144 | ms | -5.13% |
| 90th percentile service time | match-all | 1.96422 | 1.93127 | -0.03296 | ms | -1.68% |
| 99th percentile service time | match-all | 2.10423 | 2.1777 | 0.07347 | ms | +3.49% |
| 100th percentile service time | match-all | 4.14652 | 4.43299 | 0.28647 | ms | +6.91% |
| error rate | match-all | 0 | 0 | 0 | % | 0.00% |
| Min Throughput | nested-date-histo | 1.00101 | 1.00101 | 0 | ops/s | 0.00% |
| Mean Throughput | nested-date-histo | 1.00167 | 1.00167 | 0 | ops/s | 0.00% |
| Median Throughput | nested-date-histo | 1.00152 | 1.00152 | 0 | ops/s | 0.00% |
| Max Throughput | nested-date-histo | 1.00303 | 1.00303 | -0 | ops/s | -0.00% |
| 50th percentile latency | nested-date-histo | 282.853 | 270.184 | -12.6691 | ms | -4.48% |
| 90th percentile latency | nested-date-histo | 311.829 | 292.716 | -19.1128 | ms | -6.13% |
| 99th percentile latency | nested-date-histo | 336.068 | 314.334 | -21.7337 | ms | -6.47% |
| 100th percentile latency | nested-date-histo | 346.683 | 329.301 | -17.3822 | ms | -5.01% |
| 50th percentile service time | nested-date-histo | 281.092 | 267.599 | -13.4937 | ms | -4.80% |
| 90th percentile service time | nested-date-histo | 310.248 | 290.134 | -20.1146 | ms | -6.48% |
| 99th percentile service time | nested-date-histo | 333.707 | 312.313 | -21.3942 | ms | -6.41% |
| 100th percentile service time | nested-date-histo | 343.8 | 326.279 | -17.5209 | ms | -5.10% |
| error rate | nested-date-histo | 0 | 0 | 0 | % | 0.00% |
| Min Throughput | randomized-nested-queries-with-inner-hits_default | 18.0011 | 18.002 | 0.00093 | ops/s | 0.01% |
| Mean Throughput | randomized-nested-queries-with-inner-hits_default | 18.0018 | 18.0033 | 0.00151 | ops/s | 0.01% |
| Median Throughput | randomized-nested-queries-with-inner-hits_default | 18.0016 | 18.003 | 0.00136 | ops/s | 0.01% |
| Max Throughput | randomized-nested-queries-with-inner-hits_default | 18.0033 | 18.006 | 0.00266 | ops/s | +0.01% |
| 50th percentile latency | randomized-nested-queries-with-inner-hits_default | 9.33778 | 9.16204 | -0.17574 | ms | -1.88% |
| 90th percentile latency | randomized-nested-queries-with-inner-hits_default | 13.779 | 12.7832 | -0.99575 | ms | -7.23% |
| 99th percentile latency | randomized-nested-queries-with-inner-hits_default | 16.5619 | 15.1976 | -1.36429 | ms | -8.24% |
| 99.9th percentile latency | randomized-nested-queries-with-inner-hits_default | 21.0171 | 18.444 | -2.57306 | ms | -12.24% |
| 100th percentile latency | randomized-nested-queries-with-inner-hits_default | 38.5041 | 53.1761 | 14.672 | ms | +38.10% |
| 50th percentile service time | randomized-nested-queries-with-inner-hits_default | 8.1312 | 8.00942 | -0.12179 | ms | -1.50% |
| 90th percentile service time | randomized-nested-queries-with-inner-hits_default | 12.4433 | 11.3684 | -1.07485 | ms | -8.64% |
| 99th percentile service time | randomized-nested-queries-with-inner-hits_default | 14.9342 | 13.8073 | -1.12687 | ms | -7.55% |
| 99.9th percentile service time | randomized-nested-queries-with-inner-hits_default | 20.3188 | 17.5598 | -2.75896 | ms | -13.58% |
| 100th percentile service time | randomized-nested-queries-with-inner-hits_default | 36.7036 | 50.9294 | 14.2258 | ms | +38.76% |
| error rate | randomized-nested-queries-with-inner-hits_default | 0 | 0 | 0 | % | 0.00% |
| Min Throughput | randomized-nested-queries-with-inner-hits_default_big_size | 16.0036 | 16.0036 | -4e-05 | ops/s | -0.00% |
| Mean Throughput | randomized-nested-queries-with-inner-hits_default_big_size | 16.0059 | 16.0059 | 1e-05 | ops/s | 0.00% |
| Median Throughput | randomized-nested-queries-with-inner-hits_default_big_size | 16.0054 | 16.0054 | -3e-05 | ops/s | -0.00% |
| Max Throughput | randomized-nested-queries-with-inner-hits_default_big_size | 16.0107 | 16.0107 | -1e-05 | ops/s | -0.00% |
| 50th percentile latency | randomized-nested-queries-with-inner-hits_default_big_size | 14.6701 | 14.1773 | -0.49279 | ms | -3.36% |
| 90th percentile latency | randomized-nested-queries-with-inner-hits_default_big_size | 18.7551 | 17.6513 | -1.10384 | ms | -5.89% |
| 99th percentile latency | randomized-nested-queries-with-inner-hits_default_big_size | 21.688 | 20.1119 | -1.57607 | ms | -7.27% |
| 99.9th percentile latency | randomized-nested-queries-with-inner-hits_default_big_size | 27.0588 | 24.9349 | -2.12389 | ms | -7.85% |
| 100th percentile latency | randomized-nested-queries-with-inner-hits_default_big_size | 27.8401 | 27.702 | -0.1381 | ms | -0.50% |
| 50th percentile service time | randomized-nested-queries-with-inner-hits_default_big_size | 13.7254 | 13.3343 | -0.39111 | ms | -2.85% |
| 90th percentile service time | randomized-nested-queries-with-inner-hits_default_big_size | 17.7411 | 16.6371 | -1.10401 | ms | -6.22% |
| 99th percentile service time | randomized-nested-queries-with-inner-hits_default_big_size | 20.5643 | 19.1875 | -1.37689 | ms | -6.70% |
| 99.9th percentile service time | randomized-nested-queries-with-inner-hits_default_big_size | 26.3849 | 24.3221 | -2.06277 | ms | -7.82% |
| 100th percentile service time | randomized-nested-queries-with-inner-hits_default_big_size | 26.6912 | 27.2608 | 0.56967 | ms | +2.13% |
| error rate | randomized-nested-queries-with-inner-hits_default_big_size | 0 | 0 | 0 | % | 0.00% |
@ChrisHegarty I'm assuming the 100th percentile has to do with startup. If that's the case, it makes sense that Java 23 has regressed a bit (as more code needs to be executed). We took care of this in Java 24 (by getting rid of a lot of redundant var handle adapters), so I'd expect these 100th percentile numbers to improve there.
Thanks @mcimadamore.
ES version: main / lucene_snapshot
JDK:

```
openjdk version "23" 2024-09-17
OpenJDK Runtime Environment (build 23+37-2369)
OpenJDK 64-Bit Server VM (build 23+37-2369, mixed mode, sharing)
```
OS:
Bug: Lucene memory segment input slows down with JDK 23. This applies both to Lucene 9.11.1 and to the most recent 9.12.0 snapshot. The regression was noticed in nightly benchmarks when we switched temporarily to JDK 23 on September 4th (rolled back on September 6th).
For example, the nightly benchmark for the nested Rally track regressed for nested-date-histo query (src) latency (dashboard).
Wall clock profiles collected when running this query continuously show that ES spends considerably more time in methods such as `org.apache.lucene.store.MemorySegmentIndexInput.readLong()`.

JDK 22: (profile screenshot)

JDK 23: (profile screenshot)

wall-clock-profiles.zip
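For context, the hot path under discussion essentially reduces to an unaligned little-endian long read through the FFM API. A simplified sketch (not Lucene's actual code) of what `MemorySegmentIndexInput.readLong()` boils down to:

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;

// Simplified sketch of the hot read path: an unaligned little-endian long
// read from a (typically mapped) segment. The VarHandle behind
// MemorySegment::get is what fails to stay constant on JDK 23.
final class IndexInputSketch {
    private static final ValueLayout.OfLong LAYOUT_LE_LONG =
        ValueLayout.JAVA_LONG_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN);

    private final MemorySegment segment;
    private long pos;

    IndexInputSketch(MemorySegment segment) {
        this.segment = segment;
    }

    long readLong() {
        long v = segment.get(LAYOUT_LE_LONG, pos); // VarHandle access under the hood
        pos += Long.BYTES;
        return v;
    }
}
```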