Simpler faster (illustration, for fun)

lemire commented 2 months ago

This replaces a 25-minute benchmark with a 19 seconds one, and it drastically improves the performance in some cases. I have not updated the README, and I image you won't want to merge the PR as is, but the performance boost is significant for some cases.

On an Ice Lake server (with 512-bit registers)....

This PR:

Benchmark               (testFile)   Mode  Cnt      Score   Error  Units
BenchJDK.jdk         /twitter.json  thrpt         952.397          ops/s
BenchJDK.scalar      /twitter.json  thrpt        2234.691          ops/s
BenchJDK.vector_128  /twitter.json  thrpt       14630.812          ops/s
BenchJDK.vector_256  /twitter.json  thrpt       17697.810          ops/s
BenchJDK.vector_512  /twitter.json  thrpt       31678.039          ops/s

In GB/s, it is...

jdk                  0.6 GB/s
scalar             1.4 GB/s
vector_128     9 GB/s
vector_256     11 GB/s
vector_512     20 GB/s

These are reasonable results !!!

Previous results (same machine):

Benchmark                          (branchPrediction)              (testFile)   Mode  Cnt       Score        Error  Units
BenchJDK.jdk                                      N/A           /twitter.json  thrpt    5    1730.981 ±     10.969  ops/s
BenchJDK.vector_128                               N/A           /twitter.json  thrpt    5    5312.508 ±     59.222  ops/s
BenchJDK.vector_256                               N/A           /twitter.json  thrpt    5    8186.286 ±     47.844  ops/s
BenchJDK.vector_512                               N/A           /twitter.json  thrpt    5   31273.521 ±    227.835  ops/s

So, in my tests, I triple the performance for vector_128 and double it for vector_256.

An interesting addition in this PR is a scalar validation function, which allows you to see directly the benefits of the SIMD operations.

This PR also retains just one benchmark file, because, again, I want quick benchmarks as I do not like to wait several minutes for the results. Ultimately, one would want the input file to be a parameter but I could not get it to work using the instructions in the README (the files are seemingly never found).

The important point here is that the performance can be made better.

lemire commented 2 months ago

On my ARM-based laptop, the scalar trick does not pay (since it is limited to 128-bit).... but you can still see the massive performance gain compared to scalar with vector_128. In that instance, the JDK 'string creation' trick is faster, likely due to the fact that we have some good optimization in the standard library.

BenchJDK.jdk         /twitter.json  thrpt        3857.591          ops/s
BenchJDK.scalar      /twitter.json  thrpt        1977.185          ops/s
BenchJDK.vector_128  /twitter.json  thrpt       19250.184          ops/s
BenchJDK.vector_256  /twitter.json  thrpt          42.616          ops/s
BenchJDK.vector_512  /twitter.json  thrpt          26.642          ops/s

In terms of GB/s, we have

jdk            2.5 GB/s
vector_128  12 GB/s

Again, fairly reasonable results.

lemire commented 2 months ago

My results are several times better than the ones you report in the README...

My CPUs are not special though they are recent.... still you should be getting better results....

lemire commented 2 months ago

I am going to close this as it was just for fun.

AugustNagro commented 2 months ago

Hey I'd love to merge this MR.. this whole repo is just for fun, at least until the vector API stabilizes.

This replaces a 25-minute benchmark with a 19 seconds one

Yeah totally agree it can be shortened up. My machine requires a bit longer it, but I can explore and adjust that in a separate MR.

An interesting addition in this PR is a scalar validation function, which allows you to see directly the benefits of the SIMD operations.

Awesome, thanks. Those are excellent additions.

drastically improves the performance in some cases

I ran the the new benchmark with

openjdk 22.0.1+8-16
i9-10940X (cascade lake, has slow avx-512)
@Warmup(iterations = 3, time = 5).

And got these results without the code-change in Utf8.java:

Benchmark               (testFile)   Mode  Cnt     Score   Error  Units
BenchJDK.jdk         /twitter.json  thrpt       1541.310          ops/s
BenchJDK.scalar      /twitter.json  thrpt       2519.318          ops/s
BenchJDK.vector_128  /twitter.json  thrpt       5802.873          ops/s
BenchJDK.vector_256  /twitter.json  thrpt       7525.890          ops/s
BenchJDK.vector_512  /twitter.json  thrpt       4861.439          ops/s

With your changes:

Benchmark               (testFile)   Mode  Cnt      Score   Error  Units
BenchJDK.jdk         /twitter.json  thrpt        1519.961          ops/s
BenchJDK.scalar      /twitter.json  thrpt        2704.991          ops/s
BenchJDK.vector_128  /twitter.json  thrpt       15619.757          ops/s
BenchJDK.vector_256  /twitter.json  thrpt       18113.341          ops/s
BenchJDK.vector_512  /twitter.json  thrpt        4739.419          ops/s

Epic! If you re-open I'd be happy to approve & merge.

lemire commented 2 months ago

Re-opened as requested.

AugustNagro commented 2 months ago

Merged; I'll update the readme as well.

I used the OneBranchTooMany benchmark to identify an issue where a branch would cause the vector intrinsics to fail (causing them to box).

https://mail.openjdk.org/pipermail/panama-dev/2020-December/011650.html

https://mail.openjdk.org/pipermail/panama-dev/2020-December/011651.html

It has been fixed in openjdk now, so I'll remove it.

AugustNagro / utf8.java

Simpler faster (illustration, for fun) #9