Closed lemire closed 2 months ago
On my ARM-based laptop, the scalar trick does not pay off (since it is limited to 128-bit registers)... but you can still see the massive performance gain of vector_128 compared to scalar. In that instance, the JDK 'string creation' trick is faster, likely because the standard library has some good optimizations.
Benchmark (testFile) Mode Cnt Score Error Units
BenchJDK.jdk /twitter.json thrpt 3857.591 ops/s
BenchJDK.scalar /twitter.json thrpt 1977.185 ops/s
BenchJDK.vector_128 /twitter.json thrpt 19250.184 ops/s
BenchJDK.vector_256 /twitter.json thrpt 42.616 ops/s
BenchJDK.vector_512 /twitter.json thrpt 26.642 ops/s
In terms of GB/s, we have
jdk 2.5 GB/s
vector_128 12 GB/s
Again, fairly reasonable results.
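For context, the JDK 'string creation' trick mentioned above can be approximated like this (a minimal sketch of my own; the class and method names are mine, not the repo's). The String constructor decodes leniently, replacing malformed sequences with U+FFFD, so only valid UTF-8 round-trips unchanged:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class JdkUtf8Check {
    // Decode then re-encode: the String(byte[], Charset) constructor replaces
    // malformed input with U+FFFD, so a round trip that differs from the
    // original bytes means the input was not valid UTF-8.
    static boolean isValidUtf8(byte[] input) {
        byte[] roundTrip = new String(input, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(input, roundTrip);
    }

    public static void main(String[] args) {
        // valid UTF-8 ("héllo")
        System.out.println(isValidUtf8("h\u00e9llo".getBytes(StandardCharsets.UTF_8))); // true
        // overlong encoding of NUL, forbidden by UTF-8
        System.out.println(isValidUtf8(new byte[] {(byte) 0xC0, (byte) 0x80})); // false
    }
}
```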
My results are several times better than the ones you report in the README... My CPUs are not special, though they are recent, so you should be getting better results too.
I am going to close this, as it was just for fun.
Hey, I'd love to merge this MR. This whole repo is just for fun, at least until the Vector API stabilizes.
This replaces a 25-minute benchmark with a 19-second one
Yeah, totally agree it can be shortened. My machine requires a bit longer, but I can explore and adjust that in a separate MR.
An interesting addition in this PR is a scalar validation function, which allows you to see directly the benefits of the SIMD operations.
Awesome, thanks. Those are excellent additions.
drastically improves the performance in some cases
I ran the new benchmark with
openjdk 22.0.1+8-16
@Warmup(iterations = 3, time = 5)
and got these results without the code change in Utf8.java:
Benchmark (testFile) Mode Cnt Score Error Units
BenchJDK.jdk /twitter.json thrpt 1541.310 ops/s
BenchJDK.scalar /twitter.json thrpt 2519.318 ops/s
BenchJDK.vector_128 /twitter.json thrpt 5802.873 ops/s
BenchJDK.vector_256 /twitter.json thrpt 7525.890 ops/s
BenchJDK.vector_512 /twitter.json thrpt 4861.439 ops/s
With your changes:
Benchmark (testFile) Mode Cnt Score Error Units
BenchJDK.jdk /twitter.json thrpt 1519.961 ops/s
BenchJDK.scalar /twitter.json thrpt 2704.991 ops/s
BenchJDK.vector_128 /twitter.json thrpt 15619.757 ops/s
BenchJDK.vector_256 /twitter.json thrpt 18113.341 ops/s
BenchJDK.vector_512 /twitter.json thrpt 4739.419 ops/s
Epic! If you re-open I'd be happy to approve & merge.
Re-opened as requested.
Merged; I'll update the readme as well.
I used the OneBranchTooMany benchmark to identify an issue where a branch would cause the vector intrinsics to fail (causing them to box).
https://mail.openjdk.org/pipermail/panama-dev/2020-December/011650.html
https://mail.openjdk.org/pipermail/panama-dev/2020-December/011651.html
It has been fixed in openjdk now, so I'll remove it.
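For illustration, the kind of data-dependent branch that used to trip the intrinsics looks roughly like this (a hypothetical sketch of my own, not the actual OneBranchTooMany code; requires --add-modules jdk.incubator.vector):

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

public class BranchyVsBranchless {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128;

    // Branchy variant: a data-dependent branch inside the hot loop could
    // defeat the vector intrinsics, causing the vectors to be boxed into
    // heap objects instead of staying in registers.
    static int countNonAsciiBranchy(byte[] data) {
        int count = 0;
        int i = 0;
        for (; i <= data.length - SPECIES.length(); i += SPECIES.length()) {
            ByteVector v = ByteVector.fromArray(SPECIES, data, i);
            VectorMask<Byte> m = v.lt((byte) 0);
            if (m.anyTrue()) {        // the offending branch
                count += m.trueCount();
            }
        }
        for (; i < data.length; i++) if (data[i] < 0) count++;
        return count;
    }

    // Branchless variant: accumulate unconditionally; trueCount() is 0 when
    // no lane matches, so the branch adds nothing and can simply be dropped.
    static int countNonAsciiBranchless(byte[] data) {
        int count = 0;
        int i = 0;
        for (; i <= data.length - SPECIES.length(); i += SPECIES.length()) {
            ByteVector v = ByteVector.fromArray(SPECIES, data, i);
            count += v.lt((byte) 0).trueCount();
        }
        for (; i < data.length; i++) if (data[i] < 0) count++;
        return count;
    }

    public static void main(String[] args) {
        byte[] data = {1, (byte) 0x80, 2, (byte) 0xFF};
        System.out.println(countNonAsciiBranchy(data));     // 2
        System.out.println(countNonAsciiBranchless(data));  // 2
    }
}
```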
This replaces a 25-minute benchmark with a 19-second one, and it drastically improves the performance in some cases. I have not updated the README, and I imagine you won't want to merge the PR as is, but the performance boost is significant for some cases.
On an Ice Lake server (with 512-bit registers)....
This PR:
In GB/s, it is...
These are reasonable results!
Previous results (same machine):
So, in my tests, I triple the performance for vector_128 and double it for vector_256.
An interesting addition in this PR is a scalar validation function, which allows you to see directly the benefits of the SIMD operations.
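A scalar validator of that kind can be sketched as follows (my own minimal sketch, not the PR's actual code; it checks lead bytes, continuations, overlong forms, surrogates, and the U+10FFFF upper bound):

```java
public class ScalarUtf8 {
    // Plain branchy scalar UTF-8 validation: one code point at a time.
    static boolean validate(byte[] b) {
        int i = 0, n = b.length;
        while (i < n) {
            int c = b[i] & 0xFF;
            if (c < 0x80) { i++; continue; }          // ASCII fast path
            int len, min, cp;
            if ((c & 0xE0) == 0xC0)      { len = 2; min = 0x80;    cp = c & 0x1F; }
            else if ((c & 0xF0) == 0xE0) { len = 3; min = 0x800;   cp = c & 0x0F; }
            else if ((c & 0xF8) == 0xF0) { len = 4; min = 0x10000; cp = c & 0x07; }
            else return false;                        // stray continuation / bad lead
            if (i + len > n) return false;            // truncated sequence
            for (int j = 1; j < len; j++) {
                int cc = b[i + j] & 0xFF;
                if ((cc & 0xC0) != 0x80) return false; // bad continuation byte
                cp = (cp << 6) | (cc & 0x3F);
            }
            if (cp < min) return false;               // overlong encoding
            if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
            if (cp > 0x10FFFF) return false;          // beyond Unicode range
            i += len;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(validate(
            "h\u00e9llo".getBytes(java.nio.charset.StandardCharsets.UTF_8))); // true
        System.out.println(validate(
            new byte[] {(byte) 0xED, (byte) 0xA0, (byte) 0x80}));             // false: surrogate
    }
}
```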
This PR also retains just one benchmark file, because, again, I want quick benchmarks, as I do not like to wait several minutes for the results. Ultimately, one would want the input file to be a parameter, but I could not get it to work using the instructions in the README (the files are seemingly never found).
The important point here is that the performance can be made better.
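For reference, the parameterized-file setup I had in mind would look roughly like this (a hypothetical JMH configuration sketch; the class, field, and validator names are illustrative, not the repo's actual code). Loading the file from the classpath, rather than the working directory, may be why the files are "never found":

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@Warmup(iterations = 3, time = 5)
@Measurement(iterations = 3, time = 5)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public class BenchJDK {
    // Extra files can then be supplied on the command line,
    // e.g.: java -jar benchmarks.jar -p testFile=/other.json
    @Param({"/twitter.json"})
    String testFile;

    byte[] input;

    @Setup
    public void setup() throws java.io.IOException {
        // Resolve against the classpath so the file is found regardless
        // of the current working directory.
        try (java.io.InputStream in = BenchJDK.class.getResourceAsStream(testFile)) {
            input = in.readAllBytes();
        }
    }

    @Benchmark
    public boolean scalar() {
        return Utf8.validateScalar(input); // hypothetical validator entry point
    }
}
```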