Vectorized UTF-8 validation & benchmarks, written in Java.
Based on the paper "Validating UTF-8 In Less Than One Instruction Per Byte" by John Keiser and Daniel Lemire, with minor modifications.
Make sure you have Java 22 installed. Then execute:

```
mvn compile assembly:single && \
java --enable-preview --add-modules jdk.incubator.vector \
  -jar target/utf8.java-1.0-SNAPSHOT-jar-with-dependencies.jar [optional list of space-delimited file paths]
```
With no arguments, this will run the UTF-8 validator on 4 source files.

To run the JMH benchmarks:

```
mvn verify && java -jar target/benchmarks.jar
```
The JMH benchmarks use the same 4 test files mentioned above, at 3 vector lengths: 128, 256, and 512 bits. Most likely your hardware does not support 512-bit vectors, in which case those benchmarks fall back to the slow array-based implementation.

`jdk_decode` uses the JDK's `new String(buf, UTF_8)`. This constructor produces a new String in addition to validating, but it is good enough for a baseline.
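For reference, a minimal sketch of what the `jdk_decode` baseline measures, written as a JMH benchmark. The class, field, and file names here are illustrative, not necessarily what this repo uses:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class JdkDecodeSketch {
    byte[] buf;

    @Setup
    public void setup() throws Exception {
        buf = Files.readAllBytes(Path.of("twitter.json"));
    }

    @Benchmark
    public String jdk_decode() {
        // Decodes (replacing malformed sequences) in addition to validating,
        // so it does strictly more work than a pure validator.
        return new String(buf, StandardCharsets.UTF_8);
    }
}
```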
Throughput for `twitter.json` as of 2024-06-19:

| `new String(buf, UTF_8)` | `Utf8.validate(buf, new LookupTables256())` | `simdjson::validate_utf8(str, len)` |
|---|---|---|
| 0.96 GB/sec | 11.44 GB/sec | 24 GB/sec (from paper, not recently tested) |
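A minimal usage sketch of the vectorized entry point shown in the table. The boolean return type and the surrounding details are assumptions, not this repo's confirmed API:

```java
import java.nio.file.Files;
import java.nio.file.Path;

class ValidateExample {
    public static void main(String[] args) throws Exception {
        byte[] buf = Files.readAllBytes(Path.of("twitter.json"));
        // Assumes Utf8.validate returns a boolean; the actual
        // signature may differ (e.g. it might throw instead).
        boolean valid = Utf8.validate(buf, new LookupTables256());
        System.out.println(valid ? "valid UTF-8" : "invalid UTF-8");
    }
}
```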
The JDK algorithm is heavily optimized: it uses intrinsics both to check for negative bytes (the ASCII shortcut) and to elide array bounds checks.
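The ASCII shortcut rests on a simple fact: a byte with the high bit clear (i.e. non-negative in Java's signed bytes) is plain ASCII, and a buffer of only such bytes is automatically valid UTF-8. The JDK performs this scan with an intrinsified helper (`StringCoding.countPositives` in recent JDKs); the plain loop below is only an illustration of the idea, not the JDK's code:

```java
final class AsciiShortcutSketch {
    // Returns true if buf is pure ASCII, in which case no further
    // UTF-8 validation is needed.
    static boolean isAscii(byte[] buf) {
        for (byte b : buf) {
            if (b < 0) return false; // high bit set => multi-byte sequence somewhere
        }
        return true;
    }
}
```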
In the vectorized algorithm, 256-bit vectors currently perform best. We cannot go smaller than 128 bits, since nibbles (4 bits) are used to select from the lookup tables: a nibble indexes one of 16 table entries, and 128 bits is exactly the 16 byte lanes needed to hold such a table, as sketched below.
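An illustrative sketch (not this repo's actual code) of the nibble-indexed table lookup using the Vector API, showing why 128 bits is the floor:

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

final class NibbleLookupSketch {
    // 128-bit species = 16 byte lanes, matching the 16 entries a nibble can index.
    static final VectorSpecies<Byte> S128 = ByteVector.SPECIES_128;

    // For each of 16 input bytes, look up table16[b & 0x0F] in a single shuffle.
    static ByteVector classifyLowNibbles(byte[] input, byte[] table16) {
        ByteVector chunk = ByteVector.fromArray(S128, input, 0);
        ByteVector table = ByteVector.fromArray(S128, table16, 0);
        ByteVector nibbles = chunk.and((byte) 0x0F); // lane values 0..15: valid shuffle indexes
        return nibbles.selectFrom(table);            // result[i] = table16[nibbles[i]]
    }
}
```

A smaller species would have fewer than 16 lanes, so the 16-entry table could no longer live in one vector and the single-shuffle lookup would break down.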