AugustNagro / utf8.java

Vectorized UTF-8 Validation for Java

utf8.java

Vectorized UTF-8 validation & benchmarks, written in Java.

Based on the paper by John Keiser and Daniel Lemire, with minor modifications.

Verify Correctness

Make sure to have Java 22 installed. Then execute:

mvn compile assembly:single && \
java --enable-preview --add-modules jdk.incubator.vector \
-jar target/utf8.java-1.0-SNAPSHOT-jar-with-dependencies.jar [optional list of space-delimited file paths]

With no arguments, this will run the UTF-8 validator on 4 source files:

Running Benchmarks

mvn verify && java -jar target/benchmarks.jar

The JMH benchmarks use the same 4 test files mentioned above, at 3 vector lengths: 128, 256, and 512 bits. Most likely your hardware does not support 512-bit vectors, in which case those benchmarks fall back to the slow array-based implementation.

jdk_decode uses the JDK's new String(buf, UTF_8). This constructor produces a new String in addition to validating, but it is good enough for a baseline.
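To make the jdk_decode baseline concrete, here is a minimal sketch of what decoding with the JDK looks like next to a strict JDK-only check. The class and method names (JdkUtf8Baseline, decodeLenient, isValidUtf8) are hypothetical and not part of this repository. Note that new String(buf, UTF_8) does not throw on malformed input; it substitutes U+FFFD, so a strict yes/no answer needs a CharsetDecoder configured to report errors.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of utf8.java.
class JdkUtf8Baseline {

    // Lenient decode: malformed sequences are replaced with U+FFFD
    // and no exception is thrown.
    static String decodeLenient(byte[] buf) {
        return new String(buf, StandardCharsets.UTF_8);
    }

    // Strict check: the decoder is told to report malformed input,
    // so decode() throws CharacterCodingException on invalid UTF-8.
    static boolean isValidUtf8(byte[] buf) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(buf));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] valid = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        // 0xC0 0x80 is an overlong encoding and therefore invalid UTF-8.
        byte[] invalid = {(byte) 0xC0, (byte) 0x80};

        System.out.println(isValidUtf8(valid));
        System.out.println(isValidUtf8(invalid));
        System.out.println(decodeLenient(invalid).contains("\uFFFD"));
    }
}
```

Both paths allocate output characters, which is part of why a JDK decode is a conservative baseline for a validate-only routine.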

Performance

Throughput for twitter.json as of 2024-06-19:

| new String(buf, UTF_8) | Utf8.validate(buf, new LookupTables256()) | simdjson::validate_utf8(str, len) |
| --- | --- | --- |
| 0.96 GB/sec | 11.44 GB/sec | 24 GB/sec (from paper, not recently tested) |
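For readers who want to turn their own JMH results into comparable GB/sec figures, the conversion can be sketched as below. This assumes JMH is run in throughput mode reporting operations per second, that each operation processes the whole input file once, and that 1 GB means 10^9 bytes; none of these assumptions is confirmed by this README. The class name ThroughputMath and the numbers in main are hypothetical.

```java
// Hypothetical helper for illustration; not part of utf8.java.
class ThroughputMath {

    // Converts an ops/sec score into GB/sec, assuming every operation
    // processes bytesPerOp bytes and 1 GB = 1e9 bytes.
    static double gbPerSec(double opsPerSec, long bytesPerOp) {
        return opsPerSec * bytesPerOp / 1e9;
    }

    public static void main(String[] args) {
        // Made-up example values: 2000 ops/sec over a 500,000-byte input.
        double score = gbPerSec(2000.0, 500_000L);
        System.out.println(score + " GB/sec");
    }
}
```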

Conclusion