Vectorized UTF-8 validation & benchmarks, written in Java.
Based on the paper "Validating UTF-8 In Less Than One Instruction Per Byte" by John Keiser and Daniel Lemire, with minor modifications.
Make sure you have Java 22 installed. Then execute:

```
mvn compile assembly:single && \
java --enable-preview --add-modules jdk.incubator.vector \
  -jar target/utf8.java-1.0-SNAPSHOT-jar-with-dependencies.jar [optional list of space-delimited file paths]
```
With no arguments, this will run the UTF-8 validator on 4 source files.

To run the JMH benchmarks:

```
mvn verify && java -jar target/benchmarks.jar
```
The JMH benchmarks use the same 4 test files mentioned above, at 3 vector lengths: 128, 256, and 512 bits. Most likely your hardware does not support 512-bit vectors, in which case those benchmarks fall back to the slow array-based implementation.

`jdk_decode` uses the JDK's `new String(buf, UTF_8)`. This constructor produces a new String in addition to validating, but it is good enough for a baseline.
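For reference, a minimal sketch of what the `jdk_decode` baseline measures, written as a JMH benchmark. The class, field, and file names here are illustrative, not necessarily what this repo uses:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class JdkDecodeSketch {
    byte[] buf;

    @Setup
    public void setup() throws Exception {
        buf = Files.readAllBytes(Path.of("twitter.json"));
    }

    @Benchmark
    public String jdk_decode() {
        // Decodes (replacing malformed sequences) in addition to validating,
        // so it does strictly more work than a pure validator.
        return new String(buf, StandardCharsets.UTF_8);
    }
}
```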
Throughput for `twitter.json` as of 2024-06-19:

| `new String(buf, UTF_8)` | `Utf8.validate(buf, new LookupTables256())` | `simdjson::validate_utf8(str, len)` |
|---|---|---|
| 0.96 GB/sec | 11.44 GB/sec | 24 GB/sec (from paper, not recently tested) |
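A minimal usage sketch of the vectorized entry point shown in the table. The boolean return type and the surrounding details are assumptions, not this repo's confirmed API:

```java
import java.nio.file.Files;
import java.nio.file.Path;

class ValidateExample {
    public static void main(String[] args) throws Exception {
        byte[] buf = Files.readAllBytes(Path.of("twitter.json"));
        // Assumes Utf8.validate returns a boolean; the actual
        // signature may differ (e.g. it might throw instead).
        boolean valid = Utf8.validate(buf, new LookupTables256());
        System.out.println(valid ? "valid UTF-8" : "invalid UTF-8");
    }
}
```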
The JDK algorithm is heavily optimized: it uses intrinsics both to check for negative bytes (the ASCII shortcut) and to elide array bounds checks.
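The ASCII shortcut rests on a simple fact: a byte with the high bit clear (i.e. non-negative in Java's signed bytes) is plain ASCII, and a buffer of only such bytes is automatically valid UTF-8. The JDK performs this scan with an intrinsified helper (`StringCoding.countPositives` in recent JDKs); the plain loop below is only an illustration of the idea, not the JDK's code:

```java
final class AsciiShortcutSketch {
    // Returns true if buf is pure ASCII, in which case no further
    // UTF-8 validation is needed.
    static boolean isAscii(byte[] buf) {
        for (byte b : buf) {
            if (b < 0) return false; // high bit set => multi-byte sequence somewhere
        }
        return true;
    }
}
```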
In the vectorized algorithm, 256-bit vectors currently perform best. We cannot go smaller than 128 bits, since nibbles (4 bits) are used to select from the lookup tables: a nibble indexes one of 16 table entries, and 128 bits is exactly the 16 byte lanes needed to hold such a table, as sketched below.
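An illustrative sketch (not this repo's actual code) of the nibble-indexed table lookup using the Vector API, showing why 128 bits is the floor:

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

final class NibbleLookupSketch {
    // 128-bit species = 16 byte lanes, matching the 16 entries a nibble can index.
    static final VectorSpecies<Byte> S128 = ByteVector.SPECIES_128;

    // For each of 16 input bytes, look up table16[b & 0x0F] in a single shuffle.
    static ByteVector classifyLowNibbles(byte[] input, byte[] table16) {
        ByteVector chunk = ByteVector.fromArray(S128, input, 0);
        ByteVector table = ByteVector.fromArray(S128, table16, 0);
        ByteVector nibbles = chunk.and((byte) 0x0F); // lane values 0..15: valid shuffle indexes
        return nibbles.selectFrom(table);            // result[i] = table16[nibbles[i]]
    }
}
```

A smaller species would have fewer than 16 lanes, so the 16-entry table could no longer live in one vector and the single-shuffle lookup would break down.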