On my server (an AVX-512-capable Ice Lake machine), the 512-bit routine reaches 18 GB/s on twitter.json. That is much slower than the C code, which reaches 70 GB/s on the same file and the same hardware, but 18 GB/s is still quite nice.
I recommend making the benchmarks faster to run, for quality of life. They currently take 25 minutes on my server. A single warm-up run and a single measured run are enough. That change would speed up the benchmarks by a factor of five, which would still be far too long (5 minutes).
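To make the suggestion concrete, here is a minimal sketch of a measurement loop with one warm-up run and one measured run. It is written in Python purely for illustration and is not the project's benchmark harness; `measure_gbps`, `routine`, `warmup_runs`, and `timed_runs` are hypothetical names.

```python
import time


def measure_gbps(routine, data, warmup_runs=1, timed_runs=1):
    """Return the best observed throughput of routine(data) in GB/s.

    Hypothetical helper: one warm-up run and one measured run by default.
    """
    # Warm-up: results are discarded, only caches and lazy setup benefit.
    for _ in range(warmup_runs):
        routine(data)

    best_ns = None
    for _ in range(timed_runs):
        start = time.perf_counter_ns()
        routine(data)
        # Clamp to 1 ns so a very fast routine cannot cause a division by zero.
        elapsed_ns = max(time.perf_counter_ns() - start, 1)
        if best_ns is None or elapsed_ns < best_ns:
            best_ns = elapsed_ns

    # Bytes per nanosecond is numerically equal to GB/s (10^9 bytes per second).
    return len(data) / best_ns


if __name__ == "__main__":
    payload = bytes(1_000_000)  # stand-in input, not twitter.json
    print(f"{measure_gbps(bytes.isascii, payload):.2f} GB/s")
```

With the defaults, each benchmark calls the routine exactly twice, so the total running time scales with the number of benchmarks rather than with the repetition count.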