bwesterb closed this 1 year ago
Without overhead, with the current code that uses the slow scalar Keccak, we should be able to reach (in MB/s):
>>> 42 * 1024 / (145e-9 + 128*72e-9) / 1e6
4594.380942207029
Using pprof, I see that 1/3 of the time is spent in writeX2/LittleEndian.Uint64: interleaving and XORing the data into the buffers isn't free.
goos: darwin
goarch: amd64
pkg: github.com/cloudflare/circl/xof/k12
cpu: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
BenchmarkK12_100B-8 3521196 348.1 ns/op 287.26 MB/s
BenchmarkK12_10K-8 63129 18746 ns/op 533.45 MB/s
BenchmarkK12_100K-8 13159 90725 ns/op 1102.23 MB/s
BenchmarkK12_1M-8 2001 591119 ns/op 1691.71 MB/s
BenchmarkK12_10M-8 206 5876949 ns/op 1701.56 MB/s
PASS
ok github.com/cloudflare/circl/xof/k12 8.344s
Theoretical max is 2230 MB/s. Interleaving and XORing are still expensive; the overhead is just smaller relative to the speed of Keccak here.
@armfazh I addressed all your comments. Please have another look.
On M2 Pro:
For comparison: