Benchmark ECC implementations

Bulat-Ziganshin commented 2 years ago

Build a benchmark comparing performance of Leopard, FastECC, ISA-L and CM256.

Some existing benchmarks:

Leopard on 1000+100 config: https://hackmd.io/Jek5a2MaQEKHfcrXkXXRqw?view

Bulat-Ziganshin commented 2 years ago

Also, judging by the ISA-L benchmarking docs, its speed seems to be equivalent to Leopard's at about 50 parity blocks (fewer parity blocks favor ISA-L, more favor Leopard).

They provide raw numbers: https://01.org/sites/default/files/documentation/intel_isa-l_2.19_performance_report_0_0.pdf . AVX2 performance is on page 10: they reached 0.40 cycles per byte (c/b) for the 10+4 config, i.e. 0.1 c/b per parity block.

I rechecked it against the algorithm code: it takes just 5 SIMD operations to compute 32 bytes (with AVX2) of a single output parity block. In the ideal case, 5 SIMD operations execute in 2.5 CPU cycles, thus 2.5/32 ≈ 0.08 c/b.
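The ideal-case arithmetic above can be sketched as a small calculation. The figure of 2 SIMD operations issued per cycle is an assumption inferred from "5 operations in 2.5 cycles"; the function and parameter names are made up for illustration.

```python
def ideal_cycles_per_byte(simd_ops, ops_per_cycle, vector_bytes):
    """Best-case cycles per byte for one output parity block.

    simd_ops      -- SIMD operations needed per vector of output (5 here)
    ops_per_cycle -- assumed SIMD issue rate (2 per cycle, an assumption)
    vector_bytes  -- bytes produced per vector (32 for AVX2)
    """
    cycles = simd_ops / ops_per_cycle
    return cycles / vector_bytes


cpb = ideal_cycles_per_byte(simd_ops=5, ops_per_cycle=2, vector_bytes=32)
print(f"{cpb:.4f} c/b")  # 2.5 / 32
```

This is only a lower bound: it ignores memory bandwidth, table loads and any loop overhead.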

ISA-L is slightly slower than that, at about 0.1 c/b per parity block, so 50 parity blocks cost 5 c/b. At 2.5 GHz that is 500 MB/s per core, while Catid states 2 GB/s on an unspecified laptop (AFAIR he had a top 4-core Haswell): https://github.com/catid/leopard/blob/master/Benchmarks.md
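A rough sketch of this throughput estimate, assuming the cost scales linearly with the number of parity blocks and taking 1 MB as 10^6 bytes. The helper name and its parameters are hypothetical, not part of ISA-L or Leopard.

```python
def throughput_bytes_per_sec(freq_hz, cpb_per_parity, parity_blocks, cores=1):
    """Estimated encoding throughput in bytes per second.

    Assumes the cost per byte grows linearly with the parity block count
    and that throughput scales linearly with the number of cores.
    """
    total_cpb = cpb_per_parity * parity_blocks  # cycles per byte
    return cores * freq_hz / total_cpb


est = throughput_bytes_per_sec(freq_hz=2.5e9, cpb_per_parity=0.1, parity_blocks=50)
print(f"{est / 1e6:.0f} MB/s per core")
```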

Your data don't give enough information to identify the actual working frequencies, but both the i3-10110U and the i3-1115G4 have 2 cores with a max frequency of 4.1 GHz, so with ISA-L they should reach about 2 GB/s for 50 parity blocks. In your Leopard test, the first one delivers 1.5 GB/s and the second one 3.5 GB/s: https://hackmd.io/Jek5a2MaQEKHfcrXkXXRqw?view . So I wonder whether the benchmark is incorrect or the 11th gen really is much faster, and if so, why?
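To make the comparison concrete, the measured numbers can be converted back into an effective cost in cycles per byte. This is a back-of-the-envelope sketch that assumes both cores run at the 4.1 GHz max frequency for the whole test, which the data does not confirm; the function name is hypothetical.

```python
def implied_cycles_per_byte(measured_bytes_per_sec, freq_hz, cores):
    """Effective c/b implied by a measured throughput.

    Assumes every core runs at freq_hz for the whole run (an assumption;
    the real working frequency is unknown).
    """
    return cores * freq_hz / measured_bytes_per_sec


measured_gb_per_sec = {"i3-10110U": 1.5, "i3-1115G4": 3.5}  # from the Leopard test
implied = {
    cpu: implied_cycles_per_byte(gbps * 1e9, freq_hz=4.1e9, cores=2)
    for cpu, gbps in measured_gb_per_sec.items()
}
for cpu, cpb in implied.items():
    print(f"{cpu}: {cpb:.2f} c/b")
```

Under these assumptions the two results differ by a factor of about 2.3 in cycles per byte, so if the benchmark is right, the gap is more than a clock speed difference could explain.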

Bulat-Ziganshin commented 2 years ago

Implemented in https://github.com/Bulat-Ziganshin/ECC-Benchmark

It currently includes CM256, Leopard and FastECC.

codex-storage / nim-codex

Benchmark ECC implementations #136