Closed (lehuyduc closed this 10 months ago)
AMD processors before Zen 3 that implement PDEP and PEXT do so in microcode, with a latency of 18 cycles, compared with 3 cycles on Zen 3. As a result, it is often faster to use other instructions on these processors.
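To make "other instructions" concrete, here is a minimal sketch of a portable PDEP replacement built only from ordinary bit operations. It is an illustration, not code from either repository in this thread: the name `pdep_soft` is hypothetical, and on x86_64 with BMI2 the hardware equivalent is the `_pdep_u64` intrinsic in `core::arch::x86_64`.

```rust
/// Portable software equivalent of the BMI2 PDEP instruction
/// (hypothetical helper, not taken from the benchmarked code).
///
/// Deposits the low bits of `src` into the positions of the set bits
/// of `mask`, from the lowest set bit upward. All other bits are 0.
fn pdep_soft(src: u64, mut mask: u64) -> u64 {
    let mut result = 0u64;
    let mut bit = 1u64; // selects the next low bit of `src`
    while mask != 0 {
        // Isolate the lowest set bit of the mask.
        let lowest = mask & mask.wrapping_neg();
        if src & bit != 0 {
            result |= lowest;
        }
        // Clear that mask bit and move on to the next source bit.
        mask &= mask - 1;
        bit <<= 1;
    }
    result
}
```

The loop runs once per set bit in the mask, so it is only competitive when masks are sparse. A real non-PDEP parser would more likely restructure its data layout than emulate the instruction bit by bit.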
Great finding! Also, can you run 6 threads vs 12 threads, both with HT on? Your benchmark currently only has 6 threads with HT off vs 12 threads with HT on.
The difference is ~60% for both of our codes on the AMD 2950X. I'm curious what the number is on an Intel CPU.
Fixed the wrong rounding; the initial bad city was an off-by-one in the initialization.
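For readers wondering how a rounding mistake turns into off-by-one averages, here is a small sketch. It assumes temperatures are accumulated as integer tenths of a degree and that the mean is rounded to one decimal with halves going toward positive infinity (my reading of the challenge rules, not something stated in this thread). Both function names are hypothetical, and this is not necessarily the bug that was fixed here.

```rust
/// Mean of `count` readings whose sum is `sum_tenths` (tenths of a degree),
/// returned in tenths and rounded half toward positive infinity.
/// Hypothetical helper for illustration only.
fn mean_tenths_round_half_up(sum_tenths: i64, count: i64) -> i64 {
    assert!(count > 0);
    // floor(sum/count + 1/2) == floor((2*sum + count) / (2*count)).
    // `div_euclid` with a positive divisor is a floor division.
    (2 * sum_tenths + count).div_euclid(2 * count)
}

/// Same formula with plain `/`, which truncates toward zero.
/// For some negative means this lands one tenth too high.
fn mean_tenths_truncating(sum_tenths: i64, count: i64) -> i64 {
    (2 * sum_tenths + count) / (2 * count)
}
```

With a sum of -16 tenths over 10 readings (a mean of -0.16), the first function returns -2 (printed as -0.2), while the truncating version returns -1 (printed as -0.1). Positive means agree in both versions, which is why this kind of error only shows up for some stations.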
I don't think I will make a non-PDEP version, since it would use a different data storage format.
I'll make a new benchmark at some later point today.
Added latest results here: https://curiouscoding.nl/posts/1brc/
Thanks! Now the performance numbers make more sense (single thread HT vs no HT not much difference).
It seems Intel's HT is just worse than AMD's.
Hi, I tested your code with an original, officially generated test case (which still follows all your extra assumptions), but it gives a lot of wrong average values (off-by-one errors) and maybe some other errors. The input file and `result_ref.txt` can be downloaded here: https://github.com/lehuyduc/1brc-simd
I tried 3 different commits: `pdep parsing`, latest `cleanup`, and `fix simd imports for latest nightly`, but they all give wrong results. I also benchmarked 2 of them.
Could you check what's wrong? Thanks!
Also, if you upload your `measurements.txt` file, I can test it on my PC for a better comparison with your results.
Example of wrong values:
In the `pdep parsing` and `cleanup` commits:
Reference:
The run command I use is below.
I set the number of threads manually (and recompile each time):
Benchmark on a 2950X with 2133 MHz quad-channel RAM, running at 3.65 GHz (32 threads) to 4.3 GHz (1 thread)
Commit `pdep parsing` 021bed3738533a6d08aab7bfd5d936b92b1c029e
Commit `cleanup` 01e4bc3efb5184cf285b2f38dc1ac1fff5d640e0