Wrong result on test case, benchmark on 2950X

lehuyduc commented 10 months ago

Hi, I tested your code with an original officially generated test case (which still follows all your extra assumptions), but it gives a lot of wrong average value (off-by-one error) and maybe some others. The input file and the result_ref.txt can be downloaded here: https://github.com/lehuyduc/1brc-simd

I tried 3 differents commit: pdep parsing, latest cleanup, and fix simd imports for latest nightly, but they all give wrong results. I also benchmark 2 of them.

Could you check what's wrong? Thanks!

Also, if you upload your measurements.txt file, I can test it on my PC for better comparison with your results.

Example of wrong values: In pdep parsing and cleanup commit

Veracruz: 37.3/37.3/37.3
Abha: -37.5/17.9/69.9
Abidjan: -30.0/25.9/78.1
Abéché: -23.6/29.4/81.0
Accra: -23.1/26.3/75.5
...
Veracruz: -22.7/25.4/77.6 <--- this appear 2 times with 2 different results!!! Hash map error?

Reference

{Abha=-37.5/18.0/69.9, Abidjan=-30.0/26.0/78.1, Abéché=-23.6/29.4/81.0, Accra=-23.1/26.4/75.5, ... Veracruz=-22.7/25.4/77.6,

Run command I use is below.

cargo run --quiet --release -- measurements.txt --print > log.txt 2>&1
time ./target/release/one-billion-row-challenge measurements.txt --print

I set number of threads manually (then compile again each time):

let N_THREADS: usize = 6;

let records = run_parallel(
    data,
    &phf,
    num_slots,
    N_THREADS
);

Benchmark on 2950X, 2133MHz quad channel RAM, 3.65 GHz (32 thread) to 4.3 GHz (1 thread)

Commit pdep parsing 021bed3738533a6d08aab7bfd5d936b92b1c029e

32
total: 1.81s
real    0m1.965s
user    0m54.607s
sys 0m0.556s

16
total: 2.59s
real    0m2.745s
user    0m39.422s
sys 0m0.431s

12
total: 3.25s
real    0m3.412s
user    0m38.312s
sys 0m0.528s

8
total: 4.84s
real    0m4.994s
user    0m38.113s
sys 0m0.488s

6
total: 6.38s
real    0m6.537s
user    0m37.801s
sys 0m0.396s

4
total: 9.52s
real    0m9.678s
user    0m37.648s
sys 0m0.439s

1
total: 37.05s
real    0m37.211s
user    0m36.893s
sys 0m0.332s

Commit cleanup 01e4bc3efb5184cf285b2f38dc1ac1fff5d640e0

32
total: 589.26ms
real    0m0.744s
user    0m16.016s
sys 0m0.689s

16
total: 857.19ms
real    0m1.016s
user    0m12.893s
sys 0m0.512s

12
total: 1.11s
real    0m1.274s
user    0m12.684s
sys 0m0.480s

6
total: 2.12s
real    0m2.277s
user    0m12.399s
sys 0m0.417s

1
total: 12.19s
real    0m12.345s
user    0m12.004s
sys 0m0.365s

RagnarGrootKoerkamp commented 10 months ago

right I think I do rounding to 0 not actual rounding. Not a big issue.
cities appearing twice is definitely weird though. Will check.
from these timings I'd infer that your CPU doesn't support PDEP maybe? Does it have the BMI2 feature flag? Maybe it's emulated by the compiler now which could be slow. I'll try to update the code to print a warning if it's not supported.
you can just run with -j1 to run only 1 thread.
I'll download your input and use that from now.
the remaining performance difference is probably just differences in CPU. It does seem that yours has more L1 cache per thread (48kB Vs my 32kB), but that doesn't explain the slower speed.

RagnarGrootKoerkamp commented 10 months ago

Oh right this: https://en.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set#%3A%7E%3Atext%3Dwithout_affecting_flags-%2CParallel_bit_deposit_and_extract%2Cto_be_packed_or_unpacked.?wprov=sfla1

AMD processors before Zen 3 that implement PDEP and PEXT do so in microcode, with a latency of 18 cycles rather than (Zen 3) 3 cycles. As a result it is often faster to use other instructions on these processors.

lehuyduc commented 10 months ago

Great finding! Also, can you run 6 thread vs 12 thread both with HT on? Your benchmark currently only has 6t HT off vs 12t HT on.

The difference is ~60% for both of ours code on AMD 2950X. I'm curious what's the number on an Intel CPU

RagnarGrootKoerkamp commented 10 months ago

fixed the wrong rounding and the initial bad city was an off-by-one in the initialization.

I don't think I will make non-PDEP code, since it uses a different data storage format.

I'll make a new benchmark at some later point today.

RagnarGrootKoerkamp commented 10 months ago

Added latest results here: https://curiouscoding.nl/posts/1brc/

lehuyduc commented 10 months ago

Thanks! Now the performance numbers make more sense (single thread HT vs no HT not much difference).

It seems Intel's HT is just worse than AMD's.

RagnarGrootKoerkamp / 1brc

Wrong result on test case, benchmark on 2950X #1