RagnarGrootKoerkamp / research

Blog on PhD @ BMI ETH Zurich
https://curiouscoding.nl

posts/1brc/ #4

Open utterances-bot opened 6 months ago

utterances-bot commented 6 months ago

One Billion Row Challenge · CuriousCoding

Table of Contents

- The problem
- Initial solution: 105s
- First flamegraph
- Bytes instead of strings: 72s
- Manual parsing: 61s
- Inline hash keys: 50s
- Faster hash function: 41s
- A new flame graph
- Perf it is
- Something simple: allocating the right size: 41s
- memchr for scanning: 47s
- memchr crate: 29s
- get_unchecked: 28s
- Manual SIMD: 29s
- Profiling
- Revisiting the key function: 23s
- PtrHash perfect hash function: 17s
- Larger masks: 15s
- Reduce pattern matching: 14s
- Memory map: 12s
- Parallelization: 2.

https://curiouscoding.nl/posts/1brc/

lehuyduc commented 6 months ago

Hi, love your implementation! We have many similar ideas :D

Could you help me test my version on your machine? I'm very interested in hyper-threading ON vs OFF. I wish to test 6 cases in total: threads (1, 6, 12) × HT (on, off). I can't turn off HT on my test PC (but it has 128 threads, if you want to test that).

You can find my repo as well as the test file through here:

https://github.com/gunnarmorling/1brc/discussions/138

ants commented 6 months ago

Tried it out on an AMD Ryzen 9 3900X after swapping out the PDEP instruction for a series of masks and shifts. Zen 2 implements pdep via microcode and it is dog slow: 80+% of execution time went to that single instruction. But after that was done, parallelism 24 on a non-idle desktop system got me 887ms as reported by time (587ms measured in Rust).
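For reference, a generic bit-by-bit software fallback for `_pdep_u64` looks roughly like the sketch below. This is not the exact mask-and-shift sequence used in the replacement, just an illustration of what the software path has to do (`pdep_fallback` is a hypothetical name):

```rust
/// Hypothetical software fallback for `_pdep_u64`: deposits the low-order bits
/// of `src` into the bit positions that are set in `mask`, from LSB to MSB.
/// Useful on CPUs like Zen 2 where the hardware pdep is microcoded and slow.
fn pdep_fallback(src: u64, mut mask: u64) -> u64 {
    let mut result = 0u64;
    let mut src_bit = 1u64;
    while mask != 0 {
        // The lowest set bit of `mask` is the next destination position.
        let dest_bit = mask & mask.wrapping_neg();
        if src & src_bit != 0 {
            result |= dest_bit;
        }
        mask &= mask - 1; // clear the lowest set bit
        src_bit <<= 1;
    }
    result
}
```

When the mask is known at compile time, this loop collapses into a handful of constant shifts, ANDs, and ORs, which is presumably what the masks-and-shifts replacement amounts to.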

lehuyduc commented 6 months ago

Huh, 32 threads on a 2950X is exactly as fast as 24 threads on a 3900X :D What a coincidence!

RagnarGrootKoerkamp commented 6 months ago

Very sweet! This pdep situation is quite unfortunate. Masks & shifts are actually pretty straightforward here, so maybe I'll try that as well and see the perf difference, or at least add it as a fallback option for others.

I will try unmapping the data per thread instead, that may reduce the difference.
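Roughly, that would mean each thread creates (and later drops) its own mapping of just its chunk of the file rather than sharing one big map. A minimal sketch, assuming the memmap2 crate and a page-aligned chunk offset (`process_chunk` is just an illustrative name):

```rust
// Sketch only: each thread maps its own page-aligned slice of the input file,
// so the mapping is released as soon as that thread finishes, instead of one
// large shared map living until the end of the run.
use std::fs::File;
use memmap2::MmapOptions;

fn process_chunk(path: &str, offset: u64, len: usize) -> std::io::Result<()> {
    let file = File::open(path)?;
    // `offset` must be a multiple of the page size for mmap.
    let chunk = unsafe { MmapOptions::new().offset(offset).len(len).map(&file)? };
    // ... scan `&chunk[..]` for this thread's lines here ...
    Ok(())
    // `chunk` is dropped (unmapped) when it goes out of scope here.
}
```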

At around half a second, RAM speed is likely gonna be the bottleneck: the input is 13GB and typical RAM bandwidth caps out around 25GB/s I think, so streaming the data once already takes 13GB / 25GB/s ≈ 0.5s.

lehuyduc commented 6 months ago

I think I found the reason why your code is slower with HT on (or at least doesn't scale as well). I tried to optimize for the 10K case and started using __m256 in my code, and performance at 32 threads (on a 16c/32t CPU) immediately tanked by a huge amount: user time at 16 threads is 0m25.961s, but at 32 threads it's 1m36.769s. In my previous code, which only uses __m128, the user time difference is only ~10%. Also, running perf stat -d shows frontend cycles idle going from 3% to 18% and instructions per cycle dropping from 1.4 to 0.4. ~So I think the culprit is __m256 using 2x more vector registers than __m128i~

Edit: actually no, I removed __m256 and still got the same result. Maybe it's something to do with the L3 cache, but my CPU doesn't support measuring it in perf T_T

Edit 2: in my case, HT is only slower when running the 10K dataset, so it's very likely something to do with the L3 cache.