New C++ version 3x faster on 10k key dataset

buybackoff / 1brc

1BRC in .NET among fastest on Linux

https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-fastest-on-linux-my-optimization-journey/

MIT License


New C++ version 3x faster on 10k key dataset #13

Closed lehuyduc closed 8 months ago

lehuyduc commented 8 months ago

https://github.com/lehuyduc/1brc-simd

Hi, I've updated my code to optimize for the 10k keys dataset. On my PC it's ~3x faster (excluding munmap time) than the commit you tested. Default dataset performance is a bit slower.

Just run ./run_cpp.sh to compile and run.

To test the effect of hyper-threading, you can run ./run_cpp.sh 12 12 (12 == total number of threads on your CPU). You will see interesting effects on the 10K dataset :D
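A minimal sketch for finding the number to plug in on another machine, assuming the total thread count is what the OS reports as logical CPUs. The helper name hyperthreading_test_command is hypothetical and not part of the repository; it only builds the command string and does not run anything.

```python
import os


def hyperthreading_test_command(script="./run_cpp.sh"):
    """Build the all-logical-threads command line described in the comment.

    Hypothetical helper: it only formats a string. The script path is the
    one named in the thread; both arguments are set to the logical CPU count.
    """
    # os.cpu_count() returns the number of logical CPUs, or None if unknown.
    n = os.cpu_count() or 1
    return f"{script} {n} {n}"


if __name__ == "__main__":
    print(hyperthreading_test_command())
```

On a CPU with 12 logical threads this would print the same command as above.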

Thanks! Looking forward to your updated results.

buybackoff commented 8 months ago

[reaction GIF: ReallyOhSeriously]

buybackoff commented 8 months ago

@lehuyduc I assume you checked the output against the correct one? I'm too lazy to redo that every time.

lehuyduc commented 8 months ago

Yes. I tested on 3 different measurements.txt files, and the outputs are correct. If I find any new error, I'll fix it.

Could you test ./run_cpp.sh 12 12 too? It should show how hyper-threading hurts performance when there are many branch misses or L3 cache misses.

buybackoff commented 8 months ago

./run_cpp.sh 12 12 and ./run_cpp.sh 12 6 are not much different: the times would agree to the second decimal even if the delta > sigma. I didn't look too deeply into that.
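For readers who want to make the "delta > sigma" check concrete, here is a rough sketch. It assumes delta is the difference between the mean wall-clock times of the two configurations and sigma is the larger of the two sample standard deviations; the thread does not say how either was actually computed. The timings are made up for illustration and are not measurements from this thread.

```python
import statistics


def compare_runs(a, b):
    """Return (delta, sigma, significant) for two lists of run times in seconds.

    Assumption: delta is the absolute difference of the means, and sigma is
    the larger sample standard deviation, used as a conservative noise level.
    """
    delta = abs(statistics.mean(a) - statistics.mean(b))
    sigma = max(statistics.stdev(a), statistics.stdev(b))
    return delta, sigma, delta > sigma


# Made-up timings (seconds) for two hypothetical configurations.
runs_a = [1.012, 1.014, 1.013, 1.013, 1.013]
runs_b = [1.010, 1.012, 1.011, 1.011, 1.011]

delta, sigma, significant = compare_runs(runs_a, runs_b)

if __name__ == "__main__":
    print(f"mean A = {statistics.mean(runs_a):.2f} s, "
          f"mean B = {statistics.mean(runs_b):.2f} s")
    print(f"delta = {delta:.4f} s, sigma = {sigma:.4f} s, "
          f"delta > sigma: {significant}")
```

With these sample numbers the two means print identically at two decimals, even though the difference is larger than the run-to-run noise.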

lehuyduc commented 8 months ago

Huh, so I guess this is an AMD-specific problem. If I run with all virtual threads on a 2950X, it's much slower, like 30+% slower. Anyway, thanks for testing!

buybackoff commented 8 months ago

The blog update is deploying now.