buybackoff / 1brc

1BRC in .NET among fastest on Linux
https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-fastest-on-linux-my-optimization-journey/
MIT License

Add noahfalk entry #16

Closed noahfalk closed 7 months ago

noahfalk commented 8 months ago

Hi @buybackoff - thanks for the great article and for running this unofficial leaderboard. Would you mind including my entry in your run? https://github.com/noahfalk/1brc/

It's quite speedy in my own benchmarking, and I hope the results will be reproducible for you. I put some more info in the README on my repo, but if you have any questions about it feel free to reach out. Thanks!

buybackoff commented 8 months ago

This is an awesome next level! How is that even possible at all!? 😄

[image: benchmark results]

You left everyone so far behind! This result is comparable to the Default/JIT column in my tables in the blog post. While I update my numbers in a proper way, could you please run my latest commit and Java's artsiomkorzun/abeobk and update your table? That will establish a baseline for your machine and allow for relative scaling and a transitivity check. An additional independent run will be very helpful.

@lehuyduc FYI


lehuyduc commented 8 months ago

I really need to look into parsing multiple temperatures at once :sob: But wait a bit, a patch is coming that should narrow the gap

noahfalk commented 8 months ago

> I really need to look into parsing multiple temperatures at once 😭 But wait a bit, a patch is coming that should narrow the gap

I'm looking forward to it! I don't see why you wouldn't be able to replicate anything my solution is doing that is useful and get a comparable (or even better!) number.

> This is an awesome next level! How is that even possible at all!? 😄

Haha, thanks. In part I suspect the answer is being willing to chase the rabbit all the way down the hole and write a bunch of code :)

> could you please run my latest commit and Java's artsiomkorzun/abeobk and update your table?

Sure, though it will probably be later tonight before I can get that set up.

lehuyduc commented 8 months ago

Btw, how do you handle munmap? I think all the Java solutions use a subprocess and exit after the result is printed, to basically skip munmap. Edit: oh nvm, you use plain file reading throughout

buybackoff commented 8 months ago

@lehuyduc I've just experimented with a subprocess. I see an improvement, but it's much smaller than from another approach: on every thread I split the big per-thread chunk into two parts: a bigger one of 50% + (thread_id/num_threads) * 49%, and what's left. Each bigger chunk is unmapped at a different time and does not cause a lock for the other threads. The smaller chunks do the same, and they are much smaller and faster to unmap.
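To make the split arithmetic concrete, here is a minimal C# sketch of the idea (names and the file size are illustrative, not the actual implementation):

```csharp
// Staggered-unmap sketch: each thread's bigger chunk gets a different size,
// so the threads reach munmap at different moments instead of all
// contending on the kernel's mmap lock at once.
using System;

class StaggeredUnmap
{
    // The first part covers 50% + (threadId / numThreads) * 49% of the
    // chunk; the remainder is a much smaller part that unmaps quickly.
    static long SplitPoint(long chunkLength, int threadId, int numThreads)
    {
        double fraction = 0.50 + (threadId / (double)numThreads) * 0.49;
        return (long)(chunkLength * fraction);
    }

    static void Main()
    {
        const int threads = 12;
        const long chunk = 14_000_000_000L / threads; // per-thread share of a ~14GB file (illustrative)
        for (int t = 0; t < threads; t++)
            Console.WriteLine($"thread {t}: unmap the big part after {SplitPoint(chunk, t, threads)} bytes");
    }
}
```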

noahfalk commented 8 months ago

> Edit: oh nvm, you use plain file reading throughout

Yep. If you do want to see how my entry performs with memory mapping, there is a command-line option that enables it as an alternate code path: 1brc --io MM. I did nothing special to avoid munmap, and empirically the results of the naive memory mapping were a bit worse than reading from a file handle.

buybackoff commented 8 months ago

@noahfalk I've updated the tables and tweeted.

lehuyduc commented 8 months ago

@noahfalk Hi, could you test your result on https://cloud.vast.ai/, with this specific machine? It has the same specs as the official server. I have tested your code with dotnet publish -c release -r linux-x64, nearly 20 different configs in total (different numbers of threads, mmap vs RandomAccess, default vs 10K dataset). The fastest is

```
./1brc measurements.txt --threads 32
```

At ~375ms total. ./1brc measurements.txt is actually slower at 470ms; basically, your code is slower with 64 threads (that's why I have a disable-hyperthreading trick in my code).

~~Idk how to run with AOT compilation, so you can take -10% off your results.~~ Edit: oh, publish means AOT-compiled code.

I will push the full results later, but I want another person to run on the same machine to ensure the test results are correct. Also, the results you see on the leaderboard are from the new machine (AX161), which is 2x faster than CCX33 even at 8 threads.

For the dataset, you should use your own file to get the lowest number (I will also use my files to get the lowest numbers :rofl: ). But if you want to test my datasets:

[image]

lehuyduc commented 8 months ago

https://github.com/gunnarmorling/1brc/discussions/138#

I have added your results here. Also in my repo at benchmark_results/other_noahfalk.txt. Could you run my code again? Thanks!

Also, at this point which CPU/dataset is used can make a big difference. So if you send me the datasets you use, I can test more and have more data points.

lehuyduc commented 8 months ago

@buybackoff Btw, could you help us test performance at different thread counts, to be safe? For me it's run.sh 12 12 and for @noahfalk it's --threads 6. In my tests on the EPYC 7502 his code can be ~30% faster without hyperthreading. For my code, hyperthreading makes it faster on some CPUs and slower on others. Thanks!

buybackoff commented 8 months ago

@lehuyduc will do later. BTW, my actual CPU is an i5-13500 with 14 cores / 20 threads, but it's a hybrid one with 6 hyperthreaded performance cores (12 threads) and 8 efficiency cores. How would you test on that one? So far I have avoided the hybrid side of it and used only the P-cores. The results on all cores have a high sigma.

buybackoff commented 8 months ago

@lehuyduc your latest default runs on my machine with 12/12 and 12/6:

[image: benchmark results]

The 6 vs 12 difference is visible in the 10K runs:

12: [image]

6: [image]

lehuyduc commented 8 months ago

> my actual CPU is an i5-13500 with 14 cores / 20 threads,

I haven't used any hybrid CPUs, so I don't know how to test properly on those. I guess we just lock the process to the P-cores using taskset or numactl:

```
numactl --physcpubind=0-11 ./main
taskset -c 0,1,2,3,4,5,6,7,8,9,10,11 ./main
```

Or you can disable the efficiency cores completely: https://unix.stackexchange.com/questions/686459/disable-intel-alder-lake-efficiency-cores-on-linux

buybackoff commented 8 months ago

@lehuyduc I do the locking at the LXC level: `lxc.cgroup2.cpuset.cpus: 0-11`

BTW, I have just realized you are still the fastest on 10K on my machine with your latest changes.

lehuyduc commented 8 months ago

Thanks for the results! Hmm, the difference between versions 22 and 23 looks much smaller here than in my tests on AMD CPUs; not sure why. Anyway, more data points are good!

Oh, can you send me the measurements.txt and measurements_10k.txt files you used? Maybe that's the reason. It's my final version, so now I just want to benchmark different datasets instead of changing anything. Thanks!

noahfalk commented 8 months ago

> While I update my numbers in a proper way, could you please run my latest commit and Java's artsiomkorzun/abeobk and update your table?

Here are the results I saw: https://gist.github.com/noahfalk/67a3bb32af4d8aa5cb672a25f7b4c6ab

Overall my results for the Java contenders looked 5-10% faster than Gunnar's official results. I'll add your entry to my table in the README if you want me to. If you don't mind, I'm going to refrain from adding the Java entries unless their owners specifically ask me to do so. Instead I'm just going to update my README to mention that my machine appears to be skewing a little fast, and refer to your results, where it's easier to see everything run on the same hardware.

> @noahfalk Hi, could you test your result on https://cloud.vast.ai/, with this specific machine? It has the same specs as the official server.

I'll see what's involved :)

> Idk how to run with AOT compilation, so you can take -10% off your results.

If you ran it using the instructions from the README, you did get the AOT version:

```
cd noahfalk_1brc/1brc/
dotnet publish -c release -r linux-x64
cd bin/release/net8.0/linux-x64/publish/
hyperfine -w 1 -r 5 -- "./1brc ~/git/1brc_data/measurements.txt"
```

If you ran a 1brc executable that wasn't in the publish directory, you may have gotten a jitted version instead.

> Also, the results you see on the leaderboard are from the new machine (AX161), which is 2x faster than CCX33 even at 8 threads.

Haha, I missed that they changed the machine... sigh :) Thankfully the empirical results on a few of the Java entries looked close(ish) on my machine vs. the official ones, so it certainly doesn't appear to be 2x faster?

noahfalk commented 8 months ago

> For the dataset, you should use your own file to get the lowest number

So far I've always used the datasets produced by create_measurements.sh and create_measurements3.sh. I assumed those were deterministic generators; are they not? I didn't think input selection was even a variable!

lehuyduc commented 8 months ago

> So far I've always used the datasets produced by create_measurements.sh and create_measurements3.sh. I assumed those were deterministic generators; are they not? I didn't think input selection was even a variable!

Nope, they're not deterministic. So everyone actually optimizes their hash table constants using their own datasets :rofl: The difference between a lucky and an unlucky dataset can be 10% or more in hash collisions!
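A hypothetical C# sketch of why this matters: count how many buckets of a simple multiplicative hash end up holding more than one station name for a given dataset. The hash, constant, and table size here are illustrative examples, not any entry's actual code:

```csharp
// Usage: CollisionCheck <measurements file>
// Counts hash-bucket collisions among the distinct station names.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class CollisionCheck
{
    static void Main(string[] args)
    {
        const ulong multiplier = 0x9E3779B97F4A7C15UL; // the kind of constant people tune
        const int tableSize = 1 << 15;                 // power-of-two bucket count

        // Collect the distinct station names (everything before the ';').
        var names = new HashSet<string>();
        foreach (var line in File.ReadLines(args[0]))
            names.Add(line.Substring(0, line.IndexOf(';')));

        var buckets = new int[tableSize];
        foreach (var name in names)
        {
            ulong h = 0;
            foreach (char c in name)
                h = h * multiplier + c;
            buckets[(int)(h & (ulong)(tableSize - 1))]++;
        }

        Console.WriteLine($"buckets holding more than one name: {buckets.Count(b => b > 1)}");
    }
}
```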

> If you ran it using the instructions from the README, you did get the AOT version:

Oh, I followed the README. That's good to hear.

> Thankfully the empirical results on a few of the Java entries looked close(ish) on my machine vs. the official ones, so it certainly doesn't appear to be 2x faster?

So at the time of the change, everyone's results on the leaderboard became ~2x faster. Maybe after many more rounds of optimization the difference has become smaller. Edit: oh, I forgot that performance of the same instance type can be wildly different based on which data center you get assigned to (the judge actually found this earlier). So I guess you got assigned an instance with a new CPU.

lehuyduc commented 8 months ago

@noahfalk I've added another benchmark of your code for you, this time on a 10980XE (18c36t). Now your code is faster with hyperthreading, while on the EPYC 7502 it was slower. The results are much slower than on the EPYC 7502 though, likely because the machine has only average bandwidth (23 GB/s in my test). See other_noahfalk_10980xe.txt and other_noahfalk_10980xe_10k.txt. Also check out other_noahfalk.txt if you haven't.

Another thing I found is that on the 10K dataset my code is much faster on AMD. Not sure if it's slower just on this CPU or on Intel in general (all my code was tested on AMD, so it's understandable). Below is a comparison that also includes a 2950X (16c32t, ~24 GB/s RAM bandwidth). Note how on the 10980XE both our solutions take ~1.4s for the 10K dataset, but on the EPYC 7502 my code is 2.55x faster while yours is 1.47x faster.

| Name/Test | Original, 1 thread | Original, 8 threads | Original, 32/36 threads | 10K, 1 thread | 10K, 8 threads | 10K, 32/36 threads |
| --- | --- | --- | --- | --- | --- | --- |
| Intel 10980XE | 10.231s | 1.515s | 0.592s | 15.876s | 2.491s | 1.415s |
| AMD 2950X | 7.972s | 1.075s | 0.446s | 11.103s | 1.537s | 1.022s |
| EPYC 7502 | 10.720s | 1.358s | 0.374s | 14.095s | 1.810s | 0.554s |
| 10980XE (noah) | 7.663s | 1.086s | 0.553s | 19.211s | 2.663s | 1.463s |
| EPYC 7502 (noah) | 7.060s | 0.927s | 0.391s | 20.882s | 2.719s | 0.994s |

Also, the 2950X and 10980XE were released around the same time and are in the same performance class, so they shouldn't be this different. I'm not sure what causes this.

Benchmarking is hard 😢 Especially when your code will be run on totally different hardware

noahfalk commented 8 months ago

Thanks @lehuyduc, you've been much more diligent at benchmarking than I! Agreed that hardware seems to impact the results considerably.

One potential reason for some of the different results is that the 10980XE has 1MB of L2 cache per core whereas the EPYC only has 512KB. I wouldn't be surprised if my entry on the 10K dataset accesses more than 512KB of working memory, whereas your entry might be more efficient and stay inside that envelope.
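As a rough back-of-the-envelope (illustrative sizes, not measurements): the 10K dataset has ~10,000 distinct stations, so at, say, 64 bytes per hash-table entry for the key reference plus min/max/sum/count accumulators, the hot working set is already around 10,000 × 64B ≈ 640KB per core, above the EPYC's 512KB L2 but still within the 10980XE's 1MB.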

If I get a chance to do some deeper analysis I'll certainly let you know what I find :)

noahfalk commented 8 months ago

Oh, I almost forgot to mention: I did update my code last night with a slightly more helpful --timings option. I don't expect it to have any meaningful effect on performance, but it might help shed more light on single-threaded startup/shutdown overheads vs. parallel performance. No need for you to do anything with it, of course, but if you are curious it's there.

lehuyduc commented 8 months ago

@noahfalk Last benchmark from me until the end of the contest, because I'll be on vacation :laughing: Also, if you have Twitter, could you share the results? I really don't like using Twitter.

Dual EPYC 9354, 1TB DDR5. Bandwidth = 1.4585e+11 bytes/s (~146 GB/s), compared to the EPYC 7502 at 5.79936e+10 bytes/s (~58 GB/s).

For some reason, your code plateaus hard around 32 threads on the 10K dataset. I have included runs with --timings so you can try to find out why. But if you just want to get the smallest possible number at 1-16 threads, then you should find a 13900K. I couldn't find any.

[image: benchmark results]

buybackoff commented 7 months ago

Hopefully, over the next weekend (starting tomorrow evening) I will have access to big last-gen bare-metal AMD and Intel machines (Ubuntu 22). If not, I was going to use AWS metal anyway. In an ideal world I may do all of that, if a weekend is enough time.

@lehuyduc what are the exact commands you use for these 1/../128 results?

And for all of @lehuyduc @noahfalk @nietras and others who read this: what would you suggest paying attention to, and what are the best practices to test? I'm going to use numactl as upstream does and just see how it goes with different core counts, HT, and Turbo on/off. It would be nice to develop a reproducible methodology, or just to discuss that.

buybackoff commented 7 months ago

I will also publish the exact datasets I've been using. I do remember @lehuyduc asked for them a couple of days ago. I have already uploaded them to S3; I just need to configure the sharing part.

lehuyduc commented 7 months ago

> @lehuyduc what are the exact commands you use for these 1/../128 results?

Previously, I used hyperfine --warmup 1 --runs 10 './main measurements_10k.txt' for my code. For noahfalk's code, it's:

```
hyperfine -w 1 -r 5 -- "./1brc /root/1brc-simd/measurements_10k.txt --threads 32 > log.txt"
```

But for the latest numbers above I just ran ./main measurements_10k.txt ~10 times and took the minimum result. Same with other people's code. Definitely not scientific, but it's more fun, and it also gives everybody better numbers. As you can see in the image below, the results are more than stable enough.

[image: run-to-run timings]

lehuyduc commented 7 months ago

numactl can be a problem. If we use numactl --physcpubind=0-7 to test on a 32c64t PC, everyone's code will try to use 32/64 threads, which is very inefficient when there are only 8 cores. For example, I found that @artsi's solution can be a lot slower: 2.4s when compiled manually with 8 threads and then run, but 3s if numactl --physcpubind=0-7 is used.

So I think if the code supports the number of threads as an argument, we should use that (a minimal sketch of that pattern is below). If not, then use numactl, which will give it a performance penalty (but we don't have time to manually change everyone's code, so it's fair).
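For illustration, a minimal C# sketch of that pattern; the flag name mirrors noahfalk's --threads option, but the parsing itself is hypothetical, not any entry's actual CLI:

```csharp
// Pick the worker count: an explicit --threads argument wins, otherwise
// fall back to whatever processor count the OS/affinity mask exposes.
using System;

class ThreadCountDemo
{
    static int GetThreadCount(string[] args)
    {
        for (int i = 0; i < args.Length - 1; i++)
            if (args[i] == "--threads" && int.TryParse(args[i + 1], out int n) && n > 0)
                return n;
        return Environment.ProcessorCount;
    }

    static void Main(string[] args) =>
        Console.WriteLine($"running with {GetThreadCount(args)} threads");
}
```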

buybackoff commented 7 months ago

> numactl can be a problem. If we use numactl --physcpubind=0-7 to test on a 32c64t PC, everyone's code will try to use 32/64 threads, which is very inefficient when there are only 8 cores.

Interesting, how does Java deal with that? Given that most of the implementations just use the system's processor count. I thought/hoped numactl somehow makes the other CPUs invisible.

In that case I may go the LXC way again. It's just a single line to create whatever sub-CPU config I need. I may even be able to export an image and put it on R2 for exact reproducibility.

buybackoff commented 7 months ago

I've just checked, and Environment.ProcessorCount works fine with numactl --physcpubind=0-7; it correctly detects only 8 CPUs. The only issue is native code recompilation.
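The whole check fits in one file (a sketch; the program name is arbitrary):

```csharp
// Print the processor count the runtime sees. Run it under
// numactl --physcpubind=0-7 and it reports 8, as described above.
using System;

class ProcCount
{
    static void Main() => Console.WriteLine(Environment.ProcessorCount);
}
```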

lehuyduc commented 7 months ago

> Interesting, how does Java deal with that?

They don't; basically, everyone on the Java leaderboard is suffering some performance penalty.

buybackoff commented 7 months ago

> But for the latest numbers above I just ran ./main measurements_10k.txt ~10 times and took the minimum result. Same with other people's code. Definitely not scientific, but it's more fun, and it also gives everybody better numbers. As you can see in the image below, the results are more than stable enough.

@lehuyduc BTW, I also check the hyperfine output with my own eyes; when it's not very stable (e.g. after switching from default to 10K, when the page cache is not warm yet) I just re-run until sigma is low and stable. But since I report min/max/sigma in the full results, I believe the average on a quiet machine is a better metric, because the workload is not only CPU-bound but RAM-bandwidth-bound, especially on many cores. Some fluctuations are expected even with a frozen CPU frequency. Also, JIT runs are visibly less stable than AOT runs, and sigma is an interesting metric in itself.

buybackoff commented 7 months ago

Hello @lehuyduc

Could you please check how I run your code from the bash scripts here: https://github.com/buybackoff/1brc-bench/blob/main/run.sh#L76

We discussed how to run with 12 12 vs 12 6, but I think the relevant config was 12 12 vs 6 6, i.e. with HT use every logical core. Then I see the difference.

For your code I always pass the number of threads as both parameters. It makes it much faster on my default machine: 2.6s with 12 12, as published, vs 2.95s with 6 6.

I've created a repo with a bunch of scripts and am now going to run it on big bare metal: https://github.com/buybackoff/1brc-bench/

lehuyduc commented 7 months ago

@buybackoff Oh, I'm not back from my vacation yet, so I can't make those changes at the moment. You can use whichever best number you get, thanks! It's the last day anyway :D