This is an awesome next level! How is that even possible at all!? 😄
You left everyone so far behind! This result is comparable to the Default/JIT column in my tables in the blog post. While I update my numbers in a proper way, could you, please, run my latest commit and Java's artsiomkorzun/abeobk and update your table? That will establish some baseline for your machine and will allow for relative scaling and transitivity check. An additional independent run will be very helpful.
@lehuyduc FYI
I really need to look into parsing multiple temperatures at once :sob: But wait a bit, a patch is coming that should make the gap smaller
I really need to look into parsing multiple temperatures at once :sob: But wait a bit, a patch is coming that should make the gap smaller
I'm looking forward to it! I don't see why you wouldn't be able to replicate anything my solution is doing that is useful and get a comparable (or even better!) number.
This is an awesome next level! How is that even possible at all!? 😄
Haha, thanks. In part I suspect the answer is being willing to chase the rabbit way down the hole and write a bunch of code : )
could you, please, run my latest commit and Java's artsiomkorzun/abeobk and update your table?
Sure, that will probably be later tonight before I can get that set up.
Btw, how do you handle `munmap`? I think all Java solutions use a subprocess and exit after the result is printed to basically skip through `munmap`. Edit: oh nvm, you use file reading completely
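(For anyone reading along, that subprocess trick is roughly shaped like the sketch below; it's just an illustration of the idea, not any particular entry's code.)

```cpp
// Sketch of the subprocess trick (illustrative only).  The child does all the
// work and prints the result; the parent exits as soon as the output is
// complete, so the child's expensive teardown (munmap of the huge mapping)
// happens after the timed process has already exited.
#include <unistd.h>
#include <cstdio>

void solve_and_print();  // hypothetical: mmap the file, aggregate, print the result

int main() {
    int done_pipe[2];
    if (pipe(done_pipe) != 0) return 1;
    pid_t child = fork();
    if (child == 0) {                          // child: do the heavy lifting
        close(done_pipe[0]);
        solve_and_print();
        fflush(stdout);
        (void)write(done_pipe[1], "x", 1);     // signal: result is fully printed
        return 0;                              // munmap/exit cleanup happens here, unmeasured
    }
    close(done_pipe[1]);
    char c;
    (void)read(done_pipe[0], &c, 1);           // wait only for the "done" byte
    _exit(0);                                  // parent (the timed process) stops the clock here
}
```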
@lehuyduc I've just experimented with a subprocess; I see an improvement, but it's much less than another approach: on every thread I split the big per-thread chunk into two, a bigger one of `50% + (thread_id/num_threads) * 49%` and what's left. Every bigger chunk is unmapped at a different time and does not cause a lock for other threads. The smaller chunks do the same, and they are much smaller and faster to unmap.
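Roughly, the shape of it is the sketch below (illustrative only, not the real code; it assumes each thread mmaps its own page-aligned slice of the file, and `process()` is a placeholder for the actual parsing):

```cpp
#include <sys/mman.h>
#include <cstddef>

void process(const char* data, size_t len);  // placeholder for the real per-chunk parser

void worker(char* chunk, size_t chunk_len, int thread_id, int num_threads) {
    // Split the per-thread mapping into two pieces.  The first piece gets a
    // thread-dependent share (roughly 50%..99%), so every thread finishes it,
    // and calls munmap, at a different time instead of all threads contending
    // on the kernel's mmap lock at the very end.
    size_t first_len = chunk_len / 2 + (chunk_len * 49 / 100) * thread_id / num_threads;
    first_len &= ~size_t(4095);                 // keep the split page-aligned for munmap

    process(chunk, first_len);
    munmap(chunk, first_len);                   // staggered: a different moment on each thread

    process(chunk + first_len, chunk_len - first_len);
    munmap(chunk + first_len, chunk_len - first_len);  // small remainder, cheap to unmap
}
```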
Edit: oh nvm, you use file reading completely
Yep. If you do want to see how my entry performs with memory mapping, there is a command-line option that enables it as an alternate code path: `1brc --io MM`. I did nothing special to avoid `munmap`, and empirically the results of the naive memory mapping were a bit worse than reading from a file handle.
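The file-handle path is just a boring chunked-read loop, shaped roughly like the C-ish sketch below (my entry is C#, so this is only an illustration of the idea, not my code, and it glosses over records that straddle buffer boundaries):

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <algorithm>
#include <cstddef>
#include <vector>

void parse_block(const char* buf, size_t len);  // placeholder parser

void read_range(const char* path, off_t begin, off_t end) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;
    std::vector<char> buf(1 << 20);                           // reusable 1 MiB buffer per thread
    for (off_t off = begin; off < end; ) {
        off_t want = std::min<off_t>(end - off, (off_t)buf.size());
        ssize_t got = pread(fd, buf.data(), (size_t)want, off);  // positioned read, no shared cursor
        if (got <= 0) break;
        parse_block(buf.data(), (size_t)got);
        off += got;
    }
    close(fd);                                                // no mapping to tear down at shutdown
}
```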
@noahfalk I've updated the tables and tweeted.
@noahfalk Hi, could you test your result on https://cloud.vast.ai/, with this specific machine? It has the same spec as the official server. I have tested your code with `dotnet publish -c release -r linux-x64`, nearly 20 different configs total (different numbers of threads, mmap vs RandomAccess, default vs 10K dataset). The fastest is `./1brc measurements.txt --threads 32` at ~375ms total. `./1brc measurements.txt` is actually slower at 470ms; basically your code is slower with 64 threads (that's why I have a disable-hyper-threading trick in my code).
~Idk how to run with AOT compilation, so you can -10% from your results.~ Oh, `publish` means AOT-compiled code.
I will push the full result later. But I want another person to run on the same machine to ensure the test results are correct. Also, the results you see on the leaderboard are on the new machine (AX161), which is 2x faster than CCX33 even at 8 threads.
For the dataset, you should use your own file to get the lowest number (I will also use my files to get the lowest numbers :rofl: ). But if you want to test my datasets: `cat measurements.txt >> measurements_10k.txt` 5 times to get 1B lines. https://github.com/gunnarmorling/1brc/discussions/138#
I have added your results here. Also in my repo at `benchmark_results/other_noahfalk.txt`. Could you run my code again? Thanks!
Also at this point which CPU/dataset is used can make a big difference. So if you send me the datasets you use I can test more and have more data points.
@buybackoff Btw, could you help us test performance at different thread counts, to be safe? For me it's `run.sh 12 12` and for @noahfalk it's `--threads 6`. In my test on EPYC 7502 his code can be ~30% faster without hyper threading. For my code hyper threading makes it faster on some CPUs and slower on others. Thanks!
@lehuyduc will do later. BTW, my actual CPU is an i5-13500 with 14 cores / 20 threads, but a hybrid one with 6 hyper-threaded performance cores (12 threads) and 8 efficiency cores. How would you test on that one? So far I avoided the hybrid side of it and used only P-cores. The results on all cores have a high sigma.
@lehuyduc your latest default runs on my machine with 12/12 and 12/6
The 6 vs 12 difference is visible on the 10K runs (hyperfine output screenshots for the 12-thread and 6-thread runs).
my actual CPU is i5-13500 with 14 cores / 20 threads,
I haven't used any hybrid CPUs so I don't know how to test properly on those. I guess we just lock the process on P-cores using `taskset` or `numactl`:
`numactl --physcpubind=0-11 ./main`
`taskset -c 0,1,2,3,4,5,6,7,8,9,10,11 ./main`
Or you can disable efficiency cores completely: https://unix.stackexchange.com/questions/686459/disable-intel-alder-lake-efficiency-cores-on-linux
@lehuyduc I do locking at LXC level: lxc.cgroup2.cpuset.cpus: 0-11
BTW, I have just realized you are still the fastest on 10K on my machine with your latest changes.
Thanks for the results! Hmm, the difference between version 22 and 23 looks much smaller here than in my tests on AMD CPUs, not sure why. Anyway, more data points are good!
Oh, can you send me the `measurements.txt` and `measurements_10k.txt` files you used? Maybe that's the reason. It's my final version, so now I just want to benchmark different datasets instead of changing anything. Thanks!
While I update my numbers in a proper way, could you, please, run my latest commit and Java's artsiomkorzun/abeobk and update your table?
Here are the results I saw: https://gist.github.com/noahfalk/67a3bb32af4d8aa5cb672a25f7b4c6ab
Overall my results for the Java contenders looked 5-10% faster than Gunnar's official results. I'll add your entry to my table in the README if you want me to? If you don't mind I'm going to refrain from adding the Java entries unless their owners specifically ask me to do so. Instead I'm just going to update my README to mention my machine appears to be skewing a little fast and refer to your results, where it's easier to see everything run on the same hardware.
@noahfalk Hi, could you test your result on https://cloud.vast.ai/, with this specific machine? It has the same spec as the official server.
I'll see what's involved :)
Idk how to run with AOT compilation, so you can -10% from your results.
If you ran it using the instructions from the README you did get the AOT version:
cd noahfalk_1brc/1brc/
dotnet publish -c release -r linux-x64
cd bin/release/net8.0/linux-x64/publish/
hyperfine -w 1 -r 5 -- "./1brc ~/git/1brc_data/measurements.txt"
If you ran a 1brc executable that wasn't in the publish directory you may have gotten a jitted version instead.
Also, the results you see on the leaderboard are on the new machine (AX161), which is 2x faster than CCX33 even at 8 threads.
Haha I missed that they changed the machine... sigh :) Thankfully the empirical results on a few of the Java entries looked close(ish) on my machine vs. the official so it certainly doesn't appear to be 2x faster?
For the dataset, you should use your own file to get the lowest number
So far I've always used the dataset produced by create_measurements.sh and create_measurements3.sh. I assumed those were deterministic generators, are they not? I didn't think input selection was even a variable!
So far I've always used the dataset produced by create_measurements.sh and create_measurements3.sh. I assumed those were deterministic generators, are they not? I didn't think input selection was even a variable!
Nope, they're not deterministic. So everyone actually optimizes their hash table constants using their own datasets :rofl: The difference between a lucky and an unlucky dataset can be 10% or more in hash collisions!
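("Tuning" here mostly means picking a multiplier that happens to collide less on the names you test with. The toy check below is a made-up helper just to illustrate what gets measured, not anyone's actual entry.)

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Count how many station names land on an already-occupied slot of a
// fixed-size table for a given hash multiplier.  A "lucky" dataset/constant
// pair gives noticeably fewer collisions, hence the 10%+ swings.
size_t count_collisions(const std::vector<std::string>& names,
                        uint64_t multiplier, unsigned table_bits) {
    const size_t mask = (size_t(1) << table_bits) - 1;
    std::vector<bool> used(mask + 1, false);
    size_t collisions = 0;
    for (const std::string& name : names) {
        uint64_t h = 0;
        for (unsigned char c : name)
            h = h * multiplier + c;            // simple multiplicative hash
        size_t slot = size_t(h) & mask;
        if (used[slot]) ++collisions;
        else used[slot] = true;
    }
    return collisions;
}
```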
If you ran it using the instructions from the README you did get the AOT version:
Oh I followed the README. That's good to hear.
Thankfully the empirical results on a few of the Java entries looked close(ish) on my machine vs. the official so it certainly doesn't appear to be 2x faster?
So at the time of the change, everyone's results on the leaderboard became ~2x faster. Maybe after many more rounds of optimization, the difference has become smaller. Edit: oh, I forgot that performance of the same instance can be wildly different based on which data center you get assigned to (the judge actually found this earlier). So I guess you got assigned an instance with a new CPU.
@noahfalk I've added another benchmark of your code for you, this time on a 10980XE (18c36t). Now your code is faster with hyperthreading, while on the EPYC 7502 it was slower. The results are much slower than on the EPYC 7502 though, likely because the machine has average bandwidth (23 GB/s using my test). See `other_noahfalk_10980xe.txt` and `other_noahfalk_10980xe_10k.txt`. Also check out `other_noahfalk.txt` if you haven't.
Another thing I found is that on the 10K dataset, my code is much faster on AMD. Not sure if it's slower just on this CPU or on Intel in general (all my code was tested on AMD, so it's understandable). Below is a comparison with a 2950X (16c32t) at 24 GB/s RAM bandwidth. Note how on the 10980XE both our codes take ~1.4 seconds for the 10K dataset, but then on the EPYC 7502 my code is 2.55x faster while yours is 1.47x faster.
Name/Test | Original, 1 thread | Original, 8 threads | Original, 32/36 threads | 10K, 1 thread | 10K, 8 threads | 10K, 32/36 threads |
---|---|---|---|---|---|---|
Intel 10980XE | 10.231s | 1.515s | 0.592s | 15.876s | 2.491s | 1.415s |
AMD 2950X | 7.972s | 1.075s | 0.446s | 11.103s | 1.537s | 1.022s |
EPYC 7502 | 10.720s | 1.358s | 0.374s | 14.095s | 1.810s | 0.554s |
10980XE (noah) | 7.663s | 1.086s | 0.553s | 19.211s | 2.663s | 1.463s |
EPYC 7502 (noah) | 7.060s | 0.927s | 0.391s | 20.882s | 2.719s | 0.994s |
Also 2950X and 10980XE were released around the same time, and are in the same performance class, so it shouldn't be this different. I'm not sure what causes this.
Benchmarking is hard 😢 Especially when your code will be run on totally different hardware
Thanks @lehuyduc, you've been much more diligent at benchmarking than I! Agreed that hardware seems to impact the results considerably.
One potential reason for some of the different results is I see the 10980XE has 1MB of L2 cache per core whereas the EPYC only has 512KB. I wouldn't be surprised if my entry on the 10K dataset accesses more than 512KB of working memory whereas your entry might be more efficient and stay inside that envelope.
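Very rough back-of-envelope, with all sizes assumed purely for illustration (not measured from either entry): ~10,000 distinct stations with a table kept around half full already lands between those two L2 sizes.

```cpp
#include <cstddef>

// Back-of-envelope only; entry sizes and load factor are assumptions.
constexpr std::size_t kStations   = 10'000;          // distinct keys in the 10K dataset
constexpr std::size_t kSlots      = 16'384;          // power-of-two table, ~60% load
constexpr std::size_t kSlotBytes  = 32;              // key ref + min/max/sum/count per slot
constexpr std::size_t kKeyBytes   = 16;              // average stored name, padded
constexpr std::size_t kTableBytes = kSlots * kSlotBytes;      // 524,288 B (512 KiB)
constexpr std::size_t kKeyStore   = kStations * kKeyBytes;    // 160,000 B
constexpr std::size_t kWorkingSet = kTableBytes + kKeyStore;  // ~684 KB: over a 512 KiB L2,
                                                              // but inside a 1 MiB L2
```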
If I get a chance to do some deeper analysis I'll certainly let you know what I find :)
Oh, almost forgot to mention: I did update my code last night with a slightly more helpful `--timings` option. I don't expect it to have any meaningful change in performance, but it might help shed more light on single-threaded startup/shutdown overheads vs parallel performance. No need for you to do anything with it of course, but if you are curious it's there.
@noahfalk Last benchmark from me until the end of the contest because I'll be on vacation :laughing: Also if you have Twitter, could you share the results? I really don't like using Twitter.
Dual EPYC 9354, 1TB DDR5. Bandwidth = 1.4585e+11 bytes/s, compared to the EPYC 7502's 5.79936e+10 bytes/s.
For some reason, your code plateaus hard around 32 threads on the 10K dataset. I have included runs with `--timings` so you can try to find out why. But if you just want to get the smallest number possible on 1-16 threads, then you should find a 13900K. I couldn't find any.
Hopefully, over the next weekend (starting tomorrow evening) I will have access to big last-gen bare-metal AMD and Intel machines (Ubuntu 22). If not, I was going to use AWS metal anyway. In an ideal world I may do all of that, if a weekend is enough time-wise.
@lehuyduc what are the exact commands you use for these 1/../128 results?
And for all @lehuyduc @noahfalk @nietras and others who read this, what would you suggest paying attention to, and what are best practices to test? I'm going to use `numactl` as upstream does and just see how it goes with different core counts, HT, and Turbo on/off. It would be nice to develop a reproducible methodology, or just to discuss that.
I will also publish the exact datasets I've been using. I do remember @lehuyduc asked for them a couple of days ago. I have uploaded them already to S3, just need to configure the sharing part.
@lehuyduc what are the exact commands you use for these 1/../128 results?
Previously, I used `hyperfine --warmup 1 --runs 10 './main measurements_10k.txt'` for my code. For noahfalk's code, it's
`hyperfine -w 1 -r 5 -- "./1brc /root/1brc-simd/measurements_10k.txt --threads 32 > log.txt"`
But for the latest numbers above I just run `./main measurements_10k.txt` ~10 times and take the minimum result. Same with other people's code. Definitely not scientific, but it's more fun, and it also gives everybody better numbers. As you can see in the image below, the results are more than stable enough.
`numactl` can be a problem. If we use `numactl --physcpubind=0-7` to test on a 32c64t PC, everyone's code will try to use 32/64 threads, which is very inefficient when there are only 8 cores. For example, I found that @artsi's solution can be a lot slower: 2.4s when compiled manually with 8 threads and then run, but 3s if `numactl --physcpubind=0-7` is used.
So I think if the code supports the number of threads as an argument, we should use that. If not, then use `numactl`, which will give them a performance penalty (but we don't have time to manually change everyone's code, so it's fair).
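(The underlying fix, if an entry wants to respect numactl/taskset properly, is to size the thread pool from the CPUs the process is actually allowed to run on instead of the machine total; on Linux that's roughly the sketch below, illustrative only.)

```cpp
#include <sched.h>
#include <thread>

// Number of CPUs the process may actually run on (respects numactl/taskset
// cpusets), falling back to the machine-wide count.  Linux/glibc sketch.
unsigned usable_cpus() {
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) == 0)
        return (unsigned)CPU_COUNT(&set);      // e.g. 8 under --physcpubind=0-7
    return std::thread::hardware_concurrency();
}
```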
`numactl` can be a problem. If we use `numactl --physcpubind=0-7` to test on a 32c64t PC, everyone's code will try to use 32/64 threads, which is very inefficient when there are only 8 cores.
Interesting, how does Java deal with that? Given that most of the implementations just use the system's processor count. I thought/hoped `numactl` somehow makes the other CPUs invisible.
In that case I may go the LXC way again. It's just a single line to create whatever sub-CPU config I need. I may even be able to export an image and put it on R2 for exact reproducibility.
I've just checked and `Environment.ProcessorCount` works fine with `numactl --physcpubind=0-7`; it correctly detects only 8 CPUs. The only issue is native code recompilation.
Interesting, how does Java deal with that?
They don't; basically on the Java leaderboard everyone is suffering some performance penalty.
But for the latest numbers above I just run `./main measurements_10k.txt` ~10 times and take the minimum result. Same with other people's code. Definitely not scientific, but it's more fun, and it also gives everybody better numbers. As you can see in the image below, the results are more than stable enough.
@lehuyduc BTW, I also check the hyperfine output with my eyes; when it's not very stable (e.g. switching from default to 10K while the page cache is not warm yet) I just re-run until sigma is low and stable. But since I report min/max/sigma in the full results, I believe the average on a quiet machine is a better metric, because it's not only CPU bound but RAM-bandwidth bound, especially on many cores. Some fluctuations are expected even with frozen CPU frequency. Also, JIT runs are visibly less stable than AOT, and sigma is an interesting metric in itself.
Hello @lehuyduc
Could you please check how I run your code from the bash scripts here: https://github.com/buybackoff/1brc-bench/blob/main/run.sh#L76
We discussed how to run with `12 12` vs `12 6`, but I think the relevant config was `12 12` vs `6 6`, i.e. on HT use every core. Then I see the difference.
For your code I always pass the number of threads as both parameters. It makes it much faster on my default machine: 2.6 sec with `12 12` as was published vs 2.95 with `6 6`.
I've created a repo with a bunch of scripts and now going to run it on big bare metal: https://github.com/buybackoff/1brc-bench/
@buybackoff oh, I'm not back from my vacation yet so I can't make those changes at the moment. You can use whichever best number you get, thanks! It's the last day anyway :D
Hi @buybackoff - thanks for the great article and running this unofficial leaderboard. Would you mind including my entry in your run? https://github.com/noahfalk/1brc/
It's quite speedy from my own benchmarking and I hope the results will be reproducible for you. I put some more info in the README on my repo, but if you have any questions about it feel free to reach out. Thanks!