buybackoff / 1brc

1BRC in .NET among fastest on Linux
https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-fastest-on-linux-my-optimization-journey/
MIT License
437 stars 43 forks

Last refresh #19

Closed pedrosakuma closed 7 months ago

pedrosakuma commented 7 months ago

Hello Victor,

I know it is very close to the official cutoff time. I submitted a PR on 1brc-bench with my repository.

Thanks in advance.

buybackoff commented 7 months ago

I know it is very close to the official cutoff time.

Actually, there is no official cutoff time for me. That even sounds weird to me. I'm not affiliated with the Java competition, and they were unwelcoming enough to make me not even try. What I liked about this challenge is that no one just blindly copied others' solutions and everyone kept their original designs to a large extent. For me the main constraint was that I understood the code and, given lots of free time, would have had a chance to come up with something similar. Or something obvious that I just forgot. These things I could reuse. But I would never reuse the SIMD simultaneous 4-number parsing by @noahfalk and similar stuff.

So my cutoff time is dictated by my own time constraints and the availability of big machines for the final benchmarking. I'm going on a three-day vacation tomorrow evening, but I will have time during the day tomorrow for extensive benchmarking. Feel free to update your solution, let's say before 14:00 UTC tomorrow, if it's not just a verbatim copy of someone's code. Same for @kpocza. Maybe even later. But I may lose access to the big machines after that vacation.

I'm still waiting for @RagnarGrootKoerkamp to update his fastest-but-not-compliant Rust code; the comparison would be very interesting anyway.

And @austindonisan has had some issues with the 10K dataset. I had to disable it 2 commits ago (tweak hash sizes... was failing on 4 cores). It looked OK as of the last commit yesterday for some combinations of cores/threads, but since it was disabled I could not run it through all combinations.

buybackoff commented 7 months ago

Two more things. Since this issue is called "Last refresh", some final touches remain.

I include repos as submodules here: https://github.com/buybackoff/1brc-bench/tree/main/src

I'm going to run many benchmarks tomorrow on 2 AMD / 2 Intel machines and AWS m6/7g.metal Graviton. I have two issues.

  1. License. @lehuyduc @austindonisan @dzaima @RagnarGrootKoerkamp @CameronAavik @kpocza @NimaAra @mtopolnik (Rust) do not have any license. It would be nice if every submodule in my repo had one of MIT/Apache/BSD. I'm reluctant to keep a proprietary submodule there, even if I understand that the intent was probably FOSS from the beginning.

  2. ARM compatibility. I believe native and some .NET solutions are out (FYI @xoofx and @nietras, it looks like you are fine on M1/2 Macs given the benchmarks Alexandre did on an M1 Pro. Should I expect this to work on Gravitons? Or was that Rosetta emulation?). I use only the Vector256 API, and that should be accelerated on ARM as well. I'm not sure about Rust, but from a quick search it looks like it should work. Anyway, if you do not use anything x64/x86-specific and could just use the cross-arch Vector or other APIs, that would be nice. Since Java does not have easy access to SIMD, and they had to reinvent the wheel with parallel int64 reads, I expect most Java solutions will just work on ARM; that's a positive side of a constrained environment. I would love to have more things to compare on ARM.

dzaima commented 7 months ago

Added a license to my repo (forgot to do so before!).

I could probably get an aarch64 port made without too much trouble, though we'll have to see if I get around to it today, tomorrow, or ever.

noahfalk commented 7 months ago

But I would never reuse the SIMD simultaneous 4-number parsing by @noahfalk and similar stuff.

In case you were worried on my behalf, I'm happy to have anyone reuse anything that appears useful from my entry :) Of course, as the curator of these cross-language 1BRC results, it's totally your call how distinctive you want each entry to be before you decide to include it.

NimaAra commented 7 months ago

I have added the MIT license now on master.

buybackoff commented 7 months ago

In case you were worried on my behalf, I'm happy to have anyone reuse anything that appears useful from my entry :) Of course, as the curator of these cross-language 1BRC results, it's totally your call how distinctive you want each entry to be before you decide to include it.

@noahfalk I was speaking for myself. Frankly, I did not have time to grok your solution in every detail, especially the SIMD 4-in-1 number parsing. But I could have copied your main loop and adapted it to my other parts without grokking its every detail. It would be of zero interest to me and would bring zero fun. Actually, on many cores, e.g. 64+, we sometimes become competitive. You consume memory so aggressively on each core that too many cores probably do not help (my ungrounded theory).

buybackoff commented 7 months ago

@noahfalk You use AVX directly a lot. Do you think it's feasible to just use Vector256? So that we may have a chance for reasonable perf on ARM?


noahfalk commented 7 months ago

You consume memory so aggressively on each core that too many cores probably do not help (my ungrounded theory)

Yeah, I could easily believe that. I never tested or tuned it for operation above 8 threads so it makes sense to me that it doesn't do as well there.

Do you think it's feasible to just use Vector256? So that we may have a chance for reasonable perf on ARM?

Some of those places probably are replaceable with Vector256 without changing the generated code, but unfortunately not all of them. In particular:

Probably the easiest option is to exclude my entry from any ARM benchmarks with a note that it used architecture specific intrinsics. If anyone wanted to try porting my approach to ARM I'm happy to have them do that but I don't plan to tackle it myself.

kpocza commented 7 months ago

License added.

ARM support is a good question for me. If the tests fail, I can check on my Raspberry Pi, if it supports .NET 8.

There is an extension for VS that can show the generated assembly code at the function level. You may use it to check what Avx2 code is generated for certain Vector operations (and other parts as well).

mtopolnik commented 7 months ago

I added the license as well.

austindonisan commented 7 months ago

I added a license, too. Some last-minute improvements as well, so please pull the latest copy.

I'm pretty confident it's bug-free and should work on the 10K data set now. The 10K set is only taking 2.4x as long now, which might make it competitive?

I don't expect it to work well on ARM, if at all. My entire codebase is centered around 256-bit vectors.

buybackoff commented 7 months ago

@NimaAra your code does not work in the bench setup, probably because the input file is a symlink there. See @kpocza fix for that: https://github.com/kpocza/1brc/commit/9ccfe6a0933e5a33cc76cd7122c5e7534e03643e

nietras commented 7 months ago

@xoofx and @nietras, it looks like you are fine on M1/2 Macs given the benchmarks Alexandre did on an M1 Pro. Should I expect this to work on Gravitons? Or was that Rosetta emulation?

Should work fine on ARM, no emulation. It's mostly cross-platform Vector256 (which is implemented on ARM as 2 x Vector128). My implementation has a fallback for when Avx2 is unavailable, and Avx2 is only used for a specific hash optimization. On ARM it likely won't run as well, since it is not optimized particularly for that.

NimaAra commented 7 months ago

Symlink support added to master albeit not tested.

buybackoff commented 7 months ago

Done! Try to digest 2540 files now :) https://github.com/buybackoff/1brc-bench/commit/41fb2bb8e42f5d54f298f684bdd820419841bd10

dzaima commented 7 months ago

Some basic formatting (mean time): original dataset, 10K

dzaima commented 7 months ago

@buybackoff with your added sudo to all the numactls in run.sh, I think the THREADS_1BRC= and THREADS= envvars don't make it to their respective programs? You'd need sudo -E or similar.

NimaAra commented 7 months ago

Interesting. @dzaima, is that showing that for 10K my "Safe" code is out-performing other solutions? That can't be right, or am I reading this wrong?

buybackoff commented 7 months ago

@NimaAra I re-ran your code and then updated the files, but obviously something old remained there.

buybackoff commented 7 months ago

@buybackoff with your added sudo to all the numactls in run.sh, I think the THREADS_1BRC= and THREADS= envvars don't make it to their respective programs? You'd need sudo -E or similar.

Right, I was wondering why your results were the same. Shit, it's too late to re-run it now. But anyone could run this on AWS metal themselves.

@dzaima you are the only one affected. How should we fix that? I may be able to re-run your and NimaAra's results on non-AWS machines.

NimaAra commented 7 months ago

Don't remove it! Let me bask in this empty victory a bit longer...😭

dzaima commented 7 months ago

I'd be fine with just removing non-8-thread timings of my entry within the current results (as my solution defaults to THREADS_1BRC=8).

buybackoff commented 7 months ago

@NimaAra Do not worry, I may start the benches but will not wait for the results today. No time, need to go 🚄⛷️

dzaima commented 7 months ago

Actually, with the numactl-limited data, we can keep all thread counts less than or equal to 8; so that's removing all these files, resulting in these staying

(edit: added bad file names in original dataset too)

buybackoff commented 7 months ago

@NimaAra actually, your stale results should only be in the 10K 20-runs benches; I forgot to delete them from there. The other numbers are fine.

buybackoff commented 7 months ago

@dzaima would it be difficult to aggregate by min value? Or throw away top/bottom 5 for 20 runs?

dzaima commented 7 months ago

The raw JSON data has only 3 (for 5 runs) or 5 (for 20 runs) entries in the times arrays, so I'm not exactly sure what can be aggregated from that or what they even represent. But here's the results by the "min" field: 1B, 1B_10K (also excluding my entries on >8-core tests)

dzaima commented 7 months ago

Oh, the run counts for 1B are 5 & 20, but 1B_10K has 3 & 5; the directory names are just wrong then.

buybackoff commented 7 months ago

Thanks! Yes, 20 runs are only for the default data; otherwise it's too long to wait.

buybackoff commented 7 months ago

There is logic to divide the default number of runs by 4 for 10K, but keep at least 3.
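That run-count rule is simple enough to sketch (illustrative Python; the function name is mine, not from the repo's scripts):

```python
def runs_for_10k(default_runs: int) -> int:
    # The 10K dataset takes much longer per run, so use a quarter
    # of the default run count, but never fewer than 3 runs.
    return max(3, default_runs // 4)

print(runs_for_10k(20))  # 5
print(runs_for_10k(5))   # 3
```

This matches the counts observed above: 20-run benches become 5-run for 10K, and 5-run benches become 3-run.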

dzaima commented 7 months ago

And here are, for tests with 20 runs, the averages computed without the best 5 & worst 5 times.
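A trimmed mean like that can be sketched as follows (a minimal illustration; dzaima's actual aggregation script may differ):

```python
def trimmed_mean(times, trim=5):
    # Discard the `trim` fastest and `trim` slowest measurements,
    # then average what remains; this is robust to outlier runs.
    if len(times) <= 2 * trim:
        raise ValueError("not enough samples to trim")
    kept = sorted(times)[trim:-trim]
    return sum(kept) / len(kept)

# For 20 runs this averages the middle 10 measurements.
print(trimmed_mean(list(range(1, 21))))  # 10.5
```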

kpocza commented 7 months ago

I see a different number of participants across the test cases. The maximum is 19 in the previous outputs, while the new tables show only 16-17 rows per test. Even the same machines have different numbers of participants: 6c6t has 17, while the rest have only 16. Why not have 19 participants in all new outputs?

buybackoff commented 7 months ago

@kpocza some take very long, especially with 20 runs or 10K. 5 runs (and 3 for 10K) is enough for the big picture, and everyone is included there.

dzaima commented 7 months ago

@kpocza I dropped my (dzaima) entries on tests with over 8 threads from my latest tables (as discussed above, it was capped to 8 threads so has meaningless/misleading data above that). That should be the only change in participant counts between my initial tables and the newer ones though. (the rest of the differences are just in the data; the initial tables had a varying number of participants too, just not within one CPU)

austindonisan commented 7 months ago

Thanks for all the numbers!

However, your tests on multi-socket systems (which I think is all except the Intel Raptor Lake) aren't valid. The page cache is split across NUMA nodes, and where that data lands is decided by how the 1st program chunked its data. Later programs then just have to deal with that, loading the file across the CPU interconnect instead of directly from RAM.

To test this fairly you need to flush the page cache between each competitor's runs (including every time the core count changes).

Also, I'm fairly certain my hyper-threaded results are cheating. My code blindly calls sched_setaffinity(), which overrides numactl. I can fix that with some more command-line parameters if anybody cares.
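The override described here can be reproduced from Python on Linux, whose os.sched_setaffinity wraps the same syscall (a sketch of the misbehavior, not of austindonisan's actual code):

```python
import os

# The affinity mask inherited from numactl/taskset (if any).
inherited = os.sched_getaffinity(0)

# A "badly behaved" program overrides it unconditionally,
# pinning itself to CPU 0 no matter what the outer policy was.
os.sched_setaffinity(0, {0})
assert os.sched_getaffinity(0) == {0}

# Restore the inherited mask.
os.sched_setaffinity(0, inherited)
```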

The other side of this is that my code was not designed to be run on consumer Intel chips. It doesn't verify the processor layout and assumes that processor numbers are contiguous and not interleaved. So on the Raptor Lake (i5-13500) test 6c/6t is actually 3c6t for my code. I don't care and it's certainly just because I was lazy and not your fault at all.

buybackoff commented 7 months ago

@austindonisan you are already the winner. But could you help make this repo truly reproducible? I'm not a Bash expert and do not want to be one, frankly. This repo is reasonable, I believe. I even wanted to exclude results when they go to another socket. But I'm listening...

austindonisan commented 7 months ago

@austindonisan you are already the winner. But could you help make this repo truly reproducible? I'm not a Bash expert and do not want to be one, frankly. This repo is reasonable, I believe. I even wanted to exclude results when they go to another socket. But I'm listening...

Multi-socket tests really do suck. I wasted so much time trying to understand why my code was slow when testing on GCP instances.

For all of these you need to make sure the machine has at least 48GB of RAM, so each socket has its own 24GB.

Limiting tests to a single socket means you wouldn't have to flush the page cache. But you have to make sure that there aren't any badly behaved implementations (e.g. mine) that ignore numactl and will use the 2nd socket when you're just trying to force hyperthreading on the 1st.

The easier fix is to just drop the caches and reload the file. If it's run as root, it will be as simple as sprinkling in a few "echo 3 > /proc/sys/vm/drop_caches". But then the file has to be read back in from disk, so if these machines have HDDs that will be quite painful (~90 seconds).

I'll take a look at your test file and see how easy it is to modify.

I'm not too concerned with winning; it's just not very satisfying to see high, effectively arbitrary numbers for everybody at the high core counts, dominated by artificial memory limits (bandwidth is effectively cut 35% for one socket and 75% for the other). The Zen3 (7R1348) tests clearly had bad memory placement for everybody, while the Zen4 (9374F32) test seems to have been good for just my code. Or maybe there's something else going on with AVX-512, but it's impossible to know.

One last comment: it would be nice to include a test with a Zen2 CPU (the contest target). Your AMD tests are only on Zen3 and Zen4.

austindonisan commented 7 months ago

I went through your scripts and it's definitely not trivial to change. Ripping out the tmpfs was OK, and clearing the caches before each run above 1/2 of max CPUs is OK, but refilling the cache on each run becomes painfully slow. Also, 1 warm-up run isn't sufficient, but that needs to be tweaked in a different file.

Overall I'd say not worth the effort. I did go and fix my program to listen to numactl, but it's unfortunate I didn't do that earlier.

Thanks again for putting this together! The 10K results were the most surprising, so I'm glad you ran those. I 100% prioritized the standard dataset and was expecting my performance on the 10K set to be awful.

buybackoff commented 7 months ago

@austindonisan thank you very much for such deep insights here. From your comments I infer that results on a single socket are OK as is. If there's a memory slowdown, it's the same for everyone. Is that right?

CameronAavik commented 7 months ago

Have added MIT License to my solution now

austindonisan commented 7 months ago

@austindonisan thank you very much for such deep insights here. From your comments I infer that results on a single socket are OK as is. If there's a memory slowdown, it's the same for everyone. Is that right?

The tests where everybody's threads were supposed to run on a single CPU might be fair enough, if the servers had at least 128GB of RAM and you ran the tests without any faffing (i.e. didn't load more than 1 copy each of 1B.txt and 1B_10K.txt).

The results are at best bi-modal because of your tmpfs usage. Normally the files would have been loaded into node 0's memory on the first run, since the tests were scheduled to run on cores 0-7. But since use_input.sh is doing the memory loading, without numactl, it's 50-50 which node "cp" will run on and therefore which node the memory will be loaded into.

Additionally, copying a file into tmpfs also loads it into the page cache, duplicating its memory footprint. That means the servers needed at least 128GB of RAM to guarantee no weird effects (128GB -> 64GB per node -> ~30GB of text files duplicated in page cache and tmpfs).

There is the problem of the actual binaries being loaded into memory on the wrong node, but I didn't test the size of that effect.

I ran the tests for my code on a 112-core (2 x 56-core) machine, explicitly loading the measurements into Node 0, Node 1, and Split (where the file is loaded naturally). The first set of numbers is for 1B.txt, the second for 1B_10K.txt.

                 1B.txt                       1B_10K.txt
C    T      Node 0    Node 1  Split      Node 0  Node 1  Split
8    8      617       869     -          1625    2009    -
8    16     488       770     -          1542    1920    -
16   16     323       552     -          821     1047    -
16   32     256       528     -          860     1032    -
32   32     168       330     -          431     559     -
32   64     (missed)  352     -          574     624     -
56   56     146       222     -          309     354     -
56   112    163       236     -          456     461     -
112  112    166       173     89         249     257     209
112  224    189       191     114        374     392     322

Some interesting things:

buybackoff commented 7 months ago

If the servers had at least 128GB of RAM

All multi-socket machines had at least 192 GB (c5.metal, which is the smallest).

But since use_input.sh is doing the memory loading, without numactl, it's 50-50 which node "cp" will run on and therefore which node the memory will be loaded into.

On non-AWS machines the Bash itself was limited by taskset to the first 32 cores. That's why I had to add --all to numactl.

buybackoff commented 7 months ago

Thanks a lot, everyone! I've learned so much from every repo. I hope to find time soon to finish the blog post and to highlight some of the most audacious ideas, like reading 8 rows in parallel per core 😄 The single-core parallelism was the biggest learning outcome for me, especially given that the Java guys eventually managed to be faster over the last two days just by using instruction-level parallelism without SIMD.

Also thanks to @artsiomkorzun. I stopped at some point, but then there was no reason why Java without SIMD should be so much faster, so I got some insights from your code and found some glaring oversights in mine. Yet even with 10K on Intel it was tight. Why it's so fast on AMD remains a mystery to me 🙄

If we take every run grouped by [AMD/Intel]/Threads_Mix/Dataset, set the fastest one to 100% (or 1 sec), scale everything else accordingly, and then average by threads and then by CPU vendor, the resulting numbers should be quite fair. The Default/10K separation should remain. This is what I want as the final result.
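That scaling scheme can be sketched like this (illustrative Python; the data shape is assumed, and the two-level averaging by threads and then vendor is flattened into a single average for brevity):

```python
from collections import defaultdict

def scale_and_average(results):
    # results: {(vendor, threads, dataset): {entry: seconds}}
    scores = defaultdict(list)
    for times in results.values():
        best = min(times.values())           # fastest entry becomes 1.0 (100%)
        for entry, seconds in times.items():
            scores[entry].append(seconds / best)
    # Average each entry's scaled scores across all groups it appears in.
    return {entry: sum(s) / len(s) for entry, s in scores.items()}

results = {
    ("AMD", 8, "default"):   {"a": 2.0, "b": 1.0},
    ("Intel", 8, "default"): {"a": 1.5, "b": 4.5},
}
print(scale_and_average(results))  # {'a': 1.5, 'b': 2.0}
```

An entry that wins one group (score 1.0) can still average worse overall if it trails badly elsewhere, which is the point of scaling before averaging.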

Good luck, that was fun!

noahfalk commented 7 months ago

Thanks for all the work you put in benchmarking and publishing everything @buybackoff! It was nice to have a place where even the non-Java entries felt like they belonged.

buybackoff commented 3 months ago

I've finalized my blog post (link in the readme).

Raw and averaged/scaled results can be viewed online at https://onedrive.live.com/view.aspx?resid=B64A7F98F035FC43%21359680&authkey=!AHgAJ48FKqWaUwE or downloaded from this repo as xlsx.