RagnarGrootKoerkamp / 1brc


Support city names up to 100 bytes #2

Closed buybackoff closed 7 months ago

buybackoff commented 10 months ago

https://github.com/gunnarmorling/1brc/blob/main/create_measurements3.sh

[screenshot]

RagnarGrootKoerkamp commented 10 months ago

Is this one of the tests with long city names? Yes those won't work. Not yet anyway; I may support them eventually but for now I don't really care.

buybackoff commented 10 months ago

The specs say up to 100 bytes. If it does not work with that it's not valid.

RagnarGrootKoerkamp commented 10 months ago

Well I state my assumptions at the top of the readme and blogpost. I'll first push for max performance and may make it pass tests at some later point if I feel like it. Developing is faster if I don't have to care about correctness for now.

buybackoff commented 10 months ago

If it doesn't work, it doesn't matter how fast it doesn't work 😉

It's very convenient to limit an entire line to one AVX vector... Definitely not valid for comparison with other conformant implementations ATM.
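For readers unfamiliar with the trick being referenced, here is a minimal sketch, assuming AVX2 and a line that fits in 32 bytes, of locating a row's separators with a single vector load. This is illustrative only, not the code from this repository:

```rust
// Hypothetical sketch: locate ';' and '\n' within one 32-byte AVX2 register.
// Assumes the caller has checked AVX2 is available and that 32 readable bytes
// start at the beginning of the line (e.g. the buffer is padded).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn find_separators(line: &[u8; 32]) -> (u32, u32) {
    use std::arch::x86_64::*;
    let v = _mm256_loadu_si256(line.as_ptr() as *const __m256i);
    let semi = _mm256_movemask_epi8(_mm256_cmpeq_epi8(v, _mm256_set1_epi8(b';' as i8))) as u32;
    let nl = _mm256_movemask_epi8(_mm256_cmpeq_epi8(v, _mm256_set1_epi8(b'\n' as i8))) as u32;
    // Index of the first ';' and the first '\n' in the window (32 if absent).
    (semi.trailing_zeros(), nl.trailing_zeros())
}
```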

RagnarGrootKoerkamp commented 10 months ago

Well, I could fix this. But even then, perfect hashing assumes that all city names appear in some reasonable prefix of the data.

Sure, I could make the code pass the tests just by adding an if statement when I see a long city name and then running completely different code, but that's also lame...

buybackoff commented 10 months ago

Sure, I could make the code pass the tests just by adding an if statement when I see a long city name and then running completely different code, but that's also lame...

Please do. At the moment you are around the fastest on the default data. But not conforming to the specs makes the comparison invalid.

RagnarGrootKoerkamp commented 10 months ago

Ok I think this should be fixed now. Could you test again?

(But I wouldn't be surprised if there are some other things that fail. In that case I'll just generate the files myself and test locally.)

tivrfoa commented 10 months ago

Ok I think this should be fixed now. Could you test again?

(But I wouldn't be surprised if there are some other things that fail. In that case I'll just generate the files myself and test locally.)

I just ran your program, and it prints wrong averages.

I'm testing with only 200 million rows, created with create_measurements.sh

Correct is on the left:

[screenshot]

RagnarGrootKoerkamp commented 10 months ago

Oh, indeed. There was a bug in my verification script, so I missed this. The factor-of-2 error is fixed now. I did another optimization that I thought was correct, but it turns out that the temperature for Baghdad is now off by 0.1. I may revert that change at some point if needed.

buybackoff commented 10 months ago

@RagnarGrootKoerkamp could you tell, as a gentleman, if your current version is compliant to the best of your knowledge? Or at least if existing bugs won't affect performance by more than 10s of millis?

If so, you are No 1 with a huge margin. I will trust your word 😄 (but that will be public soon)

RagnarGrootKoerkamp commented 10 months ago

I'm pretty sure it's completely not compliant. I'm pretty sure I could make it pass all tests in a somewhat 'clean' way without losing too much performance, but even then, I'm violating these rules:

No external library dependencies may be used

I use some small, insignificant utility libraries, which is fine IMO, but I also use (my own) crate for perfect hashing. That in itself is quite a large codebase. On the other hand, I'm only using a small part of it and may inline it so I can optimize it for the specific dataset here.

But more importantly:

Implementations must not rely on specifics of a given data set, e.g. any valid station name as per the constraints above and any data distribution (number of measurements per station) must be supported

I currently rely a lot on the fact that each row contains a random city name. I can't handle cases where, say, a city only starts appearing somewhere halfway through the data. I don't think any of the current test cases test this, but still. There is a very big difference between passing all tests and having code that can reliably handle any dataset within these constraints.
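As a hedged illustration of that assumption (the function and cutoff below are made up for the sketch, not the actual code): collect the station names from a fixed prefix of the input and build the lookup structure only over that set, which breaks as soon as a name first appears after the prefix.

```rust
use std::collections::HashSet;

// Illustrative only: gather the distinct station names seen in the first
// `prefix_lines` rows. A perfect hash built over this set cannot index a
// station that first appears after the prefix.
fn station_names_in_prefix(data: &[u8], prefix_lines: usize) -> HashSet<Vec<u8>> {
    let mut names = HashSet::new();
    for line in data.split(|&b| b == b'\n').take(prefix_lines) {
        if let Some(sep) = line.iter().position(|&b| b == b';') {
            names.insert(line[..sep].to_vec());
        }
    }
    names
}
```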

See also this discussion.

I'm also relying on the fact that cities are uniquely determined by their first and last 8 characters. Thus, cities AAAAAAAAXAAAAAAAA and AAAAAAAAYAAAAAAAA currently are indistinguishable to my code.
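A minimal sketch of what that simplification means (illustrative, not the repository's actual key derivation): the name is reduced to its zero-padded first and last 8 bytes before hashing, so names that agree on both halves collide by construction.

```rust
// Illustrative key derivation: only the first 8 and last 8 bytes of the name
// survive, so names agreeing on both halves become indistinguishable.
fn name_key(name: &[u8]) -> (u64, u64) {
    let mut head = [0u8; 8];
    let mut tail = [0u8; 8];
    let h = name.len().min(8);
    head[..h].copy_from_slice(&name[..h]);
    let start = name.len().saturating_sub(8);
    tail[..name.len() - start].copy_from_slice(&name[start..]);
    (u64::from_le_bytes(head), u64::from_le_bytes(tail))
}

// name_key(b"AAAAAAAAXAAAAAAAA") == name_key(b"AAAAAAAAYAAAAAAAA"):
// both share the first 8 bytes "AAAAAAAA" and the last 8 bytes "AAAAAAAA".
```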

Everything is kinda in a complicated state: of course I could also do some detection of the number of cities and their properties, and if they look enough like the eval set, choose the fast path; otherwise, choose a slower fallback. It's somewhat unfortunate the eval set is a single file with very specific properties. It would have been better if the time was the sum over a number of completely different types of input, but that is not the case, so I am just squeezing everything I can out of this one eval input file.

buybackoff commented 10 months ago

Oh, alright. I wondered what magic you'd applied for the 10K dataset that makes it just as fast...

Understood. Thank you, Sir!

RagnarGrootKoerkamp commented 10 months ago

Ah I found your unofficial leaderboard now :)

I really should try to make my solution a bit more compliant to the point where I think it's fair to include it.

Either way, things are around 30% faster now compared to my last message here, so I'm curious how it performs in your benchmarks compared to @lehuyduc.

You may want to try both -j6 and -j12. In my evals it's actually faster to disable hyperthreading, but given that it is enabled on the machine, probably -j12 is better.

buybackoff commented 10 months ago

That leaderboard is quite stale. It would be great if you could make it work with the 10K data by around midday tomorrow. It would be even better if the solution remains fast with that data.

RagnarGrootKoerkamp commented 10 months ago

I don't think I'll manage by tomorrow afternoon, but maybe I can come up with and implement some not-super-intrusive approach by the end of the day.

(I also reverted the optimization that produced incorrectly rounded output. It doesn't really matter enough for performance after all.)

tivrfoa commented 10 months ago

@RagnarGrootKoerkamp could you tell, as a gentleman, if your current version is compliant to the best of your knowledge? Or at least if existing bugs won't affect performance by more than 10s of millis?

If so, you are No 1 with a huge margin. I will trust your word 😄 (but that will be public soon)

There are new fastest versions in 1brc. The fastest Java is now 00:02.575.

I would also love to see how this Rust solution compares to the Java ones, but following the same rules. lehuyduc's C++ solution follows the rules too.

buybackoff commented 10 months ago

Oh, how do you do that!?

I see multiple commits late yesterday. Are we compliant yet?

RagnarGrootKoerkamp commented 10 months ago

Nope not compliant yet. I'll let you know.

The speedup yesterday is mostly from much better ILP by processing multiple entries in parallel. Instructions per cycle went from 2.2 to 3.
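Roughly, the idea looks like this hedged sketch (hypothetical code, not the actual implementation): one thread walks several independent regions of the buffer in lockstep, so the work of parsing one record can overlap with the work on the others.

```rust
// Hypothetical sketch of the ILP idea: give one thread several independent
// regions of the input and advance one record in each region per loop
// iteration, so the CPU can overlap the parsing work across regions.
fn count_records_interleaved(data: &[u8]) -> usize {
    const LANES: usize = 4;

    // Region boundaries, snapped forward past the next newline so records
    // are never split across regions.
    let mut bounds = [data.len(); LANES + 1];
    bounds[0] = 0;
    for i in 1..LANES {
        let guess = data.len() * i / LANES;
        bounds[i] = match data[guess..].iter().position(|&b| b == b'\n') {
            Some(p) => guess + p + 1,
            None => data.len(),
        };
    }

    let mut cursors = [0usize; LANES];
    for lane in 0..LANES {
        cursors[lane] = bounds[lane];
    }

    let mut records = 0;
    loop {
        let mut progressed = false;
        // The lane bodies have no data dependence on each other, which is
        // what raises instructions-per-cycle.
        for lane in 0..LANES {
            if cursors[lane] < bounds[lane + 1] {
                let rest = &data[cursors[lane]..bounds[lane + 1]];
                let end = rest.iter().position(|&b| b == b'\n').map_or(rest.len(), |p| p + 1);
                // A real solution would parse the record in rest[..end] here.
                records += 1;
                cursors[lane] += end;
                progressed = true;
            }
        }
        if !progressed {
            break;
        }
    }
    records
}
```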

buybackoff commented 10 months ago

Look what I've done here: https://hotforknowledge.com/2024/01/13/7-1brc-in-dotnet-even-faster-than-java-cpp/

You would be the first overall.

RagnarGrootKoerkamp commented 10 months ago

Cool 🤩 competition is pretty close actually, so hopefully making it compliant isn't too costly 😅

Just a small note that for compiled languages it really isn't about 'Rust is faster than C++'. Any code I write could also be written in C++, and pretty surely I could reimplement the C++ code in Rust, and they would perform similarly.

lehuyduc commented 10 months ago

@buybackoff Thanks for the benchmark! Just want to check, what commit did you use for my code?

buybackoff commented 10 months ago

@lehuyduc It's 3c42a14a9902a13b4e90cbb5bda2efa9602cc0aa

The table shows the cut-off time for all repos in the top-left corner, except for nietras, who published his results while I was proofreading.

buybackoff commented 10 months ago

Just a small note that for compiled languages it really isn't about 'Rust is faster than C++'. Any code I write could also be written in C++, and pretty surely I could reimplement the C++ code in Rust, and they would perform similarly.

I bet .NET could handle most of your tricks. But you know, comparing languages is a catchy, flame-war-igniting thing.

lehuyduc commented 10 months ago

@buybackoff I'd like to make a request, if allowed. Later, when @RagnarGrootKoerkamp submits the compliant version, can you run our programs with hyper-threading off? I want to investigate HT performance on Intel vs AMD, so another data point would be great. His code is especially interesting because it seems to perform better without HT. Thanks!

RagnarGrootKoerkamp commented 10 months ago

Btw, did you run my code on the 10k city names input at all? I do think it should work. (Not at my PC right now, so I can't generate and test myself.) If it works, I'm curious how fast it is.

buybackoff commented 10 months ago

Btw, did you run my code on the 10k city names input at all? I do think it should work. (Not at my PC right now, so I can't generate and test myself.) If it works, I'm curious how fast it is.

Yes I tried. I even mentioned that above:

Oh, alright. I wondered what magic you'd applied for the 10K dataset that makes it just as fast...

It was almost as fast as the default, which was suspicious. So I asked you a question, several times. I link to this discussion in my post. And in the table I say "Explicitly non compliant", and later that you are very likely going to be the top overall.

I will update later when you are there. Good luck!

buybackoff commented 10 months ago

@buybackoff I'd like to make a request, if allowed. Later, when @RagnarGrootKoerkamp submits the compliant version, can you run our programs with hyper-threading off? I want to investigate HT performance on Intel vs AMD, so another data point would be great. His code is especially interesting because it seems to perform better without HT. Thanks!

@lehuyduc Maybe some time later. And only if that may impact the final result.

Or if you tell me how to do that without physical access to Proxmox. It's just very inconvenient for me to connect keyboard/mouse/monitor to enter into BIOS.

RagnarGrootKoerkamp commented 10 months ago

Ah, I misread that, thanks for clarifying. It doesn't surprise me that the big dataset is nearly as fast as the small one: the only thing that changes is that the backing array of size 800 then has size 20k, so it'll be in L3 cache instead of L1 and reads will have higher latency, but other than that, having more cities shouldn't matter too much.

(In fact it kinda surprises me this affects other solutions quite a bit.)

lehuyduc commented 10 months ago

(In fact it kinda surprises me this affects other solutions quite a bit.)

There are 2 main reasons, boiling down to optimizing for the original 413 station names:

* Compliant code has to parse the whole 100 key characters (likely using slow code and assuming it rarely happens), while this Rust commit assumes each key is uniquely identified by its first 8 and last 8 characters (which is wrong, but practically true), so that's at least an 82% lower hashing cost.

* Most solutions choose their hash style to have low or zero collisions with the original 413 station names. So they sacrifice general quality for speed in that specific case.

Is it possible to build a perfect hash while being compliant with the spirit of the rules?

RagnarGrootKoerkamp commented 10 months ago

For hashing, I may switch to something based on some AES encryption instructions, inspired by GxHash.
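Something along these lines, as a rough sketch assuming AES-NI (the function and round-key constants are made up; this is neither GxHash nor the code in this repo):

```rust
// Hypothetical sketch: mix a 16-byte block with two AES rounds, in the style
// of GxHash-like hashes, and take 64 bits of the result.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "aes")]
unsafe fn mix16(block: &[u8; 16]) -> u64 {
    use std::arch::x86_64::*;
    // Arbitrary constants used as round-key material.
    let key = _mm_set_epi64x(0x9E37_79B9_7F4A_7C15u64 as i64, 0xC2B2_AE3D_27D4_EB4Fu64 as i64);
    let mut state = _mm_loadu_si128(block.as_ptr() as *const __m128i);
    state = _mm_aesenc_si128(state, key);
    state = _mm_aesenc_si128(state, key);
    _mm_cvtsi128_si64(state) as u64
}
```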

For detecting hash collisions, I plan to store a 32- or 64-bit hash in each record. If a new city appears and its hash doesn't match, I can rebuild the perfect hash function without problems.

So then the only remaining problem is that keys could have hash collisions, but even for 32-bit hashes you need around 2^16 = 64k different city names before collisions become likely (the birthday bound). I would consider this within the spirit of the rules of the challenge, and if not, going to 64-bit hashes certainly has a vanishingly small collision probability.
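A hedged sketch of that verification scheme, with made-up types rather than the repository's actual API: each slot stores the full hash of the name it was built for, and a mismatch sends the caller to a slow path that rebuilds the perfect hash over the enlarged key set.

```rust
// Hypothetical types for illustration; the real solution uses its own layout.
struct Record {
    name_hash: u64, // hash of the city name this slot was built for
    count: u64,
    sum: i64, // temperatures stored in tenths of a degree
    min: i16,
    max: i16,
}

struct Table {
    records: Vec<Record>,
}

impl Table {
    /// `slot` is what the perfect hash function returns for the row's name;
    /// `name_hash` is the full 64-bit hash of that name. Returns false when
    /// the slot disagrees, i.e. a name the perfect hash wasn't built for:
    /// the caller then rebuilds the perfect hash and reprocesses the row.
    fn update(&mut self, slot: usize, name_hash: u64, temp_tenths: i16) -> bool {
        let r = &mut self.records[slot];
        if r.name_hash != name_hash {
            return false;
        }
        r.count += 1;
        r.sum += temp_tenths as i64;
        r.min = r.min.min(temp_tenths);
        r.max = r.max.max(temp_tenths);
        true
    }
}
```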

(Also, I thought that the measurements3 file contained only short names, but it seems that I was wrong. Indeed, having some longer names in there will make things slower for sure.)

lehuyduc commented 10 months ago

The author has this test case if you want to stress test hash collision: https://github.com/gunnarmorling/1brc/blob/main/src/test/resources/samples/measurements-10000-unique-keys.txt

buybackoff commented 10 months ago

I would consider this within the spirit of the rules of the challenge, and if not, going to 64-bit hashes certainly has a vanishingly small collision probability.

Personally, I do not like the original rules. I would see how many weather stations actually exist in the world and use that. And then both data < RAM and data >> RAM, e.g. 100 GB on a 32 GB RAM machine.

lehuyduc commented 10 months ago

Personally, I do not like the original rules. I would see how many weather stations actually exist in the world and use that.

Actually, I did exactly that at the start too, and built a perfect hash function by trying a bunch of prime numbers 😂 then I switched to generic code later.

tivrfoa commented 10 months ago

(In fact it kinda surprises me this affects other solutions quite a bit.)

There are 2 main reasons, boiling down to optimizing for the original 413 station names:

* Compliant code has to parse the whole 100 key characters (likely using slow code and assuming it rarely happens), while this Rust commit assumes each key is uniquely identified by its first 8 and last 8 characters (which is wrong, but practically true), so that's at least an 82% lower hashing cost.

* Most solutions choose their hash style to have low or zero collisions with the original 413 station names. So they sacrifice general quality for speed in that specific case.

Is it possible to build a perfect hash while being compliant with the spirit of the rules?

@lehuyduc your solution had a big drop in the 10k case. @buybackoff's was only 2.13 times slower; yours was 5.25 times slower.

Do you think you can make some improvements for the 10k? I hope you can =)

lehuyduc commented 10 months ago

@tivrfoa sure, it's possible, but at the cost of the original case, so it's not the goal of this contest. I can just swap the hash map for another one and increase the max capacity.

RagnarGrootKoerkamp commented 10 months ago

I've fixed my code so that it should at least handle the large version without bugs. For me it runs in 1.90s, compared to 1.10s for the small dataset. Most of this difference seems to be just from branch misses.

I have some code locally for detecting hash collisions/new city names (5% slowdown), and also for better hashing of city names (+20% slowdown; need to improve this), but these need some cleaning up before I push them.

tivrfoa commented 10 months ago

@tivrfoa sure, it's possible, but at the cost of the original case, so it's not the goal of this contest. I can just swap the hash map for another one and increase the max capacity.

I guess the contest will run the 10k for the top 10 places, so it's important to perform well there too.

buybackoff commented 10 months ago

I've fixed my code so that it should at least handle the large version without bugs.

I guess we have the winner! The margin is so high 😮

lehuyduc commented 10 months ago

@buybackoff Yup, I don't see any other solution coming close to the Rust solution. But "without bugs" doesn't mean compliant, though; right now there are 2 important assumptions (all keys appear in the first 1M lines => build perfect hash => no need for collision checks; only the first and last 8 characters of keys matter, the rest are discarded). So if we really want to be strict, it won't pass testing (Codeforces people try to "hack" others' solutions as a tradition 🤣).

But he's working on a compliant one. I'm waiting for his hash map because I need to borrow that for another contest :D

buybackoff commented 10 months ago

The new results are here: https://github.com/xoofx/Fast1BRC/issues/1#issuecomment-1890874512

Yup, I don't see any other solution coming close to the Rust solution.

So should I include the results already?

I may add a 100-byte random name at the very end of the big file. That would be a fair, simple test case.

lehuyduc commented 10 months ago

From @RagnarGrootKoerkamp's comments above, there are 2 test cases where it fails:

I currently rely a lot on the fact that each row contains a random city name. I can't handle cases where, say, a city only starts appearing somewhere halfway through the data.
I'm also relying on the fact that cities are uniquely determined by their first and last 8 characters. Thus, cities AAAAAAAAXAAAAAAAA and AAAAAAAAYAAAAAAAA currently are indistinguishable to my code.

I think you can just wait a bit because his fix is coming. It won't affect ranking anyway, and with the fixed code nobody can question the result :p

Actually, my solution is also bugged somewhere 🤣 You should mark it as such; I'll update later. I wonder how many bugs we could find if we stress-tested everyone's code: https://github.com/lehuyduc/1brc-simd/issues/4

Edit: oh, the tester made a mistake. My code was actually still correct 🥳

RagnarGrootKoerkamp commented 10 months ago

Alright, pushed the fixes to https://github.com/RagnarGrootKoerkamp/1brc/tree/compliant so you can time and include them :)

Will merge after I fix those.

RagnarGrootKoerkamp commented 10 months ago

Ah, so apparently storing hashes of cities instead of the actual names is not OK.

https://twitter.com/nietras1/status/1746162801729564812?t=HiDNrbLli0QYwJRPzlm9RQ&s=19

buybackoff commented 9 months ago

Hi @RagnarGrootKoerkamp

I cannot build your repo using the latest nightly anymore: [screenshot of the build error]

And since we are so close to the end, is your compliant branch good enough, or at least finished enough for tests?

RagnarGrootKoerkamp commented 9 months ago

Ah yes, these functions were stabilized under a different name. Will fix tomorrow afternoon.

The compliant branch is still not officially within the rules, since it still stores and compares hashes of the elements rather than the strings themselves.

buybackoff commented 9 months ago

@RagnarGrootKoerkamp any updates? I really want to run your code, even with an asterisk and in the non-compliant state. I just need it to be buildable from the current master with the latest nightly in an automatic way: checkout/pull -> cargo build.

RagnarGrootKoerkamp commented 7 months ago

As things go, my attention shifted before completing things. I'll just close this now 😅

buybackoff commented 7 months ago

Everyone's attention to 1BRC dropped to near zero right after the deadline. I'm having a hard time forcing myself to finalize the blog post 😅

RagnarGrootKoerkamp commented 7 months ago

Yeah it's unfortunate.

I kinda lost motivation after seeing some pushback on methods that work with high probability. I really wanted to use my minimal perfect hash function, but the rules basically force one to use/reimplement a proper hash function, and then special-case everything for the string sizes in the two input sets, which is exactly what I was avoiding from the start 😅

Good luck wrapping the post; hope you find some motivation!