apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.56k stars 3.54k forks source link

[Discuss][C++] Replace MemoTable with a SwissTable implementation #38372

Open js8544 opened 1 year ago

js8544 commented 1 year ago

Describe the enhancement requested

This is a follow up of https://github.com/apache/arrow/issues/36059#issuecomment-1592037527. There are many cases where we use a MemoTable, e.g. set lookup functions, vector hash functions, the count_distinct aggregate function, dictionary unification etc. Their performance can be boosted with a SwissTable.

We can either:

  1. Use an existing swiss table library. This requires some work on dependency management. I recommend absl::flat_hash_map since they are the original author of swiss tables and we already has absl in our 3rd party toolchain.
  2. Write one by ourselves. The current one in acero is too customized for the join node and it seems hard to extract a general hashtable from it. If I were to write one I would probably follow the structure of absl's and replace things like memory management and bit tweaking with Arrow ones.

What do you think? @pitrou @westonpace

Component(s)

C++

pitrou commented 1 year ago

I don't know what a "Swiss table" is. Can you point to a general description?

pitrou commented 1 year ago

More generally, I'm all for a faster associative table implementation.

mapleFU commented 1 year ago

Maybe we should first having an memory-usage / performance benchmark on this?

js8544 commented 1 year ago

There's an Abseil's article and a CppCon talk that provide a good intro. The key idea is, when looking up a key via linear probing, it can check multiple slots at once.

pitrou commented 1 year ago

Ah, so it's a hash table with SIMD-optimized lookup?

js8544 commented 1 year ago

Maybe we should first having an memory-usage / performance benchmark on this?

Okay, I'll try to draft a proof of concept with the is_in kernel.

pitrou commented 1 year ago

If you're going to use SIMD lookups, then I would ask that it relies on xsimd instead of intrinsics.

js8544 commented 1 year ago

Ah, so it's a hash table with SIMD-optimized lookup?

Well it's a speedup even without SIMD instructions. To put it in a oversimplified term each slot occupies 1 byte so with int64 we can already check 8 slots at the same time. But SIMD can definitely make it even faster.

If you're going to use SIMD lookups, then I would ask that it relies on xsimd instead of intrinsics.

Sure, if we decide to write our own version.

pitrou commented 1 year ago

Well it's a speedup even without SIMD instructions. To put it in a oversimplified term each slot occupies 1 byte so with int64 we can already check 8 slots at the same time. But SIMD can definitely make it even faster.

I see, thanks for the explanation.

Sure, if we decide to write our own version.

Well, we don't want to depend on Abseil, so our own version it should probably be :-)

llama90 commented 1 year ago

There's an Abseil's article and a CppCon talk that provide a good intro. The key idea is, when looking up a key via linear probing, it can check multiple slots at once.

Thank you for explaining. I was also curious about what Swiss Join is too. :)

js8544 commented 1 year ago

I wrote a simple proof of concept for IndexIn [1] with a slightly tweaked version of absl::flat_hash_map. (Tweak is [2], a couple of operations can be skipped because our MemoTable doesn't need to support deletion). And the existing IndexIn benchmarks (results below) show about 20% ~ 60% improvement depending on the value_set size, the larger the better.

The benchmarks are run on my Mac M1 so the results may vary on a x86 machine. But I think the diff is compelling enough for us to proceed with a Swiss table implementation.

[1] https://github.com/apache/arrow/compare/main...js8544:jinshang/swiss_proof_of_concept?diff=split#diff-d288c253c3a5959aa1b6840a649a8dafe42d14ac49d1837b19dd3a44c58f7fd9R93 [2] https://github.com/apache/arrow/commit/7d14d10d67826f4be7588b8b36d7aef7655dbcb3

Benchmark Results

Before

image

After

image
pitrou commented 1 year ago

Can you post benchmark numbers on string values? While hashing integers is nice, I'm not sure that's a dominant use case.

js8544 commented 1 year ago

Sure. I chose Int32 because it was the only one tested with a large value set. I just wrote a similar set of benchmarks for strings and here's the result.

Before

image

After

image
pitrou commented 1 year ago

Ok. I agree we could replace MemoTable with something better. However, we don't want to have a dependency on Abseil, so it will have to be reimplemented.

js8544 commented 1 year ago

Ok. I agree we could replace MemoTable with something better. However, we don't want to have a dependency on Abseil, so it will have to be reimplemented.

Yes. I'll implement one using xsimd as you suggested. Also the swiss table may be worse than MemoTable on super small tables, since it checks 8 slots at once and reserves at least 8 slots on construction, as shown in the first two string benchmarks. We'll need further benchmarking to decide which hashtable to use in each individual use cases, or maybe with a dynamic dispatching mechanism.

mapleFU commented 8 months ago

I guess this might benifit the parquet dictionary encoding, which is widely used in parquet. If nobody move forward, I'll have a try next month

js8544 commented 8 months ago

Yeah I'm afraid I won't have time to do this recently. Please have a try if you wish. Thanks!

zanmato1984 commented 8 months ago

Will the swiss table currently being used in acero's hash join (aka. swiss join) help on this?

js8544 commented 8 months ago

The swiss table for hash join is very specialized and kind of hard to generalize. It can't be directly used for memotable.

SGZW commented 8 months ago

Could I try to solve this issue?

pitrou commented 8 months ago

@SGZW Anyone can work on an issue without asking for permission. Feel free to experiment and submit a PR when ready.