apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.38k stars 3.5k forks source link

[C++] Implement hashing, dictionary-encoding for StructArray #20581

Open asfimport opened 5 years ago

asfimport commented 5 years ago

This is a central requirement for hash-aggregations such as


SELECT AGG_FUNCTION(expr)
FROM table
GROUP BY expr1, expr2, ...

The materialized keys in the GROUP BY section form a struct, which can be incrementally hashed to produce dictionary codes suitable for computing aggregates or any other purpose.

There are a few subtasks related to this, such as efficiently constructing a record (that can be hashed quickly) to identify each "row" in the struct. Maybe we should start with that first

Reporter: Wes McKinney / @wesm

Related issues:

Note: This issue was originally created as ARROW-3978. Please see the migration documentation for further details.

asfimport commented 5 years ago

Antoine Pitrou / @pitrou: To implement this efficiently, we would need to split the computation of hash values (for an array or morsel) from their use in hashing kernels. It is probably possible to hash struct values efficiently, simply by first hashing the underlying child arrays, then by combining the results.

asfimport commented 5 years ago

Wes McKinney / @wesm: There's different approaches. You might want to look at what an existing columnar database engine like Clickhouse or Dremio is doing for hashing tuples (aka structs). One approach is to "pivot" or "recordize" the data from columnar to record format (similar to NumPy's struct dtype memory layout, but it would need to be generalized of course to account for varbinary, nulls, and nested data – nested data would have to be recursively flattened).

NB the same code paths involved with hashing structs will need to be used for hash joins, hash aggregations, and other algorithms.

@jacques-n do you have any advice for us or pointers to literature about this topic?

asfimport commented 5 years ago

Jacques Nadeau / @jacques-n: Here is some info about what we found worked well. Note that it doesn't go into a lot of detail about the pivot algorithm beyond the basic concepts of fixed and variable vectors.

https://docs.google.com/document/d/1Yk6IvDL28IzEjqcqSkFdevRyMrC8_kwzEatHvcOnawM/edit

 

Main idea around pivot: 

asfimport commented 5 years ago

Francois Saint-Jacques / @fsaintjacques: ClickHouse also uses the pivot method, see

asfimport commented 5 years ago

Wes McKinney / @wesm: Moving out of 0.14.0, but don't let that stop anyone from working on this

asfimport commented 3 years ago

Wes McKinney / @wesm: @bkietz @michalursa seems like this could be supported pretty soon using the new hash table machinery?

asfimport commented 3 years ago

Michal Nowakiewicz / @michalursa: It sounds like this is already implemented as part of hash group by, but we should probably add a documentation about how it works exactly. Also, currently only 32-bit hash is implemented, but we need to add 64-bit hashing as well. 

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: What do you mean with "32-bit hash"?

asfimport commented 3 years ago

Ben Kietzman / @bkietz:

seems like this could be supported pretty soon using the new hash table machinery?

Correct, and I'd say we should try replacing everything in vector_hash.cc with usage of Grouper then compare benchmarks. I think we'd probably get a performance win for dictionary encoding primitives as well as support for dictionary encoding structs

asfimport commented 2 years ago

Aldrin Montana / @drin: I am picking up ARROW-8991 today. @bkietz @pitrou , I assume that this issue is mostly using new mechanisms and then "exposing an interface". I need to build up my understanding, but the "exposing an interface" portion may be addressed by ARROW-8991.

I just wanted to ping here and double check that it makes sense to take this while I work on ARROW-8991. I also wanted to check if there's anything in particular I should look at that hasn't already been mentioned in previous comments.

asfimport commented 2 years ago

Aldrin Montana / @drin: I just wanted to note here that I am trying to handle nested types for ARROW-8991. I realize now that StructArray would be included in that umbrella, so when ARROW-8991 is complete, this would be partially completed.

I don't know the dictionary-encoding infrastructure, but if I get around to that half then I'll pick up this issue.