apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.55k stars 3.54k forks source link

[C++][Compute] Add scalar_hash function #17211

Open asfimport opened 4 years ago

asfimport commented 4 years ago

The purpose of this function is to compute 32- or 64-bit hash values for each cell in an Array. Hashes for nested types can be computed recursively by combining the hash values of their children

Reporter: Wes McKinney / @wesm Assignee: Aldrin Montana / @drin

Subtasks:

Note: This issue was originally created as ARROW-8991. Please see the migration documentation for further details.

asfimport commented 4 years ago

Antoine Pitrou / @pitrou: Right, this is what I had proposed in ARROW-3978 as well.

asfimport commented 4 years ago

Wes McKinney / @wesm: If you're interested in working on this, I'll be tied up with some other things for the next few days, otherwise I'll tackle it after that

asfimport commented 4 years ago

Antoine Pitrou / @pitrou: It'll depend on the other things I have on my plate. Is this a dependency of something else?

asfimport commented 4 years ago

Wes McKinney / @wesm: It's needed for implementing hash aggregations (and any other grouping-type algorithm). No need to rearrange priorities, just wanted to mention.

asfimport commented 3 years ago

Wes McKinney / @wesm: Seems like this could be implemented now?

asfimport commented 3 years ago

Niranda Perera / @nirandaperera: Is there anyone working on this ATM? If not, I can take this up. @wesm is there a preference of a hash function, ex: Murmur etc?

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: The underlying idea is to reuse the hash functions already used for hash kernels.

asfimport commented 3 years ago

Neal Richardson / @nealrichardson: Don't we already have this for the group-by aggregation and joining? As in, the algorithms may already be there, you would just have to expose a scalar kernel. (Alternatively, since we already have those functions, is this still valuable?)

asfimport commented 2 years ago

Aldrin Montana / @drin: this PR is ready for review if anyone has time

drin commented 1 year ago

converted the PR to a draft; I can come back to it in about a week

drin commented 1 year ago

okay, it was a bit longer than I hoped for, but I'll try to pick this back up next week

drin commented 4 months ago

For documentation purposes, the scalar_hash function implementation this issue covers is a generalized compute function that can take any type of arrow::Array.

At the time I started this work:

For this issue, I'll try to simply close the loop by finishing implementation of a scalar_hash function that works on any type of arrow::Array by converting it to a KeyColumnArray even if that means flattening it from a nested structure to a non-nested structure.

The reason I mention this, is that the documentation for KeyColumnArray says:

A "key" column is a non-nested, non-union column \see KeyColumnMetadata

This may become a bit misleading or inaccurate once this implementation is complete. Then, either:

  1. the language will need to be updated to match usage
    • in case a KeyColumnArray is logically a nested array that is physically a non-nested array for the purposes of hashing
    • maybe this is what is already is and no real changes are necessary?
  2. the code should be re-architected for clarity
    • in case it is better to create a similar, but differently named structure like HashStructArray that is a physically similar to KeyColumnArray but semantically treated like a StructArray that can be hashed directly.

I will create a new issue and circulate a discussion email to the dev ML when the time comes.