Open asfimport opened 4 years ago
Antoine Pitrou / @pitrou: Right, this is what I had proposed in ARROW-3978 as well.
Wes McKinney / @wesm: If you're interested in working on this, I'll be tied up with some other things for the next few days, otherwise I'll tackle it after that
Antoine Pitrou / @pitrou: It'll depend on the other things I have on my plate. Is this a dependency of something else?
Wes McKinney / @wesm: It's needed for implementing hash aggregations (and any other grouping-type algorithm). No need to rearrange priorities, just wanted to mention.
Wes McKinney / @wesm: Seems like this could be implemented now?
Niranda Perera / @nirandaperera: Is anyone working on this at the moment? If not, I can take it up. @wesm is there a preference for a particular hash function, e.g. Murmur?
Antoine Pitrou / @pitrou: The underlying idea is to reuse the hash functions already used for hash kernels.
Neal Richardson / @nealrichardson: Don't we already have this for the group-by aggregation and joining? As in, the algorithms may already be there, you would just have to expose a scalar kernel. (Alternatively, since we already have those functions, is this still valuable?)
Aldrin Montana / @drin: This PR is ready for review if anyone has time.
Converted the PR to a draft; I can come back to it in about a week.
Okay, it was a bit longer than I hoped for, but I'll try to pick this back up next week.
For documentation purposes, the scalar_hash function implementation this issue covers is a generalized compute function that can take any type of arrow::Array.
At the time I started this work:
- vector<KeyColumnArray>
- ArraySpan was first being implemented
For this issue, I'll try to simply close the loop by finishing implementation of a scalar_hash function that works on any type of arrow::Array by converting it to a KeyColumnArray, even if that means flattening it from a nested structure to a non-nested structure.
The reason I mention this is that the documentation for KeyColumnArray says:
A "key" column is a non-nested, non-union column \see KeyColumnMetadata
This may become a bit misleading or inaccurate once this implementation is complete. Then, either:
- KeyColumnArray is logically a nested array that is physically a non-nested array for the purposes of hashing
- HashStructArray that is physically similar to KeyColumnArray but semantically treated like a StructArray that can be hashed directly.
I will create a new issue and circulate a discussion email to the dev ML when the time comes.
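To make the "flattening from a nested structure to a non-nested structure" idea above concrete, here is a minimal sketch. It does not use Arrow's real types: `Column`, `FlattenInto`, and `Flatten` are hypothetical names invented for illustration, and the dotted path naming is an assumption rather than anything KeyColumnArray does. The sketch only shows how a struct-like column can be walked depth-first so that just its non-nested leaf columns remain.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a column; NOT arrow::Array or KeyColumnArray.
struct Column {
  std::string name;
  std::vector<int64_t> values;   // only meaningful for leaf columns
  std::vector<Column> children;  // non-empty for struct-like columns
};

// Depth-first walk that appends every leaf column to `leaves`.
// Leaf names record the path from the root (an illustrative choice).
void FlattenInto(const Column& column, const std::string& prefix,
                 std::vector<Column>* leaves) {
  const std::string path =
      prefix.empty() ? column.name : prefix + "." + column.name;
  if (column.children.empty()) {
    leaves->push_back(Column{path, column.values, {}});
    return;
  }
  for (const Column& child : column.children) {
    FlattenInto(child, path, leaves);
  }
}

// Returns the non-nested leaf columns of a possibly nested column.
std::vector<Column> Flatten(const Column& column) {
  std::vector<Column> leaves;
  FlattenInto(column, "", &leaves);
  return leaves;
}
```

With this shape, a struct column `row` with children `id` and `point{x, y}` flattens to three leaf columns, which is the kind of non-nested input a row-wise hashing routine could then consume.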
The purpose of this function is to compute 32- or 64-bit hash values for each cell in an Array. Hashes for nested types can be computed recursively by combining the hash values of their children.
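As a rough illustration of the recursive approach described above, the following sketch hashes each cell of a column to a 64-bit value. Everything in it is an assumption made for the example: `Cell`, `HashCell`, and `HashColumn` are hypothetical names, FNV-1a is used for leaf values only because it is short to write down (the thread suggests reusing the hash functions from the existing hash kernels instead), and the combining step is written in the style of Boost's `hash_combine`.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for one cell of a possibly nested array.
struct Cell {
  enum class Kind { kInt, kString, kStruct };
  Kind kind = Kind::kInt;
  int64_t int_value = 0;
  std::string string_value;
  std::vector<Cell> children;  // used when kind == kStruct
};

Cell MakeInt(int64_t v) { Cell c; c.kind = Cell::Kind::kInt; c.int_value = v; return c; }
Cell MakeString(std::string s) { Cell c; c.kind = Cell::Kind::kString; c.string_value = std::move(s); return c; }
Cell MakeStruct(std::vector<Cell> kids) { Cell c; c.kind = Cell::Kind::kStruct; c.children = std::move(kids); return c; }

// 64-bit FNV-1a over raw bytes (an assumed leaf hash, chosen for brevity).
uint64_t HashBytes(const void* data, std::size_t length) {
  const auto* bytes = static_cast<const unsigned char*>(data);
  uint64_t h = 14695981039346656037ULL;  // FNV offset basis
  for (std::size_t i = 0; i < length; ++i) {
    h ^= bytes[i];
    h *= 1099511628211ULL;  // FNV prime
  }
  return h;
}

// Order-sensitive mixing of a child hash into a running seed.
uint64_t CombineHashes(uint64_t seed, uint64_t h) {
  return seed ^ (h + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2));
}

// Leaves are hashed directly; nested cells combine their children's hashes.
uint64_t HashCell(const Cell& cell) {
  switch (cell.kind) {
    case Cell::Kind::kInt:
      return HashBytes(&cell.int_value, sizeof(cell.int_value));
    case Cell::Kind::kString:
      return HashBytes(cell.string_value.data(), cell.string_value.size());
    case Cell::Kind::kStruct: {
      uint64_t seed = 0;
      for (const Cell& child : cell.children) {
        seed = CombineHashes(seed, HashCell(child));
      }
      return seed;
    }
  }
  return 0;  // not reached for valid kinds
}

// Produces one hash per cell, mirroring a scalar (element-wise) function.
std::vector<uint64_t> HashColumn(const std::vector<Cell>& column) {
  std::vector<uint64_t> out;
  out.reserve(column.size());
  for (const Cell& cell : column) out.push_back(HashCell(cell));
  return out;
}
```

The output has the same length as the input, and equal cells always hash to equal values. Null handling and 32-bit output are left out of this sketch.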
Reporter: Wes McKinney / @wesm
Assignee: Aldrin Montana / @drin
Note: This issue was originally created as ARROW-8991. Please see the migration documentation for further details.