apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Java] Represent a data element of a vector as a tree of ArrowBufPointer #23509

Open asfimport opened 4 years ago

asfimport commented 4 years ago

For a fixed-width or variable-width vector, each data element can be represented as an ArrowBufPointer object, which points to a contiguous memory segment. This makes many tasks easier and more efficient because no memory copy is needed: calculating hash codes, comparing values, etc.
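To make the idea concrete, here is a minimal sketch of a pointer to a contiguous segment that hashes and compares bytes in place. It deliberately does not use the Arrow library: `SegmentPointer` is a hypothetical stand-in built on `java.nio.ByteBuffer`, and its name, fields, and hash function are assumptions for illustration, not the actual ArrowBufPointer implementation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a buffer pointer: (buffer, offset, length), no copying. */
final class SegmentPointer implements Comparable<SegmentPointer> {
    private final ByteBuffer buf;
    private final int offset;
    private final int length;

    SegmentPointer(ByteBuffer buf, int offset, int length) {
        if (offset < 0 || length < 0 || offset + length > buf.limit()) {
            throw new IndexOutOfBoundsException("segment out of range");
        }
        this.buf = buf;
        this.offset = offset;
        this.length = length;
    }

    /** Two pointers are equal when the bytes they point to are equal. */
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof SegmentPointer)) return false;
        SegmentPointer other = (SegmentPointer) o;
        if (length != other.length) return false;
        for (int i = 0; i < length; i++) {
            if (buf.get(offset + i) != other.buf.get(other.offset + i)) return false;
        }
        return true;
    }

    /** Hash computed directly over the referenced bytes (simple polynomial hash). */
    @Override
    public int hashCode() {
        int h = 1;
        for (int i = 0; i < length; i++) {
            h = 31 * h + buf.get(offset + i);
        }
        return h;
    }

    /** Lexicographic comparison of the referenced bytes, treated as unsigned. */
    @Override
    public int compareTo(SegmentPointer other) {
        int n = Math.min(length, other.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(buf.get(offset + i) & 0xFF,
                                    other.buf.get(other.offset + i) & 0xFF);
            if (c != 0) return c;
        }
        return Integer.compare(length, other.length);
    }

    public static void main(String[] args) {
        // Data buffer of a variable-width vector holding "foo", "bar", "foo".
        ByteBuffer data = ByteBuffer.wrap("foobarfoo".getBytes(StandardCharsets.US_ASCII));
        SegmentPointer e0 = new SegmentPointer(data, 0, 3);
        SegmentPointer e1 = new SegmentPointer(data, 3, 3);
        SegmentPointer e2 = new SegmentPointer(data, 6, 3);
        System.out.println(e0.equals(e2));         // prints true
        System.out.println(e1.compareTo(e0) < 0);  // prints true ("bar" sorts before "foo")
    }
}
```

The point of the sketch is that equality, hashing, and ordering all read the underlying buffer through an offset and a length, so no element is ever copied out of the vector.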

This cannot be achieved for complex vectors, because their values often reside in more than one contiguous memory region. However, the contiguous memory regions for each data element form a tree-like structure whose leaf nodes are the contiguous memory regions. For example, a data element of a struct vector forms a tree whose root corresponds to the struct vector and whose child nodes correspond to the child vectors.

In this issue, we provide a data structure that represents each data element of a vector as a tree, whose leaf nodes are ArrowBufPointers, representing contiguous memory regions for the data element.

With this data structure, many tasks also become easier and more efficient: calculating hash codes and comparing vector elements (ordering and equality). In addition, we can do things that were not possible before, such as placing data elements into a hash table or hash set.
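The following is a rough sketch of what such a tree could look like, not the proposed Arrow implementation. `ElementTree`, `leaf`, `node`, and `structElement` are hypothetical names, leaves are zero-copy `java.nio.ByteBuffer` slices standing in for ArrowBufPointers, and validity (null) bits are ignored for brevity. The example models one element of a struct vector with an int32 child and a variable-width child.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical tree view of one data element; leaves are contiguous segments. */
final class ElementTree {
    private final ByteBuffer segment;          // non-null only for leaf nodes
    private final List<ElementTree> children;  // empty for leaf nodes

    private ElementTree(ByteBuffer segment, List<ElementTree> children) {
        this.segment = segment;
        this.children = children;
    }

    /** Leaf: a zero-copy slice of [offset, offset + length) in buf. */
    static ElementTree leaf(ByteBuffer buf, int offset, int length) {
        ByteBuffer view = buf.duplicate();  // shares content, independent position/limit
        view.limit(offset + length);
        view.position(offset);
        return new ElementTree(view.slice(), Collections.<ElementTree>emptyList());
    }

    /** Internal node: one child per child vector. */
    static ElementTree node(ElementTree... kids) {
        return new ElementTree(null,
            Collections.unmodifiableList(new ArrayList<>(Arrays.asList(kids))));
    }

    /** Element i of a struct of (int32 id, variable-width name). */
    static ElementTree structElement(ByteBuffer ids, ByteBuffer names, int[] nameOffsets, int i) {
        ElementTree id = leaf(ids, i * Integer.BYTES, Integer.BYTES);
        ElementTree name = leaf(names, nameOffsets[i], nameOffsets[i + 1] - nameOffsets[i]);
        return node(id, name);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ElementTree)) return false;
        ElementTree other = (ElementTree) o;
        if (segment != null || other.segment != null) {
            // ByteBuffer.equals compares the remaining bytes of both slices.
            return segment != null && segment.equals(other.segment);
        }
        return children.equals(other.children);  // recursive, child by child
    }

    @Override
    public int hashCode() {
        // ByteBuffer.hashCode depends only on the remaining bytes of the slice.
        return segment != null ? segment.hashCode() : 31 * children.hashCode() + 1;
    }

    public static void main(String[] args) {
        ByteBuffer ids = ByteBuffer.allocate(12).putInt(0, 7).putInt(4, 8).putInt(8, 7);
        ByteBuffer names = ByteBuffer.wrap("annbobann".getBytes(StandardCharsets.US_ASCII));
        int[] offsets = {0, 3, 6, 9};

        Set<ElementTree> seen = new HashSet<>();
        for (int i = 0; i < 3; i++) {
            seen.add(structElement(ids, names, offsets, i));
        }
        System.out.println(seen.size());  // prints 2: elements 0 and 2 are both (7, "ann")
    }
}
```

Because equality and hashing recurse down to the leaf segments, struct elements can be used as hash set members or hash table keys without materializing them as separate objects that copy the data.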

Reporter: Liya Fan / @liyafan82

PRs and other links:

Note: This issue was originally created as ARROW-7213. Please see the migration documentation for further details.

asfimport commented 2 years ago

Todd Farmer / @toddfarmer: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.