apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Implement `unique` function #4698

Open izveigor opened 1 year ago

izveigor commented 1 year ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Part of https://github.com/apache/arrow-datafusion/issues/7289

Describe the solution you'd like

The `unique` function removes duplicate values (NULLs included) from an array. It keeps the first occurrence of each value, so the original order is preserved.

Example (PyArrow):

>>> pc.unique(pa.array([1, 2, 1, 4, 5, 2, 4, 1]))
<pyarrow.lib.Int64Array object at 0x7f849d1f2040>
[
  1,
  2,
  4,
  5
]

with nulls:

>>> pc.unique(pa.array([1, 2, 1, 4, None, 5, 2, None, 4, 1]))
<pyarrow.lib.Int64Array object at 0x7f849d1f2040>
[
  1,
  2,
  4,
  null,
  5
]
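For discussion, the behaviour shown above can be sketched in plain Rust without any Arrow types. This is only an illustration under stated assumptions: `Option<T>` stands in for a nullable array slot (`None` models NULL), and the function name and signature are hypothetical, not an existing arrow-rs API.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Order-preserving de-duplication over a nullable column.
/// Assumption: `Option<T>` models an Arrow slot, with `None` as NULL.
/// This is a hypothetical helper, not part of arrow-rs.
pub fn unique<T: Eq + Hash + Clone>(values: &[Option<T>]) -> Vec<Option<T>> {
    // Track the values already emitted; NULL (`None`) is tracked like any other value.
    let mut seen: HashSet<&Option<T>> = HashSet::with_capacity(values.len());
    let mut out = Vec::new();
    for v in values {
        // `insert` returns true only the first time a value is encountered,
        // so each distinct value is pushed once, at its first position.
        if seen.insert(v) {
            out.push(v.clone());
        }
    }
    out
}
```

Calling it on the second input above, with `Some(..)` for values and `None` for nulls, should yield the values in the same order as the PyArrow output. A real kernel would also have to handle floating-point types, which do not implement `Eq` and `Hash` in Rust, and would operate on Arrow buffers rather than a `Vec`.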

Describe alternatives you've considered

Additional context

Documentation: https://arrow.apache.org/docs/cpp/compute.html

tustvold commented 1 year ago

I'm not sure we typically provide these sorts of grouping/aggregation functions in arrow-rs? How is it you intend to integrate such a function with the grouping machinery in DataFusion?

izveigor commented 1 year ago

Hello, @tustvold! I created the ticket because I found a similar function in other implementations of Apache Arrow. If you consider this function redundant, it would be better to implement the functionality in Arrow DataFusion and close this ticket.

tustvold commented 1 year ago

What do you think of implementing this in DataFusion first, and we can then assess whether we upstream a version of it?