apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[R] Add slice() method #29395

Open asfimport opened 3 years ago

asfimport commented 3 years ago

Rescoped to be just slice(). The other functions were implemented in ARROW-13766.

Original description:


Implement slice(), slice_head(), and slice_tail() methods for ArrowTabular, Dataset, and arrow_dplyr_query objects. This should be relatively straightforward, using Take() to return only the specified rows. We already have a head() method, which I believe we can reuse for slice_head().

Reporter: Ian Cook / @ianmcook

Note: This issue was originally created as ARROW-13767. Please see the migration documentation for further details.

asfimport commented 3 years ago

Joris Van den Bossche / @jorisvandenbossche: Just a note that there is also a C++ Table::Slice method that can already be used directly for this (it is not a compute kernel, but a method on the Table). The pyarrow.Table.slice method uses that (https://github.com/apache/arrow/blob/26a34c3a2300620787806c5a8cee08ff30610e3e/python/pyarrow/table.pxi#L1307-L1334).

asfimport commented 3 years ago

Neal Richardson / @nealrichardson: Right, we use Slice internally
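
As a rough illustration of the Slice method mentioned above, here is a sketch from the R side. It assumes the arrow R package is installed and relies on its R6 method Table$Slice(offset, length), where the offset is zero-based. The mtcars data frame is only a stand-in table, and the variable names are made up for this example.

```r
library(arrow)

# Stand-in data: convert a built-in data frame to an Arrow Table
tab <- arrow_table(mtcars)

# Zero-based offset 5, length 3: rows 6 to 8 in R's one-based numbering
sliced <- tab$Slice(5, 3)

# Pull the slice back into R to inspect it
sliced_df <- as.data.frame(sliced)
```

Note that this only covers an in-memory Table, where row order is fixed. It says nothing about Dataset scans, which is the harder case discussed in this thread.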

asfimport commented 3 years ago

Neal Richardson / @nealrichardson: I'm not sure about these (and about head/tail in general now) since dataset scans are no longer in deterministic order. How do dbplyr backends support these for databases that behave similarly?

asfimport commented 3 years ago

Joris Van den Bossche / @jorisvandenbossche:

since dataset scans are no longer in deterministic order

What do you mean exactly? (Did something change recently? I thought the scans can have deterministic order if needed.) For example, the Scanner has a Head method that returns the first n rows (I would suppose this is deterministic?)

asfimport commented 3 years ago

Weston Pace / @westonpace: This is an interesting question. The datasets API does have an ordered scan. This is used by the head method.

However, R has moved away from the datasets API and is using the exec plan directly now. The datasets API's ordering is not part of the exec plan.

asfimport commented 3 years ago

Jonathan Keane / @jonkeane: The dplyr docs say they "do not work with relational databases"

Though, IME, many real-world uses of slice(), slice_head(), slice_tail() happen after an arrange:


library(dplyr)

mtcars %>% 
  group_by(cyl) %>% 
  arrange(mpg) %>% 
  slice_head(n = 2)

These are better written using slice_min(), though there has been some evolution around that + top_n() and the like. I've seen code like the above (with arrange + slice_head) a lot.
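
For reference, a minimal sketch of the slice_min() form of the pipeline above, run on an in-memory data frame with plain dplyr (not Arrow). The with_ties = FALSE argument is an assumption added here so that exactly two rows per group come back, as with slice_head(n = 2); whether ties should be kept depends on the use case. The result names are made up for this example.

```r
library(dplyr)

# arrange() + slice_head(): relies on the row order left behind by arrange()
by_arrange <- mtcars %>%
  group_by(cyl) %>%
  arrange(mpg) %>%
  slice_head(n = 2) %>%
  ungroup()

# slice_min(): states the intent directly and needs no prior sort
by_slice_min <- mtcars %>%
  group_by(cyl) %>%
  slice_min(mpg, n = 2, with_ties = FALSE) %>%
  ungroup()
```

The second form does not depend on the order in which rows arrive, which is why it seems a better fit for backends that make no ordering guarantee.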

asfimport commented 3 years ago

Weston Pace / @westonpace: This has come up a few times and I think the resolution is always "we'll probably have to figure that out at some point", so I went ahead and sent something to the mailing list so we can have the conversation.

asfimport commented 2 years ago

Vitalie Spinu: This one turned out to be a blocker for us while converting a project from a SQL DB to an Arrow dataset.

How does one go about slicing within a group before this issue is fixed?


tbl %>% 
  group_by(simulator, model) %>%
  slice_head(n = 1) %>%
  collect()