Open asfimport opened 3 years ago
Joris Van den Bossche / @jorisvandenbossche:
Just a note that there is also a C++ Table::Slice method that can already be directly used for this (not a compute kernel, though, but a method on the Table). The pyarrow.Table.slice method uses that (https://github.com/apache/arrow/blob/26a34c3a2300620787806c5a8cee08ff30610e3e/python/pyarrow/table.pxi#L1307-L1334).
Neal Richardson / @nealrichardson: Right, we use Slice internally
Neal Richardson / @nealrichardson: I'm not sure about these (and about head/tail in general now) since dataset scans are no longer in deterministic order. How do dbplyr backends support these for databases that behave similarly?
Joris Van den Bossche / @jorisvandenbossche:
"since dataset scans are no longer in deterministic order"
What do you mean exactly? (Did something change recently? I thought the scans can have deterministic order if needed.) For example, the Scanner has a Head method to return the first n rows (I would suppose this is deterministic?)
Weston Pace / @westonpace: This is an interesting question. The datasets API does have an ordered scan. This is used by the head method.
However, R has moved away from the datasets API and is using the exec plan directly now. The datasets API's ordering is not part of the exec plan.
Jonathan Keane / @jonkeane: The dplyr docs say they "do not work with relational databases"
Though, IME, many real-world uses of slice(), slice_head(), slice_tail() happen after an arrange:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
arrange(mpg) %>%
slice_head(n = 2)
These are better written using slice_min(), though there has been some evolution around that + top_n() and the like. I've seen code like the above (with arrange + slice_head) a lot.
Weston Pace / @westonpace: This has come up a few times and I think the resolution is always "we'll probably have to figure that out at some point" so I went ahead and sent something to the ML so we can have the conversation.
Vitalie Spinu: This one turned out to be a blocker for us during the conversion of a project from a SQL DB to an Arrow dataset.
How does one go about slicing within a group before this issue is fixed?
tbl %>%
group_by(simulator, model) %>%
slice(n = 1) %>%
collect()
Rescoped to be just slice(). The other functions were implemented in ARROW-13766.
Original description:
Implement slice(), slice_head(), and slice_tail() methods for ArrowTabular, Dataset, and arrow_dplyr_query objects. I believe this should be relatively straightforward, using Take() to return only the specified rows. We already have a head() method which I believe we can reuse for slice_head().
Reporter: Ian Cook / @ianmcook
Note: This issue was originally created as ARROW-13767. Please see the migration documentation for further details.