apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++] Return a random sample of rows from a query #18879

Open asfimport opened 3 years ago

asfimport commented 3 years ago

Please can we have a kernel that returns a random sample of rows? We've had a request to be able to do this in R: https://github.com/apache/arrow-cookbook/issues/83

Reporter: Nicola Crane / @thisisnic

Related issues:

Note: This issue was originally created as ARROW-14254. Please see the migration documentation for further details.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: That's easily decomposed into two different kernels:

asfimport commented 3 years ago

Neal Richardson / @nealrichardson: Perhaps, and we do that in R now, but that runs up against various other issues around async query processing and non-deterministic order. There's probably a more optimal way. cc @westonpace @aocsa

asfimport commented 3 years ago

Weston Pace / @westonpace: @nealrichardson Can you expand on the issues you are running into today? I'm not sure I follow. Also, there is no kernel that generates random integers (ARROW-12404 would be a prerequisite) so I assume you are generating the random numbers in R?

asfimport commented 3 years ago

Neal Richardson / @nealrichardson: Maybe it's better now than I remember? We're calling Scanner::TakeRows (https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/scanner.cc#L1022) with an array of indices created in R.

asfimport commented 3 years ago

Neal Richardson / @nealrichardson: Among the new challenges, now that we can evaluate more complex queries, is that we don't always know how many rows we have in order to know how many to sample.

asfimport commented 3 years ago

Weston Pace / @westonpace: I haven't done much with TakeRows so it's possible there are issues there but I don't see any open JIRAs.

Scanner::CountRows should tell you the row count information (not for free but for roughly the same cost we would incur in C++). I'm not sure we can reliably do a "streaming" sample. I think it would always need to be two passes. We could maybe do a streaming pseudo-sample. For example, given a target selectivity of 10% we sample 10% from each batch that comes through. But that isn't the same thing as a 10% sample.
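
As a rough illustration of the per-batch pseudo-sample described above (not an existing Arrow kernel; the function and parameter names here are made up), a standard-library sketch:

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical sketch: pick roughly `fraction` of the row indices of a single
// batch.  Because the fraction is applied batch by batch, the overall result
// is only a pseudo-sample: every batch contributes the same proportion, which
// is not the same as drawing a uniform sample across the whole dataset.
std::vector<int64_t> SampleBatchIndices(int64_t batch_num_rows, double fraction,
                                        std::mt19937_64& rng) {
  std::vector<int64_t> indices(batch_num_rows);
  std::iota(indices.begin(), indices.end(), 0);
  std::shuffle(indices.begin(), indices.end(), rng);
  auto k = static_cast<int64_t>(fraction * static_cast<double>(batch_num_rows));
  indices.resize(k);
  std::sort(indices.begin(), indices.end());  // restore original row order
  return indices;
}
```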

asfimport commented 3 years ago

Neal Richardson / @nealrichardson: I can't do Scanner with joins etc., can I?

asfimport commented 3 years ago

Weston Pace / @westonpace: No. In that case are you asking for a streaming pseudo-sample as I described before? I suppose we could create a "pipeline breaking sample" which collects the entire table (spilling over as needed) to get the mid-query count, and then emits a sampled output.
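
A small sketch of the second half of that idea, assuming the full row count is already known after the pipeline break (DrawSampleIndices is a hypothetical name, and a real implementation would avoid materializing every index):

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <numeric>
#include <random>
#include <vector>

// Once the total row count is known (e.g. after collecting the whole table),
// draw `n` distinct row indices uniformly from [0, total_rows).  std::sample
// returns them in ascending order, ready for a take-style operation.
std::vector<int64_t> DrawSampleIndices(int64_t total_rows, int64_t n, uint64_t seed) {
  std::vector<int64_t> all(total_rows);
  std::iota(all.begin(), all.end(), 0);
  std::vector<int64_t> picked;
  picked.reserve(static_cast<size_t>(n));
  std::sample(all.begin(), all.end(), std::back_inserter(picked), n,
              std::mt19937_64(seed));
  return picked;
}
```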

asfimport commented 3 years ago

Weston Pace / @westonpace: @nealrichardson Gentle ping. I'm still not sure if this JIRA is asking for:

asfimport commented 3 years ago

Neal Richardson / @nealrichardson: The former: give me N random rows from the dataset. I (as a user) have no idea how you're making batch sizes internally and might not even know that batches are a thing.

asfimport commented 3 years ago

Weston Pace / @westonpace: I think we'd need two passes then. First, get the size of the dataset. Second, some kind of "streaming take". Alternatively, some googling suggests the following alternatives from SQL:

  1. Generate a random value for each row, sort by this column, then use top-k
  2. Grab a random percent of rows with something like "SELECT * FROM table WHERE rand() < 0.1"
  3. Some SQL servers have a TABLESAMPLE command but it looks like that just randomly selects batches (similar to #2) and then applies top-k to that result. This means you tend to get clusters of values, so I don't see how it's better than #2.
  4. An alternative to 1 & 2, if you don't have rand() implemented, is to hash the index column and then AND it with some mask (this limits the percentages you can apply to 1/power-of-two, but you can then chop down to the requested size); see the sketch just below.
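
A rough sketch of option 4 using a SplitMix64-style mixer, standard library only (Mix64 and HashMaskSample are illustrative names, not Arrow kernels):

```cpp
#include <cstdint>
#include <vector>

// SplitMix64-style mixer, used here only to get a well-distributed hash of the
// row index; any decent integer hash would do.
inline uint64_t Mix64(uint64_t x) {
  x += 0x9E3779B97F4A7C15ULL;
  x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
  x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
  return x ^ (x >> 31);
}

// Keep roughly 1/2^mask_bits of the rows by masking the hashed index, then
// chop the survivors down to `target_rows`.  Selectivity is limited to powers
// of two, which is the restriction mentioned in option 4.
std::vector<int64_t> HashMaskSample(int64_t num_rows, int mask_bits,
                                    int64_t target_rows) {
  const uint64_t mask = (uint64_t{1} << mask_bits) - 1;
  std::vector<int64_t> kept;
  for (int64_t i = 0; i < num_rows; ++i) {
    if ((Mix64(static_cast<uint64_t>(i)) & mask) == 0) kept.push_back(i);
  }
  if (static_cast<int64_t>(kept.size()) > target_rows) kept.resize(target_rows);
  return kept;
}
```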

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Approach 1 sounds reasonable, though I don't understand why sorting is required. Just use a streaming top-k.
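
A minimal sketch of what such a streaming top-k over a random key could look like, standard library only and not the Arrow implementation; keeping the N smallest random keys in a bounded heap is equivalent to sampling N rows without replacement, so no full sort (and no second pass) is needed:

```cpp
#include <cstdint>
#include <queue>
#include <random>
#include <utility>
#include <vector>

// Keep the N rows with the smallest random keys seen so far.  Feeding every
// row index through Offer() yields a uniform sample of N rows in one pass,
// without sorting the whole dataset.
class StreamingRandomTopK {
 public:
  explicit StreamingRandomTopK(size_t n, uint64_t seed = 42) : n_(n), rng_(seed) {}

  void Offer(int64_t row_index) {
    double key = dist_(rng_);
    if (heap_.size() < n_) {
      heap_.emplace(key, row_index);
    } else if (key < heap_.top().first) {
      heap_.pop();
      heap_.emplace(key, row_index);
    }
  }

  std::vector<int64_t> TakeIndices() {
    std::vector<int64_t> out;
    out.reserve(heap_.size());
    while (!heap_.empty()) {
      out.push_back(heap_.top().second);
      heap_.pop();
    }
    return out;
  }

 private:
  size_t n_;
  std::mt19937_64 rng_;
  std::uniform_real_distribution<double> dist_{0.0, 1.0};
  // Max-heap on the key, so the largest key among the current N is evicted first.
  std::priority_queue<std::pair<double, int64_t>> heap_;
};
```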

asfimport commented 3 years ago

Neal Richardson / @nealrichardson: 1 and 2 would satisfy the behavior that dplyr::slice_sample() supports. So if we had random number generating kernels, we could do them.

asfimport commented 3 years ago

Weston Pace / @westonpace:

Approach 1 sounds reasonable, though I don't understand why sorting is required. Just use a streaming top-k.

Ah, good point. I had forgotten that top-k implicitly sorted (I was thinking of it more as a head).

1 and 2 would satisfy the behavior that dplyr::slice_sample() supports. So if we had random number generating kernels, we could do them.

This would not give the exact same behavior as dplyr::slice_sample. For example:

asfimport commented 3 years ago

Weston Pace / @westonpace: If the number of target rows is low we could also do something like a "streaming take" which scans more efficiently (skipping record batches entirely if they aren't included). That might actually give better performance, since a streaming top-k requires scanning the entire dataset.
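
To illustrate the batch-skipping idea, a sketch with made-up names (not Scanner internals): with the chosen global row indices sorted up front, each batch only needs the slice of indices that falls in its row range, and batches with no hits can be skipped without decoding them.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Given globally sorted row indices to take, return the batch-relative indices
// that fall inside a batch covering [batch_start, batch_start + batch_rows).
// An empty result means the whole batch can be skipped.
std::vector<int64_t> IndicesForBatch(const std::vector<int64_t>& sorted_global_indices,
                                     int64_t batch_start, int64_t batch_rows) {
  auto lo = std::lower_bound(sorted_global_indices.begin(),
                             sorted_global_indices.end(), batch_start);
  auto hi = std::lower_bound(lo, sorted_global_indices.end(),
                             batch_start + batch_rows);
  std::vector<int64_t> local;
  local.reserve(static_cast<size_t>(hi - lo));
  for (auto it = lo; it != hi; ++it) local.push_back(*it - batch_start);
  return local;
}
```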

asfimport commented 3 years ago

Neal Richardson / @nealrichardson: I don't follow why you need 2 scans if you're using an RNG: it's either generate random and take top N, or generate uniform random (0, 1) and take rand < prop, like in your example 2. (The prop version may not give an exact percentage; it's only correct in expectation, but maybe that's good enough.)
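
For completeness, a standard-library sketch of the rand < prop variant (BernoulliSample is a made-up name); as noted above, the output size is only correct in expectation:

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Bernoulli sample: keep each row independently with probability `prop`.
// The size of the result is binomially distributed around prop * num_rows,
// not exact.
std::vector<int64_t> BernoulliSample(int64_t num_rows, double prop, uint64_t seed) {
  std::mt19937_64 rng(seed);
  std::bernoulli_distribution keep(prop);
  std::vector<int64_t> indices;
  for (int64_t i = 0; i < num_rows; ++i) {
    if (keep(rng)) indices.push_back(i);
  }
  return indices;
}
```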

asfimport commented 3 years ago

Weston Pace / @westonpace:

The prop version may not give an exact percentage

This is the only reason you would need two scans (assuming the user provided a percentage).

If the user provided an exact number then you are correct. Top-k alone is fine.

asfimport commented 3 years ago

Jonathan Keane / @jonkeane: For what it's worth: when I've seen this kind of action in the wild, people almost always want a random N rows and not a random percentage. Very typically it's downsampling to something reasonable to do [ML|summary stats|exploratory analysis] with, and folks will take 100k or 10k or 1M or whatever they think is reasonable.

This is (probably) a separate issue, but one thing to note: if we take some limited number of rows and always take them from the beginning, and the data shows up in an order (even if the order is not always exactly the same, it may be similar enough to how it's stored, for example), the randomness of the sample won't be good enough for what some people use it for. We might consider a fast (semi)random sample that does this, and then a more truly random sample that has stronger randomness guarantees.

This is (almost definitely) a separate issue (or possibly would work automagically with this work + group_by): another common task here is taking random samples from some grouped set of rows, e.g. "I want to have a random sample of 100 rows from each day from 1 year ago to today, resulting in 365 000 rows".