Add reservoir sampling - GitHub Issues

Is your feature request related to a problem or challenge?

We have a large set of statistical data. All we need is a subset of the data that remains statistically representative, so that we can return a much smaller result to users. Because insignificantly small values are left out of the result, latency is much lower.

Describe the solution you'd like

Add the ability to (statistically) sample rows. We've done this using reservoir sampling before. I imagine statistical sampling is widely used enough that it should have first-class support.
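
For readers unfamiliar with the technique, here is a minimal sketch of classic reservoir sampling (often called Algorithm R) in plain Python. It is only an illustration of the idea, not a proposal for how DataFusion should implement it; the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return a uniform random sample of up to k rows from a stream of unknown length.

    Hypothetical helper for illustration only; not part of any DataFusion API.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Row i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(1_000_000), k=10, seed=42)
    print(len(sample))
```

The input is read once and only `k` rows are kept in memory at any time, which is why this approach suits large inputs whose length is not known up front.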

Describe alternatives you've considered

I don't know enough about DataFusion to know whether this is possible via a UDF. In the past, we've had issues when sampling records that are pushed into the query layer. The underlying record is still held onto after a row is sampled from it, because immediately materializing each sampled row would result in tiny and inefficient 1-row records. Eventually, though, the sampled rows need to be materialized, as otherwise memory explodes.
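
To make the memory concern concrete, here is a rough sketch of one way to defer materialization. Plain Python lists stand in for record batches, and the class name `BatchReservoir` and the `max_held_batches` threshold are made up for illustration. Sampled rows are kept as references into their source batch, and they are only copied out once too many distinct batches are being kept alive.

```python
import random
from typing import Any, List, Optional, Tuple


class BatchReservoir:
    """Reservoir sampler that keeps (batch, row_index) references.

    Hypothetical sketch: lists stand in for record batches. Rows are copied
    into one compact batch only when too many source batches are held.
    """

    def __init__(self, k: int, max_held_batches: int = 8, seed: Optional[int] = None):
        self.k = k
        self.max_held_batches = max_held_batches
        self.rng = random.Random(seed)
        self.seen = 0
        self.slots: List[Tuple[List[Any], int]] = []

    def push_batch(self, batch: List[Any]) -> None:
        for idx in range(len(batch)):
            if self.seen < self.k:
                self.slots.append((batch, idx))
            else:
                j = self.rng.randint(0, self.seen)  # inclusive on both ends
                if j < self.k:
                    self.slots[j] = (batch, idx)
            self.seen += 1
        # Materialize lazily: only once too many batches are pinned in memory.
        if self.held_batches() > self.max_held_batches:
            self.compact()

    def held_batches(self) -> int:
        # Number of distinct source batches still referenced by the reservoir.
        return len({id(batch) for batch, _ in self.slots})

    def compact(self) -> None:
        # Copy the sampled rows into a single batch so the originals can be freed.
        merged = [batch[idx] for batch, idx in self.slots]
        self.slots = [(merged, i) for i in range(len(merged))]

    def rows(self) -> List[Any]:
        return [batch[idx] for batch, idx in self.slots]
```

The trade-off is between copying cost and retained memory: a small threshold compacts often but frees source batches quickly, while a large one avoids tiny copies at the price of pinning more batches.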

Additional context

No response

apache / datafusion

Add reservoir sampling #11554
