apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Add reservoir sampling #11554

Open brancz opened 2 months ago

brancz commented 2 months ago

Is your feature request related to a problem or challenge?

We have a large body of statistical data. All we need is a subset of it that maintains statistical significance. Because insignificantly small values are left out, we can return a much smaller result to users, which results in much lower latency.

Describe the solution you'd like

Add the ability to (statistically) sample rows. We've done this using reservoir sampling before. I imagine statistical sampling is widely used enough that it should have first-class support.
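For readers unfamiliar with the technique, here is a minimal sketch of reservoir sampling (the classic "Algorithm R") over a stream of rows of unknown length. It is not DataFusion code: `reservoir_sample`, `SplitMix64`, and the `seed` parameter are hypothetical names chosen for illustration, and the small built-in generator is only there so the example has no external dependencies.

```rust
/// Tiny deterministic PRNG (SplitMix64). Hypothetical helper, used only so
/// this sketch needs no external crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Roughly uniform integer in `0..bound` (modulo bias is ignored here).
    fn below(&mut self, bound: u64) -> u64 {
        self.next_u64() % bound
    }
}

/// Reservoir sampling (Algorithm R): keep at most `k` rows from a stream
/// whose length is not known in advance, using O(k) memory.
fn reservoir_sample<T, I>(rows: I, k: usize, seed: u64) -> Vec<T>
where
    I: IntoIterator<Item = T>,
{
    let mut rng = SplitMix64(seed);
    let mut reservoir: Vec<T> = Vec::with_capacity(k);
    for (i, row) in rows.into_iter().enumerate() {
        if i < k {
            // The first k rows fill the reservoir directly.
            reservoir.push(row);
        } else {
            // Row i (0-based) is kept with probability k / (i + 1),
            // replacing a uniformly chosen slot.
            let j = rng.below(i as u64 + 1) as usize;
            if j < k {
                reservoir[j] = row;
            }
        }
    }
    reservoir
}
```

Calling `reservoir_sample(0..1000u32, 10, 42)` returns 10 distinct values from the range. If the stream has fewer than `k` rows, all of them are returned.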

Describe alternatives you've considered

I don't know enough about DataFusion to know whether this is possible via a UDF. In the past, we've had issues when sampling records pushed into the query layer: the underlying record is still held onto after sampling, since materializing each sampled row immediately would result in tiny and inefficient 1-row records. Eventually, though, the sampled rows need to be materialized, as otherwise memory explodes.
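To make the memory concern concrete, here is a rough sketch of the trade-off, using plain reference-counted vectors as a stand-in for record batches. `RowRef` and `compact` are hypothetical names, not DataFusion or Arrow APIs, and the single `i64` column is an assumption made to keep the example small.

```rust
use std::rc::Rc;

/// A sampled row that still points into the batch it came from.
/// While any `RowRef` is alive, its whole source batch stays in memory.
struct RowRef {
    batch: Rc<Vec<i64>>, // stand-in for a shared record batch
    index: usize,        // position of the sampled row in that batch
}

/// Copy the sampled rows into one compact, owned batch. The `RowRef`s are
/// consumed and dropped here, so source batches with no other owners can
/// be freed.
fn compact(sample: Vec<RowRef>) -> Vec<i64> {
    sample.iter().map(|r| r.batch[r.index]).collect()
}
```

Deferring `compact` until the reservoir has settled avoids creating many 1-row records, but it has to run at some point, since each outstanding `RowRef` pins an entire source batch.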

Additional context

No response

ozankabak commented 2 months ago

Seems like this should be doable via a UDF/UDAF. Would be happy to help review if you want to take a stab at implementing this.