Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0
2.38k stars 170 forks source link

Calculate partition boundaries for `read_sql` using sampling #3245

Open colin-ho opened 3 weeks ago

colin-ho commented 3 weeks ago

Is your feature request related to a problem?

Currently, read_sql calculates partition boundaries using PERCENTILE_DISC, falling back to using min - max if needed. PERCENTILE_DISC is not scalable, as it involves expensive operations such as windowing and sorting.

Describe the solution you'd like

We should instead calculate percentiles by taking samples from the input table. This will allows trade off btw accuracy and computational complexity (both time and space complexities). The parameter to control this trade off is the sampling size. The large the sampling size, the more accurate, but it will be more computational complex. If smaller sampling size will be more likely to have uneven sized partition (skew).

Describe alternatives you've considered

No response

Additional Context

No response

Would you like to implement a fix?

No