Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0
2.39k stars 170 forks source link

[CHORE] Expose read_sql partition bound strategy and default to min-max #3246

Closed colin-ho closed 2 weeks ago

colin-ho commented 3 weeks ago

Currently, read_sql calculates partition bounds using the PERCENTILE_DISC function. However, this function does not scale well to large tables, as it is an expensive window + sort function. A better alternative is to take samples, then estimate partition bounds, as described in this issue: https://github.com/Eventual-Inc/Daft/issues/3245.

In the meantime, we should default to using the min-max calculations instead, which was previously the fallback option.

codspeed-hq[bot] commented 3 weeks ago

CodSpeed Performance Report

Merging #3246 will improve performances by 50.05%

Comparing colin/fix-readsql-partition-bounds (4fea9d7) with main (6e28b3f)

Summary

⚡ 1 improvements ✅ 16 untouched benchmarks

Benchmarks breakdown

Benchmark main colin/fix-readsql-partition-bounds Change
test_show[100 Small Files] 50.1 ms 33.4 ms +50.05%
codecov[bot] commented 3 weeks ago

Codecov Report

Attention: Patch coverage is 16.66667% with 35 lines in your changes missing coverage. Please review.

Project coverage is 78.52%. Comparing base (2b71ffb) to head (4fea9d7). Report is 19 commits behind head on main.

Files with missing lines Patch % Lines
daft/sql/sql_scan.py 14.63% 35 Missing :warning:
Additional details and impacted files [![Impacted file tree graph](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3246/graphs/tree.svg?width=650&height=150&src=pr&token=J430QVFE89&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc)](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3246?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc) ```diff @@ Coverage Diff @@ ## main #3246 +/- ## ========================================== - Coverage 79.13% 78.52% -0.62% ========================================== Files 640 641 +1 Lines 77983 79095 +1112 ========================================== + Hits 61715 62107 +392 - Misses 16268 16988 +720 ``` | [Files with missing lines](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3246?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc) | Coverage Δ | | |---|---|---| | [daft/io/\_sql.py](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3246?src=pr&el=tree&filepath=daft%2Fio%2F_sql.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-ZGFmdC9pby9fc3FsLnB5) | `51.85% <100.00%> (ø)` | | | [daft/sql/sql\_scan.py](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3246?src=pr&el=tree&filepath=daft%2Fsql%2Fsql_scan.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-ZGFmdC9zcWwvc3FsX3NjYW4ucHk=) | `25.54% <14.63%> (-0.02%)` | :arrow_down: | ... and [34 files with indirect coverage changes](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3246/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc)
colin-ho commented 2 weeks ago

Oops i forgot to mark this as ready @desmondcheongzx :P