Closed colin-ho closed 2 weeks ago
Comparing colin/fix-readsql-partition-bounds (4fea9d7) with main (6e28b3f)

⚡ 1 improvement
✅ 16 untouched benchmarks

| | Benchmark | main | colin/fix-readsql-partition-bounds | Change |
|---|---|---|---|---|
| ⚡ | test_show[100 Small Files] | 50.1 ms | 33.4 ms | +50.05% |

Attention: Patch coverage is 16.66667% with 35 lines in your changes missing coverage. Please review.
Project coverage is 78.52%. Comparing base (2b71ffb) to head (4fea9d7). Report is 19 commits behind head on main.

Files with missing lines | Patch % | Lines |
---|---|---|
daft/sql/sql_scan.py | 14.63% | 35 Missing :warning: |

Oops, I forgot to mark this as ready @desmondcheongzx :P

Currently, `read_sql` calculates partition bounds using the `PERCENTILE_DISC` function. However, this function does not scale well to large tables, as it is an expensive window + sort operation. A better alternative is to take samples and then estimate partition bounds from them, as described in this issue: https://github.com/Eventual-Inc/Daft/issues/3245.

In the meantime, we should default to the min-max calculation instead, which was previously the fallback option.
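
For readers unfamiliar with the min-max approach, here is a minimal sketch of the idea, not the actual code in `daft/sql/sql_scan.py`. It assumes the database has already returned `MIN(col)` and `MAX(col)` for a numeric partition column; the function names `min_max_partition_bounds` and `partition_predicates` are hypothetical and exist only for illustration.

```python
from __future__ import annotations


def min_max_partition_bounds(
    min_val: float, max_val: float, num_partitions: int
) -> list[float]:
    """Split [min_val, max_val] into equal-width ranges.

    Hypothetical helper: returns num_partitions + 1 boundary values.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if max_val < min_val:
        raise ValueError("max_val must be >= min_val")

    step = (max_val - min_val) / num_partitions
    bounds = [min_val + i * step for i in range(num_partitions)]
    # Append the exact maximum so rounding never excludes the last row.
    bounds.append(float(max_val))
    return bounds


def partition_predicates(column: str, bounds: list[float]) -> list[str]:
    """Turn consecutive bounds into one WHERE predicate per partition."""
    predicates = []
    last = len(bounds) - 2
    for i, (lo, hi) in enumerate(zip(bounds, bounds[1:])):
        # Only the final range is closed on the right, so each row
        # falls into exactly one partition.
        upper_op = "<=" if i == last else "<"
        predicates.append(f"{column} >= {lo} AND {column} {upper_op} {hi}")
    return predicates


bounds = min_max_partition_bounds(0, 100, 4)
predicates = partition_predicates("id", bounds)
```

This only needs a cheap `MIN`/`MAX` aggregate rather than a sort, which is why it scales better. The tradeoff is that equal-width ranges can produce unevenly sized partitions when the column is skewed, which is what the sampling approach in the linked issue is meant to address.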