Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0
2.35k stars 166 forks source link

[FEAT] Estimate materialized size of ScanTask better from Parquet reads #3302

Closed jaychia closed 2 days ago

jaychia commented 1 week ago

Adds better estimation of the materialized bytes in memory for a given Parquet ScanTask.

We do this by making use of the same Parquet metadata that we use for schema inference. We take a look at the metadata and make use of various fields such as the reported uncompressed_size, compressed_size of each column. Then we use these statistics to estimate the materialized size of data for reading a Parquet file, using its size on disk.

TODOs:

codspeed-hq[bot] commented 1 week ago

CodSpeed Performance Report

Merging #3302 will improve performances by 42.24%

Comparing jay/better-scan-task-estimations-2 (b6a7b7f) with main (60ae62f)

Summary

⚡ 1 improvements
✅ 16 untouched benchmarks

Benchmarks breakdown

Benchmark main jay/better-scan-task-estimations-2 Change
test_iter_rows_first_row[100 Small Files] 388.4 ms 273.1 ms +42.24%
kevinzwang commented 6 days ago

@jaychia is this ready for review? Looks like a lot of tests are still failing

jaychia commented 5 days ago

@jaychia is this ready for review? Looks like a lot of tests are still failing

Sorry, thanks for calling me out -- I have to do some more refactors to this PR. Taking this back into draft mode and un-requesting reviews.

codecov[bot] commented 4 days ago

Codecov Report

Attention: Patch coverage is 96.47887% with 15 lines in your changes missing coverage. Please review.

Project coverage is 77.44%. Comparing base (b6695eb) to head (b6a7b7f). Report is 16 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-scan/src/glob.rs 91.02% 7 Missing :warning:
src/daft-scan/src/size_estimations.rs 98.12% 6 Missing :warning:
src/daft-scan/src/anonymous.rs 0.00% 1 Missing :warning:
src/daft-scan/src/python.rs 66.66% 1 Missing :warning:
Additional details and impacted files [![Impacted file tree graph](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302/graphs/tree.svg?width=650&height=150&src=pr&token=J430QVFE89&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc)](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc) ```diff @@ Coverage Diff @@ ## main #3302 +/- ## ========================================== + Coverage 77.39% 77.44% +0.05% ========================================== Files 678 686 +8 Lines 83300 84006 +706 ========================================== + Hits 64469 65060 +591 - Misses 18831 18946 +115 ``` | [Files with missing lines](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc) | Coverage Δ | | |---|---|---| | [src/daft-micropartition/src/micropartition.rs](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302?src=pr&el=tree&filepath=src%2Fdaft-micropartition%2Fsrc%2Fmicropartition.rs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-c3JjL2RhZnQtbWljcm9wYXJ0aXRpb24vc3JjL21pY3JvcGFydGl0aW9uLnJz) | `90.85% <100.00%> (+0.03%)` | :arrow_up: | | [src/daft-micropartition/src/ops/cast\_to\_schema.rs](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302?src=pr&el=tree&filepath=src%2Fdaft-micropartition%2Fsrc%2Fops%2Fcast_to_schema.rs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-c3JjL2RhZnQtbWljcm9wYXJ0aXRpb24vc3JjL29wcy9jYXN0X3RvX3NjaGVtYS5ycw==) | `100.00% <100.00%> (ø)` | | | [src/daft-scan/src/lib.rs](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302?src=pr&el=tree&filepath=src%2Fdaft-scan%2Fsrc%2Flib.rs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-c3JjL2RhZnQtc2Nhbi9zcmMvbGliLnJz) | `61.14% <100.00%> (+0.87%)` | :arrow_up: | | [src/daft-scan/src/scan\_task\_iters.rs](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302?src=pr&el=tree&filepath=src%2Fdaft-scan%2Fsrc%2Fscan_task_iters.rs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-c3JjL2RhZnQtc2Nhbi9zcmMvc2Nhbl90YXNrX2l0ZXJzLnJz) | `97.01% <100.00%> (+0.06%)` | :arrow_up: | | [src/daft-scan/src/anonymous.rs](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302?src=pr&el=tree&filepath=src%2Fdaft-scan%2Fsrc%2Fanonymous.rs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-c3JjL2RhZnQtc2Nhbi9zcmMvYW5vbnltb3VzLnJz) | `0.00% <0.00%> (ø)` | | | [src/daft-scan/src/python.rs](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302?src=pr&el=tree&filepath=src%2Fdaft-scan%2Fsrc%2Fpython.rs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-c3JjL2RhZnQtc2Nhbi9zcmMvcHl0aG9uLnJz) | `76.64% <66.66%> (-0.09%)` | :arrow_down: | | [src/daft-scan/src/size\_estimations.rs](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302?src=pr&el=tree&filepath=src%2Fdaft-scan%2Fsrc%2Fsize_estimations.rs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-c3JjL2RhZnQtc2Nhbi9zcmMvc2l6ZV9lc3RpbWF0aW9ucy5ycw==) | `98.12% <98.12%> (ø)` | | | [src/daft-scan/src/glob.rs](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302?src=pr&el=tree&filepath=src%2Fdaft-scan%2Fsrc%2Fglob.rs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-c3JjL2RhZnQtc2Nhbi9zcmMvZ2xvYi5ycw==) | `90.78% <91.02%> (+0.50%)` | :arrow_up: | ... and [25 files with indirect coverage changes](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc)

🚨 Try these New Features:

jaychia commented 2 days ago

Actually, I'm unhappy with this approach and think we need a more sophisticated approach. Closing this PR and going to start a new one.

The problem with the approach in this PR is that it only uses the FileMetadata, which unfortunately doesn't give us a good mechanism of actually figuring out the size of data after both decompression and decoding. More concretely, we need access to some of the data pages (the dictionary page being the most important one) in order to make good decisions.