[FEAT] Estimate materialized size of ScanTask better from Parquet reads

jaychia commented 1 week ago

Adds better estimation of the materialized bytes in memory for a given Parquet ScanTask.

We do this by making use of the same Parquet metadata that we use for schema inference. We take a look at the metadata and make use of various fields such as the reported uncompressed_size, compressed_size of each column. Then we use these statistics to estimate the materialized size of data for reading a Parquet file, using its size on disk.

TODOs:

[ ] Account for dictionary encoding (and potentially other forms of encoding) -- when we read Parquet we go from compressed -> uncompressed -> decoded, and I think we still need to account for encoding here when thinking about how much memory this data will take up when decoded into Daft Series.

codspeed-hq[bot] commented 1 week ago

CodSpeed Performance Report

Merging #3302 will improve performances by 42.24%

_{Comparing jay/better-scan-task-estimations-2 (b6a7b7f) with main (60ae62f)}

Summary

⚡ 1 improvements
✅ 16 untouched benchmarks

Benchmarks breakdown

	Benchmark	`main`	`jay/better-scan-task-estimations-2`	Change
⚡	`test_iter_rows_first_row[100 Small Files]`	388.4 ms	273.1 ms	+42.24%

kevinzwang commented 6 days ago

@jaychia is this ready for review? Looks like a lot of tests are still failing

jaychia commented 5 days ago

@jaychia is this ready for review? Looks like a lot of tests are still failing

Sorry, thanks for calling me out -- I have to do some more refactors to this PR. Taking this back into draft mode and un-requesting reviews.

codecov[bot] commented 4 days ago

Codecov Report

Attention: Patch coverage is 96.47887% with 15 lines in your changes missing coverage. Please review.

Project coverage is 77.44%. Comparing base (b6695eb) to head (b6a7b7f). Report is 16 commits behind head on main.

Files with missing lines	Patch %	Lines
src/daft-scan/src/glob.rs	91.02%	7 Missing :warning:
src/daft-scan/src/size_estimations.rs	98.12%	6 Missing :warning:
src/daft-scan/src/anonymous.rs	0.00%	1 Missing :warning:
src/daft-scan/src/python.rs	66.66%	1 Missing :warning:

Additional details and impacted files

[![Impacted file tree graph](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302/graphs/tree.svg?width=650&height=150&src=pr&token=J430QVFE89&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc)](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc) ```diff @@ Coverage Diff @@ ## main #3302 +/- ## ========================================== + Coverage 77.39% 77.44% +0.05% ========================================== Files 678 686 +8 Lines 83300 84006 +706 ========================================== + Hits 64469 65060 +591 - Misses 18831 18946 +115 ``` | [Files with missing lines](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc) | Coverage Δ | | |---|---|---| | [src/daft-micropartition/src/micropartition.rs](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302?src=pr&el=tree&filepath=src%2Fdaft-micropartition%2Fsrc%2Fmicropartition.rs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-c3JjL2RhZnQtbWljcm9wYXJ0aXRpb24vc3JjL21pY3JvcGFydGl0aW9uLnJz) | `90.85% <100.00%> (+0.03%)` | :arrow_up: | | [src/daft-micropartition/src/ops/cast\_to\_schema.rs](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302?src=pr&el=tree&filepath=src%2Fdaft-micropartition%2Fsrc%2Fops%2Fcast_to_schema.rs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-c3JjL2RhZnQtbWljcm9wYXJ0aXRpb24vc3JjL29wcy9jYXN0X3RvX3NjaGVtYS5ycw==) | `100.00% <100.00%> (ø)` | | | [src/daft-scan/src/lib.rs](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302?src=pr&el=tree&filepath=src%2Fdaft-scan%2Fsrc%2Flib.rs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-c3JjL2RhZnQtc2Nhbi9zcmMvbGliLnJz) | `61.14% <100.00%> (+0.87%)` | :arrow_up: | | [src/daft-scan/src/scan\_task\_iters.rs](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302?src=pr&el=tree&filepath=src%2Fdaft-scan%2Fsrc%2Fscan_task_iters.rs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-c3JjL2RhZnQtc2Nhbi9zcmMvc2Nhbl90YXNrX2l0ZXJzLnJz) | `97.01% <100.00%> (+0.06%)` | :arrow_up: | | [src/daft-scan/src/anonymous.rs](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302?src=pr&el=tree&filepath=src%2Fdaft-scan%2Fsrc%2Fanonymous.rs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-c3JjL2RhZnQtc2Nhbi9zcmMvYW5vbnltb3VzLnJz) | `0.00% <0.00%> (ø)` | | | [src/daft-scan/src/python.rs](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302?src=pr&el=tree&filepath=src%2Fdaft-scan%2Fsrc%2Fpython.rs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-c3JjL2RhZnQtc2Nhbi9zcmMvcHl0aG9uLnJz) | `76.64% <66.66%> (-0.09%)` | :arrow_down: | | [src/daft-scan/src/size\_estimations.rs](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302?src=pr&el=tree&filepath=src%2Fdaft-scan%2Fsrc%2Fsize_estimations.rs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-c3JjL2RhZnQtc2Nhbi9zcmMvc2l6ZV9lc3RpbWF0aW9ucy5ycw==) | `98.12% <98.12%> (ø)` | | | [src/daft-scan/src/glob.rs](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302?src=pr&el=tree&filepath=src%2Fdaft-scan%2Fsrc%2Fglob.rs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-c3JjL2RhZnQtc2Nhbi9zcmMvZ2xvYi5ycw==) | `90.78% <91.02%> (+0.50%)` | :arrow_up: | ... and [25 files with indirect coverage changes](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3302/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc)

🚨 Try these New Features:

Flaky Tests Detection - Detect and resolve failed and flaky tests

jaychia commented 2 days ago

Actually, I'm unhappy with this approach and think we need a more sophisticated approach. Closing this PR and going to start a new one.

The problem with the approach in this PR is that it only uses the FileMetadata, which unfortunately doesn't give us a good mechanism of actually figuring out the size of data after both decompression and decoding. More concretely, we need access to some of the data pages (the dictionary page being the most important one) in order to make good decisions.

Eventual-Inc / Daft