Closed: jaychia closed this PR 2 days ago
Comparing jay/better-scan-task-estimations-2 (b6a7b7f) with main (60ae62f)

⚡ 1 improvement
✅ 16 untouched benchmarks
| Benchmark | main | jay/better-scan-task-estimations-2 | Change |
|---|---|---|---|
| ⚡ test_iter_rows_first_row[100 Small Files] | 388.4 ms | 273.1 ms | +42.24% |
@jaychia is this ready for review? Looks like a lot of tests are still failing
Sorry, thanks for calling me out -- I have to do some more refactors to this PR. Taking this back into draft mode and un-requesting reviews.
Attention: Patch coverage is 96.47887% with 15 lines in your changes missing coverage. Please review.
Project coverage is 77.44%. Comparing base (b6695eb) to head (b6a7b7f). Report is 16 commits behind head on main.
Actually, I'm unhappy with this and think we need a more sophisticated approach. Closing this PR and going to start a new one.
The problem with the approach in this PR is that it only uses the FileMetadata, which unfortunately doesn't give us a good way of figuring out the size of the data after both decompression and decoding. More concretely, we need access to some of the data pages (the dictionary page being the most important one) in order to make good decisions.
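For illustration, here is a small standalone example of the gap described above, written with pyarrow in Python rather than Daft's Rust reader: the footer's per-column "uncompressed" sizes count encoded page bytes, so for a dictionary-encoded column they can be far smaller than the decoded in-memory size.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a column whose values dictionary-encode extremely well.
table = pa.table({"s": ["a_long_repeated_value"] * 1_000_000})
pq.write_table(table, "dict_encoded.parquet")

meta = pq.ParquetFile("dict_encoded.parquet").metadata
footer_uncompressed = sum(
    meta.row_group(rg).column(c).total_uncompressed_size
    for rg in range(meta.num_row_groups)
    for c in range(meta.row_group(rg).num_columns)
)
decoded_nbytes = pq.read_table("dict_encoded.parquet").nbytes

# The footer's "uncompressed" size is measured after encoding, so it can be
# orders of magnitude smaller than the decoded in-memory size.
print(footer_uncompressed, decoded_nbytes)
```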
Adds better estimation of the materialized bytes in memory for a given Parquet ScanTask.
We do this by reusing the same Parquet metadata that we already fetch for schema inference. From that metadata we read fields such as the reported uncompressed_size and compressed_size of each column chunk, and then use these statistics, together with the file's size on disk, to estimate the materialized size of the data when reading the Parquet file.
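As a rough sketch of the idea (in Python with pyarrow, not the Rust implementation in this PR; `estimate_materialized_bytes` is just an illustrative name), one could derive an inflation ratio from the footer's per-column-chunk statistics and apply it to the file's size on disk:

```python
import os

import pyarrow.parquet as pq


def estimate_materialized_bytes(path: str) -> int:
    """Estimate decompressed bytes by scaling the file's on-disk size."""
    metadata = pq.ParquetFile(path).metadata
    compressed = 0
    uncompressed = 0
    for rg in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)
            compressed += chunk.total_compressed_size
            uncompressed += chunk.total_uncompressed_size
    # Inflation factor due to decompression, taken from the footer statistics.
    ratio = uncompressed / compressed if compressed else 1.0
    # Scale the on-disk size by that ratio to estimate materialized bytes.
    return int(os.path.getsize(path) * ratio)
```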
TODOs:

- Reading Parquet data goes compressed -> uncompressed -> decoded, and I think we still need to account for encoding here when thinking about how much memory this data will take up when decoded into Daft Series (see the sketch below).
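A hedged sketch of what accounting for encoding might look like (again in Python with pyarrow; the 4x inflation factor for dictionary-encoded chunks is purely illustrative and not a value from this PR):

```python
import pyarrow.parquet as pq

DICTIONARY_ENCODINGS = {"PLAIN_DICTIONARY", "RLE_DICTIONARY"}


def estimate_decoded_bytes(path: str) -> float:
    """Per-column estimate that inflates dictionary-encoded chunks on decode."""
    metadata = pq.ParquetFile(path).metadata
    estimate = 0.0
    for rg in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)
            # Assumed heuristic: dictionary-encoded pages blow up more when
            # decoded into a plain in-memory Series. 4.0 is a placeholder.
            if DICTIONARY_ENCODINGS & set(chunk.encodings):
                inflation = 4.0
            else:
                inflation = 1.0
            estimate += chunk.total_uncompressed_size * inflation
    return estimate
```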