Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
Apache License 2.0

Fix poor performance on (local) Parquet files with many rowgroups #2257

Open jaychia opened 1 month ago

jaychia commented 1 month ago

Describe the bug

Daft's local Parquet reader is slow when reading Parquet files with many small rowgroups. The Polars Parquet writer currently produces files like this (a sample file is attached for reference), and this appears to be a corner case that Daft does not handle well.

Here is a sample file that will reproduce the issue: s3://daft-public-datasets/testing_data/lineitem.parquet
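
For anyone who wants a quick look at a local copy of the file before profiling, here is a minimal sketch using only the Python standard library. It relies on the Parquet file layout, where a file ends with a 4-byte little-endian footer length followed by the `PAR1` magic bytes. Since the footer carries metadata for every rowgroup, its size is a rough proxy for how many rowgroups the reader has to deal with. The function name `parquet_footer_length` is made up for illustration and is not part of Daft or any library, and the sketch assumes a regular (unencrypted-footer) Parquet file.

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"


def parquet_footer_length(f):
    """Return the size in bytes of the footer metadata of a Parquet file.

    `f` is a seekable binary file object. Hypothetical helper for
    illustration only; it does not decode the footer itself.
    """
    f.seek(0, io.SEEK_END)
    size = f.tell()
    # Smallest possible layout: leading magic (4) + footer length (4) + trailing magic (4).
    if size < 12:
        raise ValueError("too small to be a Parquet file")

    # The last 8 bytes are: 4-byte little-endian footer length, then the magic bytes.
    f.seek(-8, io.SEEK_END)
    tail = f.read(8)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic at end of file")

    (footer_len,) = struct.unpack("<I", tail[:4])
    return footer_len
```

Calling this on a downloaded copy opened with `open(path, "rb")` and comparing the result against a file of similar size with fewer, larger rowgroups gives a cheap first signal. It says nothing about where the time is actually spent in the reader.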

universalmind303 commented 1 month ago

@jaychia this file does not appear to be public:

 > aws s3 cp s3://daft-public-datasets/testing_data/lineitem.parquet ./lineitem.parquet --no-sign-request
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

jaychia commented 1 month ago

Just copied it to s3://daft-public-data/testing_data/bad-polars-lineitem.parquet which is our fully public bucket. Let me know if it's accessible!

universalmind303 commented 2 weeks ago

I haven't yet been able to identify a single bottleneck, but it seems like there are at least a few culprits.

I've shared a few notes in the Daft Slack channel: https://dist-data.slack.com/archives/C052CA6Q9N1/p1716496836116429