Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

Fix poor performance on (local) Parquet files with many rowgroups #2257

Open jaychia opened 1 month ago

jaychia commented 1 month ago

Describe the bug

Daft's local Parquet reader is slow when reading Parquet files that contain many small rowgroups. The Polars Parquet writer currently produces files like this (a sample file is linked below for reference), and this appears to be a corner case that Daft does not handle well.

Here is a sample file that will reproduce the issue: s3://daft-public-datasets/testing_data/lineitem.parquet

universalmind303 commented 1 month ago

@jaychia this file does not appear to be public:

 > aws s3 cp s3://daft-public-datasets/testing_data/lineitem.parquet ./lineitem.parquet --no-sign-request
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

jaychia commented 1 month ago

Just copied it to s3://daft-public-data/testing_data/bad-polars-lineitem.parquet, which is in our fully public bucket. Let me know if it's accessible!

universalmind303 commented 2 weeks ago

I haven't yet been able to identify a single bottleneck, but it seems like there are at least a few culprits.

I've shared a few notes in the Daft Slack channel: https://dist-data.slack.com/archives/C052CA6Q9N1/p1716496836116429