Open jaychia opened 1 month ago
@jaychia this file appears to not be public
> aws s3 cp s3://daft-public-datasets/testing_data/lineitem.parquet ./lineitem.parquet --no-sign-request
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden
Just copied it to s3://daft-public-data/testing_data/bad-polars-lineitem.parquet
which is our fully public bucket. Let me know if it's accessible!
I haven't yet been able to identify a single bottleneck, but it seems like there are at least a few culprits.
concat (I think this is the biggest one)
I've shared a few notes in the Daft Slack channel: https://dist-data.slack.com/archives/C052CA6Q9N1/p1716496836116429
Describe the bug
Daft's local Parquet reader is slow when reading Parquet files that contain many small row groups. The Polars Parquet writer currently produces files like this (a sample file is linked below for reference), and this appears to be a corner case that Daft does not handle well.
Here is a sample file that will reproduce the issue:
s3://daft-public-datasets/testing_data/lineitem.parquet