dask / dask-expr

BSD 3-Clause "New" or "Revised" License

Improve handling of imbalanced or small partitions #869

Open hendrikmakait opened 9 months ago

hendrikmakait commented 9 months ago

At the moment, dask-expr struggles to deal with imbalanced (https://github.com/coiled/benchmarks/issues/1367#issuecomment-1936080012) or very small (https://github.com/coiled/benchmarks/issues/1381) partitions. We should improve this, which likely requires overhauling the Parquet reading to collect better statistics.

mrocklin commented 9 months ago

If the path is a directory, it might be faster and more scalable to ask the filesystem (POSIX or object store) for the size of the directory.

phofl commented 9 months ago

Yep, this should cover us for a lot of tiny files.

fjetter commented 9 months ago

We have to perform a list operation on all stores right now unless a metadata file is provided. This list operation includes file sizes even without accessing Parquet metadata.
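
To illustrate the idea discussed above, here is a minimal sketch of how file sizes from a plain listing could be used to pack many tiny files into fewer partitions without opening any Parquet footers. This is not dask-expr's implementation: `list_file_sizes`, `group_by_size`, and the `target_bytes` parameter are hypothetical names, and the sketch uses only the local filesystem via the standard library. An object store would return sizes from its own list call instead. On-disk size is also only a rough proxy for in-memory size, since Parquet data is usually compressed.

```python
import os


def list_file_sizes(root: str, suffix: str = ".parquet") -> list[tuple[str, int]]:
    """Return (path, size_in_bytes) for matching files under root, sorted by path.

    Only filesystem metadata is read; no file is opened.
    """
    out = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                path = os.path.join(dirpath, name)
                out.append((path, os.stat(path).st_size))
    return sorted(out)


def group_by_size(files: list[tuple[str, int]], target_bytes: int) -> list[list[str]]:
    """Greedily pack consecutive files into groups of at most target_bytes.

    A single file larger than target_bytes ends up in a group of its own.
    """
    groups: list[list[str]] = []
    current: list[str] = []
    current_bytes = 0
    for path, size in files:
        # Start a new group when adding this file would exceed the target.
        if current and current_bytes + size > target_bytes:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        groups.append(current)
    return groups
```

Each resulting group could then be read as one partition, for example `group_by_size(list_file_sizes("data/"), target_bytes=128 * 2**20)`, where the path and the 128 MiB target are placeholders. Splitting oversized files would still need row-group statistics, which this sketch does not attempt.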