Open hendrikmakait opened 9 months ago
If the path is a directory, it might be faster and more scalable to ask the filesystem (POSIX or object store) for the size of the directory.
Yep, this should cover us for the case of a lot of tiny files.
We have to perform a list operation on all stores right now unless a metadata file is provided. This list operation already includes file sizes, even without accessing Parquet metadata.
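To make the idea concrete, here is a minimal sketch of using the sizes from a listing to plan partitions without opening any Parquet footers. It is not how dask-expr does it today; `list_parquet_sizes`, `group_by_size`, and `target_bytes` are hypothetical names, and it walks a local (POSIX) directory with the standard library only. An object store would need the equivalent listing call of its own client. Note that on-disk size is the compressed size, so it is only a rough proxy for in-memory partition size.

```python
import os


def list_parquet_sizes(root):
    """Return (path, size_in_bytes) for every .parquet file under root, sorted by path."""
    files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".parquet"):
                path = os.path.join(dirpath, name)
                # Size comes from filesystem metadata; the file is never opened.
                files.append((path, os.path.getsize(path)))
    return sorted(files)


def group_by_size(files, target_bytes):
    """Greedily pack consecutive files into groups of at most target_bytes.

    A single file larger than target_bytes ends up in a group of its own.
    """
    groups = []
    current, current_size = [], 0
    for path, size in files:
        # Start a new group if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

Each resulting group could then be read as one partition, so many tiny files get fused together while large files stay separate.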
At the moment, dask-expr struggles to deal with imbalanced (https://github.com/coiled/benchmarks/issues/1367#issuecomment-1936080012) or very small (https://github.com/coiled/benchmarks/issues/1381) partitions. We should improve this, which likely requires overhauling the Parquet reading to collect better statistics.