Open hendrikmakait opened 7 months ago
I don't see this as a bug at the moment: we aim to use only one `read_parquet` call per data source to avoid reading the same columns more than once. This is a specialised example, since here we actually could separate the reads, but I think it is a bit of an edge case. I wouldn't spend too much time on this right now, although I agree with you that we could be smarter about it. The bigger problem is that splitting loses value if there are operations between `read_parquet` and the column restriction (like `replace`, `shuffle`, ...), since we would then perform those operations twice. We can certainly improve this special case, but I'm not sure it would help us much in the grand scheme of things.
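To illustrate the trade-off described above, here is a minimal toy expression tree (all class names are hypothetical stand-ins, not dask-expr's actual API): with one shared read, an intervening `Replace` node exists once; splitting the read per consumer pushes the projections down but duplicates everything in between.

```python
from dataclasses import dataclass
from typing import Optional

# Toy expression nodes -- illustrative only, not dask-expr's API.
@dataclass
class ReadParquet:
    path: str
    columns: Optional[list] = None  # None means "read all columns"

@dataclass
class Replace:  # stands in for any op between the read and the projection
    frame: object

@dataclass
class Projection:
    frame: object
    columns: list

def collect(expr, kind, seen=None):
    """Collect the identities of all nodes of a given type in the plan."""
    seen = set() if seen is None else seen
    if isinstance(expr, kind):
        seen.add(id(expr))
    child = getattr(expr, "frame", None)
    if child is not None:
        collect(child, kind, seen)
    return seen

# One shared read feeding two projections through a Replace:
shared = Replace(ReadParquet("lineitem.parquet"))
a = Projection(shared, ["l_orderkey"])
b = Projection(shared, ["l_suppkey"])

# Splitting the read per consumer narrows each read, but the Replace
# in between is now materialised once per branch:
split_a = Projection(Replace(ReadParquet("lineitem.parquet", ["l_orderkey"])), ["l_orderkey"])
split_b = Projection(Replace(ReadParquet("lineitem.parquet", ["l_suppkey"])), ["l_suppkey"])

assert len(collect(a, Replace) | collect(b, Replace)) == 1            # shared: Replace runs once
assert len(collect(split_a, Replace) | collect(split_b, Replace)) == 2  # split: Replace runs twice
```

So splitting the read only pays off when nothing expensive sits between the read and the column restriction.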
Maybe related to the new CI failure #1363?
From a preliminary look at the optimized graph, one issue might be that we don't properly push projections into the parquet reads:
Snippet from the graph:
I'd expect `ReadParquet` to only read `['l_orderkey', 'l_suppkey']`. Combined with https://github.com/dask-contrib/dask-expr/issues/854, this appears to be fairly catastrophic.
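The expected rewrite can be sketched as a single pushdown rule: a `Projection` sitting directly on a `ReadParquet` should fold into the read's column list. The names below are illustrative, not dask-expr's internals.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical expression nodes, mirroring the shape of the plan above.
@dataclass
class ReadParquet:
    path: str
    columns: Optional[list] = None  # None means "read all columns"

@dataclass
class Projection:
    frame: object
    columns: list

def push_projection(expr):
    """Fold Projection(ReadParquet(...)) into ReadParquet(columns=...)."""
    if isinstance(expr, Projection) and isinstance(expr.frame, ReadParquet):
        return ReadParquet(expr.frame.path, columns=list(expr.columns))
    return expr  # leave anything else untouched

plan = Projection(ReadParquet("lineitem.parquet"), ["l_orderkey", "l_suppkey"])
optimized = push_projection(plan)
assert isinstance(optimized, ReadParquet)
assert optimized.columns == ["l_orderkey", "l_suppkey"]
```

After this rewrite the scan itself only touches the two requested columns instead of the full table.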