coiled / benchmarks

BSD 3-Clause "New" or "Revised" License

[TPC-H] Query 21 times out at scale 100 #1362

Open hendrikmakait opened 7 months ago

hendrikmakait commented 7 months ago

From a preliminary look at the optimized graph, one issue might be that we don't properly push projections into the parquet reads:

Snippet from the graph:

Projection: columns=['l_orderkey', 'l_suppkey']
    FusedIO:
        ReadParquet: path='./tpch-data/scale-10/lineitem' columns=['l_orderkey', 'l_suppkey', 'l_commitdate', 'l_receiptdate'] filesystem=None kwargs={'dtype_backend': None}

I'd expect ReadParquet to only read ['l_orderkey', 'l_suppkey']. Combined with https://github.com/dask-contrib/dask-expr/issues/854, this appears to be fairly catastrophic.
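To make the expected optimization concrete, here is a minimal sketch of projection pushdown over a toy expression tree. This is not dask-expr's real API; the `ReadParquet`/`Projection` classes and the `push_projection` rule are simplified stand-ins that only illustrate the rewrite the report expects: a `Projection` sitting on top of a `ReadParquet` should narrow the columns the read materializes.

```python
from dataclasses import dataclass

@dataclass
class ReadParquet:
    path: str
    columns: list  # columns the read will materialize

@dataclass
class Projection:
    child: ReadParquet
    columns: list  # columns the consumer actually needs

def push_projection(expr):
    """Rewrite Projection(ReadParquet) so the read loads only the
    projected columns, mirroring the pushdown the issue expects."""
    if isinstance(expr, Projection) and isinstance(expr.child, ReadParquet):
        narrowed = [c for c in expr.child.columns if c in expr.columns]
        return ReadParquet(expr.child.path, narrowed)
    return expr

read = ReadParquet("./tpch-data/scale-10/lineitem",
                   ["l_orderkey", "l_suppkey", "l_commitdate", "l_receiptdate"])
plan = Projection(read, ["l_orderkey", "l_suppkey"])
optimized = push_projection(plan)
print(optimized.columns)  # only the two columns the projection needs
```

In the reported graph this rewrite does not fire, so `ReadParquet` still lists all four columns.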

phofl commented 7 months ago

I don't see this as a bug at the moment: we aim to use only one read_parquet call per data source to avoid reading the same columns more than once. This is a specialized example where we actually could separate the reads, but I think it is a bit of an edge case, so I wouldn't spend too much time on it right now, although I agree with you that we could be smarter about it. The bigger problem is that this optimization loses value if there are operations between read_parquet and the column restriction (like replace, shuffle, ...), since we would then do those operations twice. We can certainly handle this special case better, but I am not sure it would help us much in the grand scheme of things.
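A small sketch of the "one read per source" policy described above (a simplified model, not dask-expr's actual implementation): when several consumers project different columns from the same parquet source, the single shared read must load the union of their column needs, which is why the read in the reported graph carries more columns than any one consumer.

```python
def shared_read_columns(consumer_projections):
    """Return the column list a single shared read must load so that
    every consumer is satisfied and no column is read twice."""
    needed = []
    for cols in consumer_projections:
        for c in cols:
            if c not in needed:
                needed.append(c)
    return needed

# Hypothetical example: two consumers of lineitem project different columns.
union = shared_read_columns([
    ["l_orderkey", "l_suppkey"],
    ["l_orderkey", "l_suppkey", "l_commitdate", "l_receiptdate"],
])
print(union)  # the shared read loads the union of both projections
```

Splitting this into two separate reads would let each read stay narrow, but any operations sitting between the read and the projections would then run once per read.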

milesgranger commented 7 months ago

Maybe related to the new CI failure #1363?