coiled / benchmarks

BSD 3-Clause "New" or "Revised" License
28 stars 17 forks source link

[TPC-H] Query 11 raises `ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index.` at scale 100 #1360

Closed hendrikmakait closed 7 months ago

hendrikmakait commented 7 months ago

Full traceback:

________________________ test_query_11 ________________________
[gw2] darwin -- Python 3.11.7 /opt/homebrew/Caskroom/mambaforge/base/envs/tpch/bin/python3.11

client = <Client: 'tls://10.0.42.115:8786' processes=16 threads=32, memory=114.61 GiB>
dataset_path = 's3://coiled-runtime-ci/tpc-h/snappy/scale-100/'
fs = None

    @pytest.mark.shuffle_p2p
    def test_query_11(client, dataset_path, fs):
>       dask_queries.query_11(dataset_path, fs).compute()

tests/tpch/test_dask.py:59: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/tpch/dask_queries.py:580: in query_11
    res[res > threshold]
/opt/homebrew/Caskroom/mambaforge/base/envs/tpch/lib/python3.11/site-packages/dask_expr/_collection.py:3677: in __getitem__
    return super().__getitem__(key)
/opt/homebrew/Caskroom/mambaforge/base/envs/tpch/lib/python3.11/site-packages/dask_expr/_collection.py:377: in __getitem__
    return new_collection(self.expr.__getitem__(other.expr))
/opt/homebrew/Caskroom/mambaforge/base/envs/tpch/lib/python3.11/site-packages/dask_expr/_expr.py:114: in __getitem__
    frame, other, divisions=calc_divisions_for_align(frame, other)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

exprs = (Sum(frame=(Assign(frame=Filter(frame=Merge(8315ad6), predicate=Merge(8315ad6)['n_name'] == GERMANY)))[['ps_partkey', ...ycost'] * (Filter(frame=Merge(8315ad6), predicate=Merge(8315ad6)['n_name'] == GERMANY))['ps_availqty']).sum() * 0.0001)
dfs = [Sum(frame=(Assign(frame=Filter(frame=Merge(8315ad6), predicate=Merge(8315ad6)['n_name'] == GERMANY)))[['ps_partkey', ...ycost'] * (Filter(frame=Merge(8315ad6), predicate=Merge(8315ad6)['n_name'] == GERMANY))['ps_availqty']).sum() * 0.0001]

    def calc_divisions_for_align(*exprs):
        dfs = [df for df in exprs if isinstance(df, Expr) and df.ndim > 0]
        if not all(df.known_divisions for df in dfs):
            are_co_aligned(*exprs)
>           raise ValueError(
                "Not all divisions are known, can't align "
                "partitions. Please use `set_index` "
                "to set the index."
            )
E           ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.

/opt/homebrew/Caskroom/mambaforge/base/envs/tpch/lib/python3.11/site-packages/dask_expr/_expr.py:3460: ValueError
hendrikmakait commented 7 months ago

Closed as completed by https://github.com/dask-contrib/dask-expr/pull/855