coiled / benchmarks

BSD 3-Clause "New" or "Revised" License

[TPC-H] Query 15 raises `ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index.` at scale 100 #1361

Closed hendrikmakait closed 7 months ago

hendrikmakait commented 7 months ago

Full traceback:

________________________ test_query_15 ________________________
[gw2] darwin -- Python 3.11.7 /opt/homebrew/Caskroom/mambaforge/base/envs/tpch/bin/python3.11

client = <Client: 'tls://10.0.42.115:8786' processes=16 threads=32, memory=114.61 GiB>
dataset_path = 's3://coiled-runtime-ci/tpc-h/snappy/scale-100/'
fs = None

    @pytest.mark.shuffle_p2p
    def test_query_15(client, dataset_path, fs):
>       dask_queries.query_15(dataset_path, fs).compute()

tests/tpch/test_dask.py:79: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/tpch/dask_queries.py:799: in query_15
    return table[table.total_revenue == revenue.total_revenue.max()][
/opt/homebrew/Caskroom/mambaforge/base/envs/tpch/lib/python3.11/site-packages/dask_expr/_collection.py:377: in __getitem__
    return new_collection(self.expr.__getitem__(other.expr))
/opt/homebrew/Caskroom/mambaforge/base/envs/tpch/lib/python3.11/site-packages/dask_expr/_expr.py:114: in __getitem__
    frame, other, divisions=calc_divisions_for_align(frame, other)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

exprs = (Merge(6f910c2), Merge(6f910c2)['total_revenue'] == ((RenameFrame(frame=ResetIndex(frame=ToFrame(frame=Sum(frame=(Assi... False}, _slice=revenue))), columns={'revenue': 'total_revenue', 'l_suppkey': 'supplier_no'}))['total_revenue']).max())
dfs = [Merge(6f910c2), Merge(6f910c2)['total_revenue'] == ((RenameFrame(frame=ResetIndex(frame=ToFrame(frame=Sum(frame=(Assi... False}, _slice=revenue))), columns={'revenue': 'total_revenue', 'l_suppkey': 'supplier_no'}))['total_revenue']).max()]

    def calc_divisions_for_align(*exprs):
        dfs = [df for df in exprs if isinstance(df, Expr) and df.ndim > 0]
        if not all(df.known_divisions for df in dfs):
            are_co_aligned(*exprs)
>           raise ValueError(
                "Not all divisions are known, can't align "
                "partitions. Please use `set_index` "
                "to set the index."
            )
E           ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.

/opt/homebrew/Caskroom/mambaforge/base/envs/tpch/lib/python3.11/site-packages/dask_expr/_expr.py:3460: ValueError
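
For readers skimming the traceback, here is a minimal pure-Python sketch of the control flow it shows. Both operands of the filter derive from the same `Merge(6f910c2)` expression, neither has known divisions, and the return value of `are_co_aligned` appears to be discarded before the `ValueError` is raised. `ToyExpr`, `toy_are_co_aligned`, and both check functions are hypothetical stand-ins, not dask-expr APIs, and the second check is only an assumed alternative; whether dask-expr takes that approach is not confirmed in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyExpr:
    """Hypothetical stand-in for a dask-expr expression (not the real class)."""
    name: str
    root: str              # partitioned frame this expression derives from
    ndim: int              # 0 for scalar reductions, > 0 for frames and series
    known_divisions: bool


def toy_are_co_aligned(*exprs):
    """Toy rule: non-scalar expressions sharing one root are co-aligned."""
    roots = {e.root for e in exprs if e.ndim > 0}
    return len(roots) <= 1


def check_as_in_traceback(*exprs):
    """Mirrors the flow in the traceback: the co-alignment result is unused."""
    dfs = [e for e in exprs if e.ndim > 0]
    if not all(e.known_divisions for e in dfs):
        toy_are_co_aligned(*exprs)  # return value discarded
        raise ValueError("Not all divisions are known, can't align partitions.")
    return True


def check_honoring_co_alignment(*exprs):
    """Assumed alternative: only raise when operands are not co-aligned."""
    dfs = [e for e in exprs if e.ndim > 0]
    if not all(e.known_divisions for e in dfs):
        if toy_are_co_aligned(*exprs):
            return True
        raise ValueError("Not all divisions are known, can't align partitions.")
    return True


# The two operands from the traceback: the merged table and a boolean mask
# computed from that same table. Neither has known divisions.
table = ToyExpr("Merge(6f910c2)", root="Merge(6f910c2)", ndim=2,
                known_divisions=False)
mask = ToyExpr("Merge(6f910c2)['total_revenue'] == max", root="Merge(6f910c2)",
               ndim=1, known_divisions=False)

try:
    check_as_in_traceback(table, mask)
    raised = False
except ValueError:
    raised = True

accepted = check_honoring_co_alignment(table, mask)
```

In this toy model the first check rejects the filter even though the mask is built from the table itself, while the second accepts it because both operands share one root.
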

phofl commented 7 months ago

Sorry for those, will take a look tomorrow morning.

Looks like our test coverage could be better in those areas… I would have expected dask/dask to cover this.

Hendrik Makait wrote on Tue, Feb 6, 2024 at 19:45:

> Full traceback: [...]

hendrikmakait commented 7 months ago

Closed as completed by https://github.com/dask-contrib/dask-expr/pull/855