dask / dask-expr

Selecting from Index with duplicates raises or returns incorrect results #89

Open phofl opened 1 year ago

phofl commented 1 year ago
import pandas as pd
from dask_expr import from_pandas

df = pd.DataFrame({"a": [1, 2, 3], "bb": 1}, index=["a", "a", "b"])
ddf = from_pandas(df)

ddf.a["b"].compute()

This raises

Traceback (most recent call last):
  File "/Users/patrick/Library/Application Support/JetBrains/PyCharm2023.1/scratches/dask_epr.py", line 11, in <module>
    print(ddf.a["b"].compute())
          ^^^^^^^^^^^^^^^^^^^^
  File "/Users/patrick/mambaforge/envs/dask-expr/lib/python3.11/site-packages/dask/base.py", line 314, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/patrick/mambaforge/envs/dask-expr/lib/python3.11/site-packages/dask/base.py", line 583, in compute
    collections, repack = unpack_collections(*args, traverse=traverse)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/patrick/mambaforge/envs/dask-expr/lib/python3.11/site-packages/dask/base.py", line 474, in unpack_collections
    repack_dsk[out] = (tuple, [_unpack(i) for i in args])
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/patrick/mambaforge/envs/dask-expr/lib/python3.11/site-packages/dask/base.py", line 474, in <listcomp>
    repack_dsk[out] = (tuple, [_unpack(i) for i in args])
                               ^^^^^^^^^^
  File "/Users/patrick/mambaforge/envs/dask-expr/lib/python3.11/site-packages/dask/base.py", line 435, in _unpack
    if is_dask_collection(expr):
       ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/patrick/mambaforge/envs/dask-expr/lib/python3.11/site-packages/dask/base.py", line 186, in is_dask_collection
    return x.__dask_graph__() is not None
           ^^^^^^^^^^^^^^^^^^
  File "/Users/patrick/PycharmProjects/dask_dev/dask-expr/dask_expr/collection.py", line 86, in __dask_graph__
    out = out.simplify()
          ^^^^^^^^^^^^^^
  File "/Users/patrick/PycharmProjects/dask_dev/dask-expr/dask_expr/expr.py", line 227, in simplify
    out = expr._simplify_down()
          ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/patrick/PycharmProjects/dask_dev/dask-expr/dask_expr/expr.py", line 836, in _simplify_down
    assert a == b
           ^^^^^^
AssertionError

while

ddf.a["a"].compute()

returns every row, including the one labeled b, instead of only the two rows labeled a:

a    1
a    2
b    3
Name: a, dtype: int64
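
For comparison, here is a small sketch of what plain pandas does with the same frame, which is presumably the behavior dask-expr should match. The variable names (`dup`, `unique`, `only_b`) are illustrative and not part of the original report.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "bb": 1}, index=["a", "a", "b"])

# A duplicated label selects every matching row and returns a Series.
dup = df.a["a"]

# A unique label returns a scalar rather than a Series.
unique = df.a["b"]

# A list-like key with .loc keeps the Series shape even for a unique label.
only_b = df.a.loc[["b"]]

print(dup)
print(unique)
print(only_b)
```

Neither lookup raises in pandas, and the row labeled b does not appear in the result for "a".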

mrocklin commented 1 year ago

It seems to me that the underlying issue here is that we're abusing Projection to also mean row selection. We should probably add a row selection operation and then change __getitem__ to behave differently depending on whether the meta is series-like or dataframe-like. Thoughts?
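
A rough sketch of that dispatch, using toy classes built on pandas only. Everything here is an assumption made for illustration: `Expr`, `Source`, `Projection`, and `RowSelection` are stand-ins and not the actual dask-expr classes, and the toy `RowSelection` always returns a Series, whereas pandas returns a scalar for a unique label.

```python
import pandas as pd


class Expr:
    """Toy expression node (hypothetical, not the dask-expr base class)."""

    def __init__(self, meta):
        # Empty pandas object describing the type and columns of the result.
        self.meta = meta

    def __getitem__(self, key):
        # Dispatch on what the expression produces, not on the key alone.
        if isinstance(self.meta, pd.Series):
            return RowSelection(self, key)  # series-like: key is an index label
        return Projection(self, key)  # dataframe-like: key is a column name


class Source(Expr):
    """Leaf node wrapping an in-memory pandas DataFrame."""

    def __init__(self, frame):
        super().__init__(frame.iloc[:0])
        self.frame = frame

    def compute(self):
        return self.frame


class Projection(Expr):
    """Column selection only."""

    def __init__(self, parent, column):
        super().__init__(parent.meta[column])
        self.parent, self.column = parent, column

    def compute(self):
        return self.parent.compute()[self.column]


class RowSelection(Expr):
    """Row selection by index label, kept separate from Projection."""

    def __init__(self, parent, label):
        super().__init__(parent.meta)
        self.parent, self.label = parent, label

    def compute(self):
        # A list-like key keeps the result a Series, even for a unique label.
        return self.parent.compute().loc[[self.label]]


df = pd.DataFrame({"a": [1, 2, 3], "bb": 1}, index=["a", "a", "b"])
expr = Source(df)
col = expr["a"]  # dataframe-like meta, so this becomes a Projection
rows_a = col["a"]  # series-like meta, so this becomes a RowSelection
rows_b = col["b"]

print(rows_a.compute())
print(rows_b.compute())
```

With the two operations separated, an optimizer rule written for column projection never sees a row label, which is the kind of mix-up the traceback above suggests.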

rjzamora commented 1 year ago

I'm +1 on distinguishing between column projection and row selection at the Expr level (if possible).

phofl commented 1 year ago

Yep that makes the most sense to me as well!