dask / dask-expr

BSD 3-Clause "New" or "Revised" License

Incorrect ``unique`` result for column with numerical name #1015

Closed rjzamora closed 5 months ago

rjzamora commented 5 months ago

I'm getting an incorrect unique result (the output contains duplicate values) after selecting a column with a numerical name from a DataFrame collection:

import pandas as pd
import dask.dataframe as dd

key = 1  # numerical column name
pdf = pd.DataFrame({key: [0, 3, 0, 1, 2, 0, 3, 1, 1]})
df = dd.from_pandas(pdf, npartitions=3)

df[key].unique().compute()
0    0
1    1
2    3
0    3
1    2
2    0
3    1
Name: 1, dtype: int64

I haven't had time to dig into this yet, but it seems like things work fine when the column name is 0. Also, changing the shuffle algorithm doesn't seem to resolve the issue.
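
To make the failure concrete, the reported output can be checked for repeated values and compared with the distinct values of the input. This is a minimal sketch in plain Python using the numbers copied from the report above; `find_duplicates` and `is_valid_unique` are hypothetical helper names for illustration, not part of dask or pandas.

```python
from collections import Counter

# Input column and the values printed by the reported unique() call,
# both copied from the issue text above.
source = [0, 3, 0, 1, 2, 0, 3, 1, 1]
reported = [0, 1, 3, 3, 2, 0, 1]


def find_duplicates(values):
    """Return the sorted values that occur more than once."""
    return sorted(v for v, n in Counter(values).items() if n > 1)


def is_valid_unique(result, source):
    """True if result has no repeats and covers exactly the distinct source values."""
    return not find_duplicates(result) and set(result) == set(source)


print(find_duplicates(reported))          # values that appear more than once
print(is_valid_unique(reported, source))  # whether the reported result is a valid unique()
```

The repeated index labels (0, 1, 2 followed by 0, 1, 2, 3) suggest that the result is a concatenation of per-partition outputs that were never deduplicated across partitions, which fits the later comments about shuffling.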

phofl commented 5 months ago

0 only worked by accident :)

rjzamora commented 5 months ago

> 0 only worked by accident :)

Heh - Yeah, I just discovered the same thing. Doesn't seem like there is support for int column names at the moment.

phofl commented 5 months ago

Not if you shuffle there (but my PR addresses it)