dask / dask


Unexpected sort on binop between Dask.dataframe and pandas.Series #2840

Open TomAugspurger opened 7 years ago

TomAugspurger commented 7 years ago
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.dataframe.utils import assert_eq

def test_binop():
    # 12 string-labelled columns, matching the output below
    df = pd.DataFrame(np.random.uniform(size=(5, 12))).rename(columns=str)
    ddf = dd.from_pandas(df, 2)

    result = ddf / ddf.mean(0).compute()
    expected = df / df.mean(0)
    assert_eq(result, expected)

In [29]: result.columns
Out[29]: Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11'], dtype='object')

In [30]: result.compute().columns
Out[30]: Index(['0', '1', '10', '11', '2', '3', '4', '5', '6', '7', '8', '9'], dtype='object')
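
The order in Out[30] is what a lexicographic sort of the string labels produces, which suggests the columns are being sorted somewhere between graph construction and compute. A quick check that needs neither pandas nor dask (the variable name is only for illustration):

```python
# Column labels as strings, in their original (numeric) order.
labels = [str(i) for i in range(12)]

# Sorting strings compares character by character, so "10" and "11"
# land between "1" and "2".
print(sorted(labels))
# ['0', '1', '10', '11', '2', '3', '4', '5', '6', '7', '8', '9']
```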

TomAugspurger commented 7 years ago

I think the issue is in map_partitions. To support

df.div(val, axis=0)

where val is a pandas or dask Series, we need to align val with df. We don't want to do this for axis=1. For .div-style binops, we modify the call to work around this.
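
To make the axis distinction concrete, here is a small pandas-only sketch (the frame and values are made up for illustration). With axis=0 the Series is matched against the row index, which dask splits across partitions, so val has to be aligned with df. With axis=1 it is matched against the column labels, which every partition carries in full, so no partition alignment should be needed.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [4.0, 8.0]}, index=[0, 1])

# axis=0: val is matched against the row index of df.
row_val = pd.Series([1.0, 2.0], index=[0, 1])
by_rows = df.div(row_val, axis=0)

# axis=1: val is matched against the column labels of df.
col_val = pd.Series({"a": 1.0, "b": 4.0})
by_cols = df.div(col_val, axis=1)

print(by_rows)
print(by_cols)
```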

This unwanted alignment has come up in a few places recently: https://github.com/dask/dask/issues/2809 and https://github.com/dask/dask/issues/2807. I wonder if it's worthwhile to let the caller specify which elements of *args should go through _maybe_from_pandas and _maybe_align_partitions.
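
A rough sketch of what a caller-controlled opt-out could look like. Everything here is hypothetical: call_with_selective_alignment, the align_mask argument, and sort_by_label are invented for illustration and are not dask APIs. Plain dicts stand in for Series, and sorting by label stands in for whatever reordering the alignment step introduces.

```python
def sort_by_label(mapping):
    """Stand-in for an alignment step that reorders labels (hypothetical)."""
    return dict(sorted(mapping.items()))


def call_with_selective_alignment(func, args, align_mask, align=sort_by_label):
    """Apply `align` only to the args the caller flags, then call func.

    Hypothetical helper: align_mask[i] says whether args[i] is aligned.
    """
    prepared = [align(a) if flag else a for a, flag in zip(args, align_mask)]
    return func(*prepared)


def label_order(mapping):
    """Return the labels in the order func receives them."""
    return list(mapping)


# A dict standing in for a pandas Series indexed by column label.
means = {"0": 0.5, "2": 0.4, "10": 0.6}

aligned = call_with_selective_alignment(label_order, [means], [True])
untouched = call_with_selective_alignment(label_order, [means], [False])

print(aligned)    # ['0', '10', '2']
print(untouched)  # ['0', '2', '10']
```

With the flag off, the argument reaches func in its original order, which is the behaviour wanted for the axis=1 case.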