dask / dask-expr

BSD 3-Clause "New" or "Revised" License
79 stars 18 forks source link

dask-expr planning causes incorrect merge behavior at compute time after dropna #1060

Closed zmbc closed 1 month ago

zmbc commented 1 month ago

Describe the issue: With query planning turned on, a dropna-then-merge that results in overlapping columns will correctly determine what columns it should create but they don't actually exist when that merge is computed.

Minimal Complete Verifiable Example:

import dask
# If you uncomment this, it works
# dask.config.set({"dataframe.query-planning": False})

import dask.dataframe as dd
import pandas as pd

sample_df = dd.from_pandas(pd.DataFrame({'foo': ["1", "2", "3"], 'bar': ["4", "5", "6"]}), npartitions=1)
sample_df_dropna = sample_df.dropna(subset=["foo"], how="any")

print(sample_df.merge(
    sample_df,
    on=["foo"], how="left"
).columns)

print(sample_df.merge(
    sample_df,
    on=["foo"], how="left"
).compute().columns)

print(sample_df_dropna.merge(
    sample_df_dropna,
    on=["foo"], how="left"
).columns)

print(sample_df_dropna.merge(
    sample_df_dropna,
    on=["foo"], how="left"
).compute().columns)

The last print statement shows that when the expression is actually computed, bar_x and bar_y aren't created by the merge as they should be.

Anything else we need to know?:

Environment: