Describe the issue: With query planning turned on, a dropna-then-merge that results in overlapping columns will correctly determine what columns it should create but they don't actually exist when that merge is computed.
Minimal Complete Verifiable Example:
import dask
# If you uncomment this, it works
# dask.config.set({"dataframe.query-planning": False})
import dask.dataframe as dd
import pandas as pd
sample_df = dd.from_pandas(pd.DataFrame({'foo': ["1", "2", "3"], 'bar': ["4", "5", "6"]}), npartitions=1)
sample_df_dropna = sample_df.dropna(subset=["foo"], how="any")
print(sample_df.merge(
sample_df,
on=["foo"], how="left"
).columns)
print(sample_df.merge(
sample_df,
on=["foo"], how="left"
).compute().columns)
print(sample_df_dropna.merge(
sample_df_dropna,
on=["foo"], how="left"
).columns)
print(sample_df_dropna.merge(
sample_df_dropna,
on=["foo"], how="left"
).compute().columns)
The last print statement shows that when the expression is actually computed, bar_x and bar_y aren't created by the merge as they should be.
Describe the issue: With query planning turned on, a dropna-then-merge that results in overlapping columns will correctly determine what columns it should create but they don't actually exist when that merge is computed.
Minimal Complete Verifiable Example:
The last
print
statement shows that when the expression is actually computed, bar_x and bar_y aren't created by the merge as they should be.Anything else we need to know?:
Environment: