Joins, and joining columns

MarcoGorelli commented 8 months ago

We've got some details to sort out

Inner join on overlapping column (no issue)

df1.join(df2, how='inner', left_on='a', right_on='a')

Then, clearly, the output should contain a single occurrence of column 'a'. E.g.

If we do an inner join, then the output should clearly be:

In [44]: left = pd.DataFrame({'a': [1,2]})
    ...: right = pd.DataFrame({'a': [2, 3], 'b': [2,4]})
    ...: left.merge(right, left_on='a', right_on='a')
Out[44]:
   a  b
0  2  2

In [45]: left = pl.DataFrame({'a': [1,2]})
    ...: right = pl.DataFrame({'a': [2, 3], 'b': [2,4]})
    ...: left.join(right, left_on='a', right_on='a')
Out[45]:
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 2   │
└─────┴─────┘

Inner join on non-overlapping column - keep both joining columns?

Here pandas and polars differ:

In [47]: left = pd.DataFrame({'a': [1,2]})
    ...: right = pd.DataFrame({'b': [2,4]})
    ...: left.merge(right, left_on='a', right_on='b')
Out[47]:
   a  b
0  2  2

In [48]: left = pl.DataFrame({'a': [1,2]})
    ...: right = pl.DataFrame({'b': [2,4]})
    ...: left.join(right, left_on='a', right_on='b')
Out[48]:
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 2   │
└─────┘

The joining columns are guaranteed to have the same values at all rows in an inner join, so I think the extra column in the pandas case is redundant?

Left join on overlapping column

Here I think I'd have expected both joining columns to appear, as they're not guaranteed to be the same on each row.

But neither pandas nor polars do that

In [61]: left = pd.DataFrame({'a': [1,2]})
    ...: right = pd.DataFrame({'a': [2,4]})
    ...: left.merge(right, left_on='a', right_on='a', how='left')
Out[61]:
   a
0  1
1  2

In [62]: left = pl.DataFrame({'a': [1,2]})
    ...: right = pl.DataFrame({'a': [2,4]})
    ...: left.join(right, left_on='a', right_on='a', how='left')
Out[62]:
shape: (2, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
└─────┘

Left join on non-overlapping column

In [54]: left = pd.DataFrame({'a': [1,2]})
    ...: right = pd.DataFrame({'b': [2,4]})
    ...: left.merge(right, left_on='a', right_on='b', how='left')
Out[54]:
   a    b
0  1  NaN
1  2  2.0

In [55]: left = pl.DataFrame({'a': [1,2]})
    ...: right = pl.DataFrame({'b': [2,4]})
    ...: left.join(right, left_on='a', right_on='b', how='left')
Out[55]:
shape: (2, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
└─────┘

In a left join, there's no guarantee that the joining columns will be equal on each row, so I think the pandas one is more complete here?

Outer join on overlapping column

In [56]: left = pd.DataFrame({'a': [1,2]})
    ...: right = pd.DataFrame({'a': [2,4]})
    ...: left.merge(right, left_on='a', right_on='a', how='outer')
Out[56]:
   a
0  1
1  2
2  4

In [57]: left = pl.DataFrame({'a': [1,2]})
    ...: right = pl.DataFrame({'a': [2,4]})
    ...: left.join(right, left_on='a', right_on='a', how='outer')
Out[57]:
shape: (3, 2)
┌──────┬─────────┐
│ a    ┆ a_right │
│ ---  ┆ ---     │
│ i64  ┆ i64     │
╞══════╪═════════╡
│ 2    ┆ 2       │
│ null ┆ 4       │
│ 1    ┆ null    │
└──────┴─────────┘

Even though the joining column has the same name, Polars keeps track of its values for each row from the original dataframes, and appends '_right' to one of the names. I think this one's more complete here

Outer join, non-overlapping columns

Both keep both, no issue

In [58]: left = pd.DataFrame({'a': [1,2]})
    ...: right = pd.DataFrame({'b': [2,4]})
    ...: left.merge(right, left_on='a', right_on='b', how='outer')
Out[58]:
     a    b
0  1.0  NaN
1  2.0  2.0
2  NaN  4.0

In [59]: left = pl.DataFrame({'a': [1,2]})
    ...: right = pl.DataFrame({'b': [2,4]})
    ...: left.join(right, left_on='a', right_on='b', how='outer')
Out[59]:
shape: (3, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 2    ┆ 2    │
│ null ┆ 4    │
│ 1    ┆ null │
└──────┴──────┘

anmyachev commented 8 months ago

Inner join on non-overlapping column - keep both joining columns?

The joining columns are guaranteed to have the same values at all rows in an inner join, so I think the extra column in the pandas case is redundant?

+1 we will reduce memory consumption.
Left join on overlapping column

Here I think I'd have expected both joining columns to appear, as they're not guaranteed to be the same on each row.

I guess the idea here is to do it by analogy with the outer join in Polars? By storing this data, one can understand which keys were not in the right operand, but is this data always needed? It turns out that memory will be wasted in cases where the user does not need it.
Left join on non-overlapping column

In a left join, there's no guarantee that the joining columns will be equal on each row, so I think the pandas one is more complete here?

If the names of the columns used to perform the merge (left_on, right_on) were ignored, then the behavior would be consistent with the previous option.
Outer join on overlapping column

Even though the joining column has the same name, Polars keeps track of its values for each row from the original dataframes, and appends '_right' to one of the names. I think this one's more complete here

It seems great to have this information, but not everyone needs it. It’s like pandas used to do sorting by default for some operation (I don’t remember exactly), and then it stopped being the default, while users have the opportunity to enable it if necessary.

MarcoGorelli commented 8 months ago

I'd suggest returning both, if users want the coalesced version they can coalesce

And regarding memory, I'd say that it's up to the implementation to optimise that

MarcoGorelli commented 8 months ago

From discussion today:

Some other options could be:

user has to specify which columns to keep, e.g. df1.join(df2, how='left', left_on='a', right_on='a', left_columns=['a', 'b', 'c'], right_columns=['d', 'e']), which would raise if left_columns and right_columns overlap

always coalesce, keep left-hand-side. Users can do

df1.assign(df1.col('a').rename('a_left')).join(df2.assign(df1.col('a').rename('a_right')), how='outer', left_on='a', right_on='a')`

coalesce if column names overlap, else keep both. Users can still get the behaviour they want by renaming beforehand

Of these options, it seems there was most appetite for the last one

data-apis / dataframe-api