data-apis / dataframe-api

RFC document, tooling and other content related to the dataframe API standard
https://data-apis.org/dataframe-api/draft/index.html
MIT License
99 stars 20 forks source link

Joins, and joining columns #343

Closed MarcoGorelli closed 6 months ago

MarcoGorelli commented 8 months ago

We've got some details to sort out

Inner join on overlapping column (no issue)

df1.join(df2, how='inner', left_on='a', right_on='a')

Then, clearly, the output should contain a single occurrence of column 'a'. E.g.

If we do an inner join, then the output should clearly be:

In [44]: left = pd.DataFrame({'a': [1,2]})
    ...: right = pd.DataFrame({'a': [2, 3], 'b': [2,4]})
    ...: left.merge(right, left_on='a', right_on='a')
Out[44]:
   a  b
0  2  2
In [45]: left = pl.DataFrame({'a': [1,2]})
    ...: right = pl.DataFrame({'a': [2, 3], 'b': [2,4]})
    ...: left.join(right, left_on='a', right_on='a')
Out[45]:
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 2   │
└─────┴─────┘

Inner join on non-overlapping column - keep both joining columns?

Here pandas and polars differ:

In [47]: left = pd.DataFrame({'a': [1,2]})
    ...: right = pd.DataFrame({'b': [2,4]})
    ...: left.merge(right, left_on='a', right_on='b')
Out[47]:
   a  b
0  2  2

In [48]: left = pl.DataFrame({'a': [1,2]})
    ...: right = pl.DataFrame({'b': [2,4]})
    ...: left.join(right, left_on='a', right_on='b')
Out[48]:
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 2   │
└─────┘

The joining columns are guaranteed to have the same values at all rows in an inner join, so I think the extra column in the pandas case is redundant?

Left join on overlapping column

Here I think I'd have expected both joining columns to appear, as they're not guaranteed to be the same on each row.

But neither pandas nor polars do that

In [61]: left = pd.DataFrame({'a': [1,2]})
    ...: right = pd.DataFrame({'a': [2,4]})
    ...: left.merge(right, left_on='a', right_on='a', how='left')
Out[61]:
   a
0  1
1  2

In [62]: left = pl.DataFrame({'a': [1,2]})
    ...: right = pl.DataFrame({'a': [2,4]})
    ...: left.join(right, left_on='a', right_on='a', how='left')
Out[62]:
shape: (2, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
└─────┘

Left join on non-overlapping column

In [54]: left = pd.DataFrame({'a': [1,2]})
    ...: right = pd.DataFrame({'b': [2,4]})
    ...: left.merge(right, left_on='a', right_on='b', how='left')
Out[54]:
   a    b
0  1  NaN
1  2  2.0

In [55]: left = pl.DataFrame({'a': [1,2]})
    ...: right = pl.DataFrame({'b': [2,4]})
    ...: left.join(right, left_on='a', right_on='b', how='left')
Out[55]:
shape: (2, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
└─────┘

In a left join, there's no guarantee that the joining columns will be equal on each row, so I think the pandas one is more complete here?

Outer join on overlapping column

In [56]: left = pd.DataFrame({'a': [1,2]})
    ...: right = pd.DataFrame({'a': [2,4]})
    ...: left.merge(right, left_on='a', right_on='a', how='outer')
Out[56]:
   a
0  1
1  2
2  4

In [57]: left = pl.DataFrame({'a': [1,2]})
    ...: right = pl.DataFrame({'a': [2,4]})
    ...: left.join(right, left_on='a', right_on='a', how='outer')
Out[57]:
shape: (3, 2)
┌──────┬─────────┐
│ a    ┆ a_right │
│ ---  ┆ ---     │
│ i64  ┆ i64     │
╞══════╪═════════╡
│ 2    ┆ 2       │
│ null ┆ 4       │
│ 1    ┆ null    │
└──────┴─────────┘

Even though the joining column has the same name, Polars keeps track of its values for each row from the original dataframes, and appends '_right' to one of the names. I think this one's more complete here

Outer join, non-overlapping columns

Both keep both, no issue

In [58]: left = pd.DataFrame({'a': [1,2]})
    ...: right = pd.DataFrame({'b': [2,4]})
    ...: left.merge(right, left_on='a', right_on='b', how='outer')
Out[58]:
     a    b
0  1.0  NaN
1  2.0  2.0
2  NaN  4.0

In [59]: left = pl.DataFrame({'a': [1,2]})
    ...: right = pl.DataFrame({'b': [2,4]})
    ...: left.join(right, left_on='a', right_on='b', how='outer')
Out[59]:
shape: (3, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 2    ┆ 2    │
│ null ┆ 4    │
│ 1    ┆ null │
└──────┴──────┘
anmyachev commented 8 months ago
MarcoGorelli commented 8 months ago

I'd suggest returning both, if users want the coalesced version they can coalesce

And regarding memory, I'd say that it's up to the implementation to optimise that

MarcoGorelli commented 8 months ago

From discussion today:

Some other options could be:

Of these options, it seems there was most appetite for the last one