MarcoGorelli closed this 8 months ago
Inner join on non-overlapping column - keep both joining columns?
The joining columns are guaranteed to have the same values at all rows in an inner join, so I think the extra column in the pandas case is redundant?
+1, that would reduce memory consumption.
Left join on overlapping column
Here I think I'd have expected both joining columns to appear, as they're not guaranteed to be the same on each row.
I guess the idea here is to do it by analogy with the outer join in Polars? Storing this data lets one see which keys were not in the right operand, but is this data always needed? Memory ends up being wasted in cases where the user does not need it.
Left join on non-overlapping column
In a left join, there's no guarantee that the joining columns will be equal on each row, so I think the pandas one is more complete here?
If the names of the columns used to perform the merge (left_on, right_on) were ignored, then the behavior would be consistent with the previous option.
Outer join on overlapping column
Even though the joining column has the same name, Polars keeps track of its values for each row from the original dataframes, and appends '_right' to one of the names. I think this one's more complete here.
It seems great to have this information, but not everyone needs it. It's similar to how pandas used to sort by default for some operation (I don't remember exactly which), and then that stopped being the default, while users can still enable it if necessary.
I'd suggest returning both; users who want the coalesced version can coalesce.
And regarding memory, I'd say that it's up to the implementation to optimise that
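To make "they can coalesce" concrete, here is a minimal pandas sketch. The frames and the output column name `key` are made up for illustration, and `combine_first` is just one way to coalesce in pandas, not something the thread settled on.

```python
import pandas as pd

# Hypothetical frames: the key is 'a' on the left and 'b' on the right.
left = pd.DataFrame({"a": [1, 2, 3], "x": [10, 20, 30]})
right = pd.DataFrame({"b": [2, 3, 4], "y": [200, 300, 400]})

# The outer join keeps both key columns; each has a missing value
# on the rows that only exist on the other side.
both = left.merge(right, how="outer", left_on="a", right_on="b")

# Coalesce after the fact: take 'a' where present, otherwise fall back to 'b'.
coalesced = both.assign(key=both["a"].combine_first(both["b"])).drop(columns=["a", "b"])
print(coalesced)
```

Going the other way (recovering the two original key columns from an already coalesced one) is not possible, which is the argument for returning both by default.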
From discussion today:
Some other options could be:
- `df1.join(df2, how='left', left_on='a', right_on='a', left_columns=['a', 'b', 'c'], right_columns=['d', 'e'])`, which would raise if `left_columns` and `right_columns` overlap
- `df1.assign(df1.col('a').rename('a_left')).join(df2.assign(df1.col('a').rename('a_right')), how='outer', left_on='a', right_on='a')`
Of these options, it seems there was most appetite for the last one
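For comparison, a rough pandas analogue of the "copy the key under a new name before joining" idea. This is only a sketch with made-up frames; it uses pandas' `assign` and `merge`, not the API being proposed in this thread.

```python
import pandas as pd

# Hypothetical frames sharing the key column 'a'.
left = pd.DataFrame({"a": [1, 2, 3], "x": [10, 20, 30]})
right = pd.DataFrame({"a": [2, 3, 4], "y": [200, 300, 400]})

# Copy each side's key under a distinct name, then join on the shared name.
result = left.assign(a_left=left["a"]).merge(
    right.assign(a_right=right["a"]), how="outer", on="a"
)

# 'a' is the coalesced key; 'a_left' / 'a_right' record which side had it.
print(result[["a", "a_left", "a_right"]])
```

The user opts in to the extra columns, so the join itself can always return a single key column.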
We've got some details to sort out
Inner join on overlapping column (no issue)
Then, clearly, the output should contain a single occurrence of column 'a'. E.g. if we do an inner join, then the output should clearly be:
Inner join on non-overlapping column - keep both joining columns?
Here pandas and polars differ:
The joining columns are guaranteed to have the same values at all rows in an inner join, so I think the extra column in the pandas case is redundant?
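A small pandas sketch of the redundancy (frames made up for illustration): both key columns come back, and they agree on every row.

```python
import pandas as pd

# Hypothetical frames: the key is 'a' on the left and 'b' on the right.
left = pd.DataFrame({"a": [1, 2, 3], "x": [10, 20, 30]})
right = pd.DataFrame({"b": [2, 3, 4], "y": [200, 300, 400]})

inner = left.merge(right, how="inner", left_on="a", right_on="b")

# pandas keeps both key columns in the result.
print(list(inner.columns))

# On an inner join the two key columns are equal on every row.
keys_always_equal = bool((inner["a"] == inner["b"]).all())
print(keys_always_equal)
```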
Left join on overlapping column
Here I think I'd have expected both joining columns to appear, as they're not guaranteed to be the same on each row.
But neither pandas nor polars do that
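In pandas, the information about unmatched rows can still be recovered on request via the `indicator` argument of `merge`. A minimal sketch with made-up frames:

```python
import pandas as pd

# Hypothetical frames sharing the key column 'a'.
left = pd.DataFrame({"a": [1, 2, 3], "x": [10, 20, 30]})
right = pd.DataFrame({"a": [2, 3, 4], "y": [200, 300, 400]})

# A left join on a shared name returns a single 'a' column (the left keys).
# indicator=True adds a '_merge' column saying where each row came from.
result = left.merge(right, how="left", on="a", indicator=True)

# Keys that had no match on the right.
unmatched = result.loc[result["_merge"] == "left_only", "a"].tolist()
print(unmatched)
```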
Left join on non-overlapping column
In a left join, there's no guarantee that the joining columns will be equal on each row, so I think the pandas one is more complete here?
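A pandas sketch of what "more complete" means here (frames made up for illustration): the right key column is kept and is missing exactly on the rows without a match.

```python
import pandas as pd

# Hypothetical frames: the key is 'a' on the left and 'b' on the right.
left = pd.DataFrame({"a": [1, 2, 3], "x": [10, 20, 30]})
right = pd.DataFrame({"b": [2, 3, 4], "y": [200, 300, 400]})

result = left.merge(right, how="left", left_on="a", right_on="b")

# 'b' is NaN where the left key had no match on the right,
# so the two key columns are not equal on every row.
no_match = result["b"].isna().tolist()
print(result[["a", "b"]])
```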
Outer join on overlapping column
Even though the joining column has the same name, Polars keeps track of its values for each row from the original dataframes, and appends '_right' to one of the names. I think this one's more complete here.
Outer join, non-overlapping columns
Both keep both, no issue