Open AndreiKingsley opened 5 months ago
Maybe it would be useful to add GroupBy.origin: DataFrame, which returns the original dataframe if the GroupBy was created via DataFrame.groupBy()?
I think this is the intended behavior. The key of the group is something temporary and usually consists of columns already in the DF.
We are working on a way to access the group keys from aggregate, though (https://github.com/Kotlin/dataframe/pull/662); maybe that can be a nice alternative.
The original DataFrame can be retrieved using concat (albeit perhaps with a different row order).
OK, either way a new concat is needed for the purpose I described.
Maybe a concatWithKeys() would be a nice addition?
I think it won't hurt to do it by default. One might say that df.groupBy { expr { } } is a shortcut for df.add() { }.groupBy {}
If we do it by default, we would get duplicate columns, because the key columns are often in the groups as well.
Andrey's implementation only adds "new" columns (or so I understood).
But then, what qualifies as "new"?
groupBy { expr { myCol } }, yes
groupBy { myCol + 1 }?
groupBy { myCol named "other" }
I think we should be careful here
There's also the case where a user creates a new expr column with a duplicate name that should still be kept, so my suggestion is the following:
Create a concatWithKeys() that will add all key columns to the front of the groups, regardless of whether they were in the DF already. Avoid naming clashes by using the ColumnNameGenerator, for instance with DynamicDataFrameBuilder.
Something like:
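internal fun GroupBy<*, *>.concatWithKeys(): DataFrame<*> =
    mapToFrames {
        DynamicDataFrameBuilder()
            .apply {
                // First add the group's own columns.
                for (column in group.columns()) {
                    add(column)
                }
                // Then append every key value as a constant column of the group's length.
                val rowsCount = group.rowsCount()
                for ((name, value) in key.toMap()) {
                    add(List(rowsCount) { value }.toColumn(name))
                }
            }
            .toDataFrame()
            // The key columns were appended last, so move them to the front.
            .moveToLeft { takeLast(key.count()) }
    }.concat()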
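To make the intended result concrete, here is a minimal plain-Kotlin sketch of the same idea, without the DataFrame library. Rows are modelled as maps, and `Row`, `uniqueName`, and this standalone `concatWithKeys` are hypothetical names used only for illustration. The numeric-suffix renaming is an assumption standing in for whatever ColumnNameGenerator actually produces.

```kotlin
typealias Row = Map<String, Any?>

// Hypothetical stand-in for a name generator: appends 1, 2, ... until the name is unused.
fun uniqueName(name: String, taken: Set<String>): String {
    if (name !in taken) return name
    var i = 1
    while ("$name$i" in taken) i++
    return "$name$i"
}

// Model of the proposal: every key column is placed in front of the group's
// columns, and is renamed only when its name clashes with an existing column.
fun concatWithKeys(groups: Map<Row, List<Row>>): List<Row> =
    groups.flatMap { (key, rows) ->
        rows.map { row ->
            val taken = row.keys.toMutableSet()
            val result = LinkedHashMap<String, Any?>()
            for ((name, value) in key) {
                val fresh = uniqueName(name, taken)
                taken += fresh
                result[fresh] = value
            }
            result.putAll(row)
            result
        }
    }

fun main() {
    val rows: List<Row> = listOf(
        mapOf<String, Any?>("city" to "Oslo", "n" to 1),
        mapOf<String, Any?>("city" to "Rome", "n" to 2),
        mapOf<String, Any?>("city" to "Oslo", "n" to 3),
    )
    // Group by a computed key that reuses the name "n", as an expr column might.
    val groups: Map<Row, List<Row>> =
        rows.groupBy { mapOf<String, Any?>("n" to (it["n"] as Int) % 2) }
    concatWithKeys(groups).forEach(::println)
}
```

In this model the computed key is kept next to the original `n` column under a generated name (`n1`), so neither value is lost when the groups are concatenated.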
Alternatively, what's arguably a lot simpler, we could just explode the groups column. Like:
internal fun GroupBy<*, *>.concatWithKeys(): DataFrame<*> =
    toDataFrame().explode { groups }
This will generate extra key values where necessary and keep the grouped columns in a column group, avoiding any potential name clashes :).
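The shape of the exploded result can be sketched in plain Kotlin as well. This is only a model under assumptions: `Row` and `explodeGroups` are hypothetical names, rows are maps, and the nested column is called "group" purely for illustration.

```kotlin
typealias Row = Map<String, Any?>

// Model of exploding the groups column: each key is repeated once per row of
// its group, and the row itself stays nested under a single column, so its
// column names cannot clash with the key names.
// Note: a key column that is itself named like groupsName would be overwritten here.
fun explodeGroups(groups: Map<Row, List<Row>>, groupsName: String = "group"): List<Row> =
    groups.flatMap { (key, rows) ->
        rows.map { row -> key + (groupsName to row) }
    }

fun main() {
    val groups: Map<Row, List<Row>> = mapOf(
        mapOf<String, Any?>("n" to 1) to listOf(
            mapOf<String, Any?>("city" to "Oslo", "n" to 1),
            mapOf<String, Any?>("city" to "Oslo", "n" to 3),
        ),
        mapOf<String, Any?>("n" to 0) to listOf(
            mapOf<String, Any?>("city" to "Rome", "n" to 2),
        ),
    )
    explodeGroups(groups).forEach(::println)
}
```

Compared with the flattening approach, no renaming is needed: the key `n` and the nested `n` live at different levels.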
concat removes the key column entirely (name and values)