databricks / koalas

Koalas: pandas API on Apache Spark
Apache License 2.0

Fix groupby-apply and transform to support additional dtypes. #2124

Closed. ueshin closed this 3 years ago

ueshin commented 3 years ago

Fix groupby-apply and transform to support additional dtypes.

With this change, additional dtypes, such as the category dtype in the example below, can be specified in the return type annotation of UDFs passed to groupby-apply and transform.

>>> kdf = ks.DataFrame(
...     {
...         "a": pd.Categorical([1, 2, 3, 1, 2, 3]),
...         "b": pd.Categorical(
...             ["b", "a", "c", "c", "b", "a"], categories=["c", "b", "d", "a"]
...         ),
...     },
... )
>>> def identity(df) -> ks.DataFrame[zip(kdf.columns, kdf.dtypes)]:
...     return df
...
>>> applied = kdf.groupby("a").apply(identity)
>>> applied
   a  b
0  2  a
1  2  b
2  3  c
3  3  a
4  1  b
5  1  c
>>> applied.dtypes
a    category
b    category
dtype: object

FYI: without the fix, the same call loses the category dtype and returns the underlying integer category codes as int64:

>>> applied
   a  b
0  1  3
1  1  1
2  2  0
3  2  3
4  0  1
5  0  0
>>> applied.dtypes
a    int64
b    int64
dtype: object
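
The integers in the unfixed output are the positions of each value in its column's category list, which is what pandas exposes as `Series.cat.codes`. The following is a minimal plain-Python sketch of that mapping, not Koalas code; `category_codes` is a hypothetical helper written only for illustration, and the category order for column "a" is assumed to be the sorted order that pandas infers.

```python
# Plain-Python sketch of how categorical values map to integer codes.
# `category_codes` is a hypothetical helper, not part of Koalas or pandas.
def category_codes(values, categories):
    """Return each value's position in `categories` (-1 if it is absent)."""
    index = {cat: i for i, cat in enumerate(categories)}
    return [index.get(v, -1) for v in values]


# Column "b" as shown in the fixed output, with its declared categories.
b_values = ["a", "b", "c", "a", "b", "c"]
b_categories = ["c", "b", "d", "a"]
b_codes = category_codes(b_values, b_categories)

# Column "a" as shown in the fixed output; categories assumed to be [1, 2, 3].
a_values = [2, 2, 3, 3, 1, 1]
a_categories = [1, 2, 3]
a_codes = category_codes(a_values, a_categories)

print(a_codes)  # [1, 1, 2, 2, 0, 0]
print(b_codes)  # [3, 1, 0, 3, 1, 0]
```

These lists match columns "a" and "b" of the unfixed output row by row, so the values themselves were intact and only the dtype information was being dropped.
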

xinrong-meng commented 3 years ago

Looks great! Thank you!

ueshin commented 3 years ago

Thanks! Merging.