databricks / koalas

Koalas: pandas API on Apache Spark
Apache License 2.0

Fix DataFrame.koalas.apply_batch to support additional dtypes. #2126

Closed: ueshin closed this 3 years ago

ueshin commented 3 years ago

Fix DataFrame.koalas.apply_batch to support additional dtypes.

With this change, additional dtypes such as pd.CategoricalDtype can be specified in the return type annotation of UDFs passed to DataFrame.koalas.apply_batch, and the result keeps those dtypes.

>>> import pandas as pd
>>> import databricks.koalas as ks
>>> kdf = ks.DataFrame(
...     {"a": ["a", "b", "c", "a", "b", "c"], "b": ["b", "a", "c", "c", "b", "a"]}
... )
>>> dtype = pd.CategoricalDtype(categories=["a", "b", "c", "d"])
>>> def to_category(pdf) -> ks.DataFrame["a": dtype, "b": dtype]:
...     return pdf.astype(dtype)
...
>>> applied = kdf.koalas.apply_batch(to_category)
>>> applied
   a  b
0  a  b
1  b  a
2  c  c
3  a  c
4  b  b
5  c  a
>>> applied.dtypes
a    category
b    category
dtype: object
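
For a quick local check without a Spark session, the body of the UDF can be run on a plain pandas DataFrame, which is roughly what each batch looks like inside apply_batch. This is a minimal sketch: the pandas-only annotation and the names pdf and result are illustrative assumptions, not part of this PR.

```python
import pandas as pd

dtype = pd.CategoricalDtype(categories=["a", "b", "c", "d"])


def to_category(pdf: pd.DataFrame) -> pd.DataFrame:
    # Same body as the UDF above, annotated with plain pandas types
    return pdf.astype(dtype)


pdf = pd.DataFrame(
    {"a": ["a", "b", "c", "a", "b", "c"], "b": ["b", "a", "c", "c", "b", "a"]}
)
result = to_category(pdf)

# Both columns should report the categorical dtype
print(result.dtypes)
# The unused category "d" is still part of the dtype
print(list(result["a"].cat.categories))
```

The declared categories, including the unused "d", travel with the dtype, which is why the return type annotation has to carry the full dtype object rather than just a type name.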

For reference, without the fix the same call returns the integer category codes instead of the categorical values:

>>> applied
   a  b
0  0  1
1  1  0
2  2  2
3  0  2
4  1  1
5  2  0
>>> applied.dtypes
a    int64
b    int64
dtype: object
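
The integers in the unfixed output match the pandas category codes for the declared categories (a is 0, b is 1, c is 2). The pandas-only sketch below shows that relationship. It illustrates the encoding and is not the code changed in this PR; the names values, codes, and restored are illustrative.

```python
import pandas as pd

dtype = pd.CategoricalDtype(categories=["a", "b", "c", "d"])
values = pd.Series(["a", "b", "c", "a", "b", "c"]).astype(dtype)

# Each code is the position of the value in dtype.categories
codes = values.cat.codes
print(codes.tolist())

# The codes plus the dtype are enough to rebuild the categorical values
restored = pd.Series(pd.Categorical.from_codes(codes, dtype=dtype))
print(restored.tolist())
print(restored.dtype)
```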

xinrong-meng commented 3 years ago

Looks great! Pending tests. Thanks!

ueshin commented 3 years ago

Thanks! Merging.