candalfigomoro closed this issue 4 years ago
More of a workaround than anything, but here's a way to add an ID column using Spark's monotonically_increasing_id
function, which should be good enough for your use case.
import pyspark.sql.functions as F
df["id"] = F.monotonically_increasing_id()
df = df[["id"] + [col for col in df.columns if col != "id"]]
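One caveat with this workaround: according to the Spark documentation, the generated IDs are unique and increasing but not consecutive, because the current implementation puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The sketch below is a pure-Python simulation of that layout, not Spark itself; the function name simulated_monotonic_ids and the partition sizes are hypothetical and only there for illustration.

```python
from typing import List

# Per the Spark docs, the lower 33 bits hold the record number within a partition.
RECORD_BITS = 33


def simulated_monotonic_ids(partition_sizes: List[int]) -> List[int]:
    """Simulate the documented ID layout (hypothetical helper, not a Spark API).

    Each ID is (partition index << 33) + record number within that partition.
    """
    ids = []
    for partition_index, size in enumerate(partition_sizes):
        for record_number in range(size):
            ids.append((partition_index << RECORD_BITS) + record_number)
    return ids


# Two partitions holding 3 and 2 rows respectively (illustrative assumption).
ids = simulated_monotonic_ids([3, 2])
print(ids)  # [0, 1, 2, 8589934592, 8589934593]
```

So the column is fine as a unique, ordered key, but it should not be treated as a gap-free row number.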
In pandas, adding a new column with an increasing sequential id is trivial:
so we don't need any specific function.
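For reference, here is a minimal sketch of one common way to do this in plain pandas. The toy DataFrame and the column name id are illustrative assumptions, and this is not necessarily the exact snippet intended above.

```python
import pandas as pd

# Toy data, assumed purely for illustration.
df = pd.DataFrame({"name": ["a", "b", "c"]})

# Assign a sequential id 0..n-1; pandas aligns it by position with the default index.
df["id"] = range(len(df))

# Move the new column to the front, mirroring the column reordering used elsewhere in this thread.
df = df[["id"] + [col for col in df.columns if col != "id"]]
print(df)
```

This works because the whole pandas DataFrame lives in local memory, so a plain Python range lines up with the rows one to one.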
Koalas itself does create an increasing id when it generates a default index (see https://github.com/databricks/koalas/blob/master/databricks/koalas/internal.py#L514), but the attach_default_index, attach_distributed_column and attach_distributed_sequence_column methods are internal and are not supposed to be used directly (see the comment at https://github.com/databricks/koalas/blob/master/databricks/koalas/internal.py#L74). Moreover, those internal methods work on Spark dataframes (not Koalas dataframes), so to use them we would have to call to_spark() (paying attention to indexes), add the increasing id column, and then convert the Spark dataframe back to a Koalas dataframe.
Please expose utilities to add increasing ids (sequential or not) to Koalas dataframes, as we can't do it the pandas way.