databricks / koalas

Koalas: pandas API on Apache Spark
Apache License 2.0

Provide a utility to add an increasing id column #1612

Closed by candalfigomoro 4 years ago

candalfigomoro commented 4 years ago

In pandas, adding a new column with an increasing sequential id is trivial:

df.insert(0, "id", range(len(df)))

so we don't need any specific function.

Koalas itself creates an increasing id when it generates a default index (see https://github.com/databricks/koalas/blob/master/databricks/koalas/internal.py#L514), but the attach_default_index, attach_distributed_column and attach_distributed_sequence_column methods are internal and are not supposed to be used directly (see the comment at https://github.com/databricks/koalas/blob/master/databricks/koalas/internal.py#L74).

Moreover, those internal methods work on Spark dataframes (not Koalas dataframes), so to use them we would have to call to_spark() (paying attention to indexes), add the increasing id column, and then convert the Spark dataframe back to a Koalas dataframe.

Please expose utilities to add increasing ids (sequential or not) to Koalas dataframes, as we can't do it the pandas way.
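
To make the "sequential" case concrete, here is a minimal pure-Python sketch of one common way to get consecutive ids over partitioned data: count the rows in each partition, turn the counts into starting offsets, then number the rows inside each partition. This is only an illustration and not Koalas' actual code; sequential_ids and the list-of-lists partition model are hypothetical names and assumptions made for the example.

```python
from itertools import accumulate


def sequential_ids(partitions):
    """Assign consecutive ids 0..n-1 to rows spread over several partitions.

    `partitions` is a list of lists, a stand-in for distributed partitions.
    (Hypothetical helper for illustration only.)
    """
    sizes = [len(part) for part in partitions]
    # Exclusive prefix sum: offsets[i] is the number of rows before partition i.
    offsets = [0] + list(accumulate(sizes))[:-1]
    return [
        [(offset + i, row) for i, row in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]


partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
result = sequential_ids(partitions)
for part in result:
    print(part)
```

The price of consecutive ids is an extra pass to count rows before any id can be assigned, which is why an "increasing but not sequential" id is cheaper to produce on distributed data.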

Callum027 commented 4 years ago

More a workaround than anything, but here's a way to add an ID column using Spark's monotonically_increasing_id function that should be good enough for your use case.

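# df is assumed to be an existing Koalas DataFrame.
import pyspark.sql.functions as F
# Assign a Spark column: the ids are unique and increasing, but not consecutive.
df["id"] = F.monotonically_increasing_id()
# Reorder the columns so that "id" comes first.
df = df[["id"] + [col for col in df.columns if col != "id"]]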
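
One caveat on the ids this produces: they are increasing and unique, but not consecutive. The Spark documentation describes the current implementation as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The pure-Python sketch below mimics that layout to show what the gaps look like; monotonic_id and the partition sizes are hypothetical, chosen only for illustration.

```python
def monotonic_id(partition_index, row_in_partition):
    """Mimic the documented layout: partition ID in the upper bits,
    record number within the partition in the lower 33 bits.
    (Hypothetical helper, not a Spark API.)
    """
    return (partition_index << 33) | row_in_partition


# Assumed example: three partitions holding 2, 1 and 3 rows.
partition_sizes = [2, 1, 3]
ids = [
    monotonic_id(p, r)
    for p, n in enumerate(partition_sizes)
    for r in range(n)
]
print(ids)
```

With more than one partition, the first id of each later partition jumps to a multiple of 2**33, so the column works for ordering or uniqueness but not as a dense 0..n-1 index.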