databricks / koalas

Koalas: pandas API on Apache Spark
Apache License 2.0

Creating Series with existing Int64Index results in error #2170

Closed amueller closed 3 years ago

amueller commented 3 years ago
from databricks import koalas
series = koalas.Series([0, 1, 2])
true_series = koalas.Series(True, index=series.index)

ValueError: The truth value of a Int64Index is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

This is with Koalas 1.8.0 and pandas 1.2.4.

true_series = koalas.Series(True, index=series.index.to_pandas())

works.

Thanks :)

itholic commented 3 years ago

Thanks for the report, @amueller.

As you mentioned in the description, Koalas doesn't allow creating a Series with a Koalas Index.

When creating a Koalas Series, a pandas DataFrame is needed to create the InternalFrame.

So, if Koalas were to allow creating a Series with a Koalas Index, it would have to call to_pandas() internally, which is dangerous since it moves all of the distributed data onto a single node. (Yes, just like you did explicitly in your code.)

For now, we recommend calling to_pandas() explicitly, as you did in your code, but only when you're sure that your data is small enough.
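One way to make the "small enough" condition explicit is a size check before collecting. The sketch below is only an illustration: `small_index_to_pandas` and the `max_rows` default are hypothetical names and values, not part of Koalas, and the helper assumes only that the index supports `len()` and `to_pandas()`.

```python
def small_index_to_pandas(index, max_rows=100_000):
    """Hypothetical helper: collect an index to pandas only if it is small.

    `max_rows` is an arbitrary illustrative threshold, not a Koalas setting.
    """
    # On a distributed index, len() itself has to count the rows,
    # so this check is not free, but it avoids collecting a huge index.
    n_rows = len(index)
    if n_rows > max_rows:
        raise ValueError(
            f"Index has {n_rows} rows, more than max_rows={max_rows}; "
            "refusing to collect it onto a single node."
        )
    return index.to_pandas()
```

Under these assumptions, the call from the report would become `koalas.Series(True, index=small_index_to_pandas(series.index))`.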

You can find more details about the Koalas internals in the Koalas internal documentation.

itholic commented 3 years ago

Oh, anyway, Koalas will be ported into PySpark starting with Spark 3.2, so this repository is now in maintenance mode only.

I'd recommend using the pandas module in PySpark after the Spark 3.2 release.

You can find more details in SPIP: Support pandas API layer on PySpark.

amueller commented 3 years ago

Thanks for the explanation! It would be great to allow using Koalas indexes. I don't see how to do it now if the index is large. Anyway, closing this since the repository is in maintenance mode.
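For the large-index case, one possible workaround is to derive the constant column from the existing Series instead of passing its index to the constructor, so nothing has to be collected. This is a sketch, not something confirmed in this thread: `constant_like` and the `_orig` column name are hypothetical, it relies only on `to_frame()` and scalar column assignment, and it has not been checked against Koalas 1.8.0 specifically.

```python
def constant_like(series, value, name="constant"):
    """Hypothetical helper: return a Series filled with `value`
    that shares the index of `series`.

    Works on any Series-like object that supports to_frame() and
    scalar column assignment (pandas does; Koalas is assumed to).
    """
    # Turn the Series into a one-column frame so the index is carried along.
    frame = series.to_frame(name="_orig")
    # Assigning a scalar broadcasts it to every row of the frame.
    frame[name] = value
    return frame[name]
```

With the Series from the report, this would be called as `true_series = constant_like(series, True)`.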