databricks / koalas

Koalas: pandas API on Apache Spark
Apache License 2.0

Why is kdf.head() much slower than sdf.show()? #2139

Closed RainFung closed 3 years ago

RainFung commented 3 years ago
sdf = read_csv('backflow.csv')
kdf = sdf.to_koalas()

# run time 75ms
sdf.show(5)

# run time 53s
kdf.head()

kdf.head() is much slower than sdf.show(). Is there any way to speed it up in Koalas?

HyukjinKwon commented 3 years ago

Very likely because of the default index: https://koalas.readthedocs.io/en/latest/user_guide/options.html#default-index-type . Can you try with ks.set_option('compute.default_index_type', 'distributed')?
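
For readers who want to try this suggestion, here is a minimal sketch. It assumes Koalas is installed and that `sdf` is an existing Spark DataFrame; the helper name `head_with_distributed_index` is made up for illustration. As far as I understand the linked docs, the default index type at the time (`sequence`) has to number rows globally, which can funnel data through a single node, while `distributed` does not.

```python
def head_with_distributed_index(sdf, n=5):
    """Return the first n rows of a Spark DataFrame as a Koalas DataFrame.

    Hypothetical helper, not part of Koalas. The import lives inside the
    function so the snippet can be defined without Koalas installed.
    """
    # Importing Koalas also registers DataFrame.to_koalas() on Spark DataFrames.
    import databricks.koalas as ks

    # option_context restores the previous option value when the block exits,
    # so the distributed index only applies to this conversion.
    with ks.option_context("compute.default_index_type", "distributed"):
        return sdf.to_koalas().head(n)
```

Calling `ks.set_option('compute.default_index_type', 'distributed')` once, as suggested above, has the same effect for the whole session; the scoped version only limits how far the change reaches.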

RainFung commented 3 years ago

It's much faster now. Can we set it to distributed by default? The speed gap is too big.

HyukjinKwon commented 3 years ago

The distributed index type disables operations between different DataFrames, so changing the default is something we should discuss. Let me close this ticket for now, though.
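
A rough pure-Python illustration of why that restriction exists; this is not Koalas code. It assumes the `distributed` index is built on Spark's `monotonically_increasing_id`, which places the partition ID in the upper bits and the row position within the partition in the lower 33 bits. The helper `distributed_index` and the partition layouts are hypothetical.

```python
def distributed_index(partition_sizes):
    """Mimic monotonically_increasing_id: (partition id << 33) + row offset."""
    return [
        (pid << 33) + offset
        for pid, size in enumerate(partition_sizes)
        for offset in range(size)
    ]

# The same six rows, split across two partitions in two different ways.
idx_a = distributed_index([3, 3])
idx_b = distributed_index([2, 4])

# Count the row positions where both frames ended up with the same label.
same_position = sum(a == b for a, b in zip(idx_a, idx_b))

print(idx_a)
print(idx_b)
print(same_position, "of", len(idx_a), "rows share a label")
```

Because the labels depend on how the data happens to be partitioned, two DataFrames holding the same rows can get different index values, so index-based alignment between them is unreliable.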