databricks / koalas

Koalas: pandas API on Apache Spark
Apache License 2.0

Joining koalas frame with spark #2219

Closed ysomawar closed 2 years ago

ysomawar commented 2 years ago

Hello,

I am very new to Koalas and have just started learning it. I am planning to use Koalas with Spark for large-scale data processing. I am trying to merge two large datasets using the Koalas merge functionality, but I observed that the merge does not run on Spark; it executes locally, which results in slow performance, the same as pandas.

The following is the code block:

import databricks.koalas as ks
from pyspark.sql import SparkSession

#%% Setting up Spark

spark = SparkSession.builder \
                    .appName('koalas_test') \
                    .getOrCreate()

# Reading datasets
kdf = ks.read_csv('file_path1')   # creates a Spark task: "csv at NativeMethodAccessorImpl.java:0"
kdf2 = ks.read_csv('file_path2')  # creates a Spark task: "csv at NativeMethodAccessorImpl.java:0"

# Both frames above are of type databricks.koalas.frame.DataFrame
# Merging frames
kdf3 = kdf.merge(kdf2, on='id')

On merge, no Spark task gets created; the frames appear to be merged locally without taking advantage of Spark. Spark version: 3.1.1

Could somebody please help me understand how I can take advantage of Spark for merging the frames (while using any Koalas API)?

Thanks in advance.

Regards, Yogesh

itholic commented 2 years ago

I think it should take advantage of Spark, since it directly leverages Spark's join function internally, specifically here:

https://github.com/databricks/koalas/blob/07d846225d27492a9b85e58900018fda576650e3/databricks/koalas/frame.py#L7676
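
A likely reason no Spark task shows up at the merge line is lazy evaluation: calling merge only builds a Spark join plan, and a job starts once a result is actually needed (for example when the frame is counted, printed, or written). The following is a minimal sketch of one way to check this, not a definitive diagnosis. The helper name `show_merge_plan` is hypothetical; it relies only on `DataFrame.merge` and the `.spark.explain()` accessor, which prints the underlying Spark plan.

```python
def show_merge_plan(left, right, key):
    """Merge two Koalas / pandas-on-Spark frames and print the Spark plan.

    `show_merge_plan` is a hypothetical helper for illustration only.
    """
    # merge() is lazy: it only builds a Spark join plan, no job runs yet.
    merged = left.merge(right, on=key)

    # Print the physical plan. A join operator in the output (for example
    # SortMergeJoin or BroadcastHashJoin) indicates the merge is planned on Spark.
    merged.spark.explain()
    return merged
```

With the frames from the question this would be called as `kdf3 = show_merge_plan(kdf, kdf2, 'id')`. A job should then appear in the Spark UI once an action such as `len(kdf3)` runs on the result.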

By the way, I recommend using the pyspark.pandas module in PySpark, since Koalas has been ported into PySpark.
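
As a rough sketch of that migration (assuming PySpark 3.2 or later, where `pyspark.pandas` is available), the code from the question changes mainly in its import. The function name `merge_with_pandas_on_spark` and its parameters are hypothetical, and the file paths are the placeholders from the original post.

```python
def merge_with_pandas_on_spark(path1, path2, key="id"):
    """Read two CSV files and merge them with the pandas API on Spark.

    Hypothetical helper; `path1`, `path2` and `key` are illustrative parameters.
    """
    # Imported here so the function can be defined without PySpark installed.
    # This import takes the place of `import databricks.koalas as ks`.
    import pyspark.pandas as ps

    psdf1 = ps.read_csv(path1)
    psdf2 = ps.read_csv(path2)

    # Same call as in Koalas; the join is planned as a Spark join.
    return psdf1.merge(psdf2, on=key)
```

Calling `merge_with_pandas_on_spark('file_path1', 'file_path2')` returns a `pyspark.pandas` DataFrame holding the merged result.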

ysomawar commented 2 years ago

Thank you @itholic for your recommendation. We tried the same thing with pyspark.pandas and it works as expected.