databricks / koalas

Koalas: pandas API on Apache Spark
Apache License 2.0

Predicate Pushdown not Working #2215

Open Lukas012 opened 2 years ago

Lukas012 commented 2 years ago

Hi all,

Environment: Spark 3.0.2, Koalas: 1.8.2, Delta Lake 0.7

I have a Delta table partitioned by the column "PARTITION". Koalas doesn't seem to apply predicate pushdown when I filter on it.

  1. Using Spark:

    import databricks.koalas as ks
    from pyspark.sql.functions import col

    my_kdf = ks.read_delta(f"...")
    my_df = my_kdf.to_spark()
    result_df = my_df.filter((col("PARTITION") == 15) & (col("ID") == 1))
    result_df.to_koalas().toPandas()

    Takes: 20 seconds

  2. Same with Koalas:

    result_kdf = ks.read_delta(f"...")
    result_kdf = result_kdf[(result_kdf["PARTITION"] == 15) & (result_kdf["ID"] == 1)]
    result_kdf.toPandas()

    Takes: 130 seconds (it seems that predicate pushdown is not applied)

  3. Another attempt with Koalas, applying the two filters one after the other:

    my_kdf = ks.read_delta(f"...")
    result_kdf = my_kdf[my_kdf["PARTITION"] == 15]
    result_kdf = result_kdf[result_kdf["ID"] == 1]
    result_kdf.toPandas()

    Takes: 20 seconds.

Why does 2. take so long?

Thanks! Best
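
One way to check whether the filters reach the scan, rather than inferring it from timings, is to look at the physical plan of each variant. The sketch below is only an illustration: `scan_filters` and `plan_of` are hypothetical helper names, the regular expression is a rough match on the `PartitionFilters: [...]` and `PushedFilters: [...]` entries that Spark prints for file scans, and `plan_of` assumes a live Spark session and a Koalas DataFrame (it relies on `kdf.spark.explain()` printing the plan).

```python
import contextlib
import io
import re


def scan_filters(plan_text):
    """Return (kind, filters) pairs for each filter list found in a plan string.

    Hypothetical helper: a rough text match, not a real plan parser.
    """
    return re.findall(r"(PartitionFilters|PushedFilters): \[(.*?)\]", plan_text)


def plan_of(kdf):
    """Capture what kdf.spark.explain() prints (needs a running Spark session)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        kdf.spark.explain()
    return buf.getvalue()


# Usage, assuming result_kdf is one of the filtered Koalas DataFrames above:
# for kind, filters in scan_filters(plan_of(result_kdf)):
#     print(kind, "->", filters or "(empty)")
```

If the slow variant shows an empty `PartitionFilters` list while the fast ones list the `PARTITION` condition, that would support the suspicion that partition pruning is not happening for it.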

HyukjinKwon commented 2 years ago

@Lukas012 would you mind filing an issue at https://issues.apache.org/jira/projects/SPARK?

Lukas012 commented 2 years ago

Why? This problem only occurs in koalas.

itholic commented 2 years ago

@Lukas012 Koalas has been ported into PySpark under the name "pandas API on Spark", and this repository is in maintenance mode only. You will get faster feedback from the Apache Spark community.

FYI, you can also use your existing Koalas code as-is in Apache Spark by changing only the import, as below:

# import databricks.koalas as ks
import pyspark.pandas as ks

... (existing Koalas code)
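
For code that has to run on clusters with and without the newer PySpark, the import swap above could be wrapped in a small fallback. This is a minimal sketch, not something from the Koalas or Spark documentation; `import_first_available` is a hypothetical helper name, and it assumes that at least one of the two packages is installed wherever it is actually called.

```python
import importlib


def import_first_available(*module_names):
    """Import and return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Not installed (or not importable) here; try the next candidate.
            continue
    raise ImportError("none of these modules could be imported: %r" % (module_names,))


# Usage: prefer pandas API on Spark, fall back to the Koalas package.
# ks = import_first_available("pyspark.pandas", "databricks.koalas")
```

The rest of the code can then keep using `ks` unchanged, whichever package was found.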