delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

[BUG] [Spark/Standalone] Delta Table load is slow #2259

Open munendrasn opened 1 year ago

munendrasn commented 1 year ago

Bug

We have a Delta table with ~59 Delta commits; the latest version is 58 and the last checkpoint was created at version 50. The table contains roughly 34K Parquet files. Initial loading of the table takes more than 10 seconds with both delta-standalone and delta-core. Is this expected? Is there a way to optimise the initial load time?
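For context, here is a minimal sketch of which `_delta_log` files a reader would need to rebuild the snapshot for the versions mentioned above. It assumes a single-part checkpoint and the standard 20-digit zero-padded file names from the Delta protocol; the object and method names are hypothetical and not part of any Delta API.

```scala
// Sketch only: lists the _delta_log files needed to reach a given version,
// assuming a single-part checkpoint (multi-part checkpoints are named differently).
object DeltaLogFiles {
  // Delta log file names use the version zero-padded to 20 digits.
  def commitFile(version: Long): String = f"$version%020d.json"
  def checkpointFile(version: Long): String = f"$version%020d.checkpoint.parquet"

  // The checkpoint itself, followed by every commit after it up to `latest`.
  def filesToRead(checkpoint: Long, latest: Long): Seq[String] =
    checkpointFile(checkpoint) +: ((checkpoint + 1) to latest).map(commitFile)
}

object DeltaLogFilesDemo extends App {
  // Checkpoint at version 50, latest version 58, as in this report.
  DeltaLogFiles.filesToRead(50L, 58L).foreach(println)
}
```

For this table that is one checkpoint file plus eight JSON commits (51 through 58), so the number of log files alone may not explain the load time; the checkpoint describing ~34K data files is likely the larger part of the work.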

Which Delta project/connector is this regarding?

Describe the problem

Steps to reproduce

Total rows : 672600

snippet to measure the time

// Runs f, prints the elapsed wall-clock time in milliseconds, and returns f's result.
def time[A](f: => A): A = {
  val s = System.nanoTime
  val ret = f
  println("time: " + (System.nanoTime - s) / 1e6 + "ms")
  ret
}

Spark Query

time {
  val df = spark.sql("select count(*) from delta.`s3://delta-lake/delta-table3`")
  df.show()
}

time: 52866.640429ms

Normal load

import org.apache.spark.sql.delta.DeltaLog

// deltaLocation holds the table path (not defined in this snippet)
time {
  val deltaLog = DeltaLog.forTable(spark, deltaLocation)
}

time: 13177.175263ms
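To tell whether this cost is paid only on the first load, the measurement can be repeated and each run's duration kept. The following is a rough sketch of a variant of the `time` helper above; `LoadTiming`, `timed`, and `timeRuns` are hypothetical names, and the `Thread.sleep` body is only a placeholder for the real load.

```scala
// Sketch only: time a block several times and keep each run's duration,
// so the first (cold) run can be compared with later (warm) runs.
object LoadTiming {
  // Runs `f` once and returns its result with the elapsed time in milliseconds.
  def timed[A](f: => A): (A, Double) = {
    val start = System.nanoTime
    val result = f
    (result, (System.nanoTime - start) / 1e6)
  }

  // Runs `f` `runs` times and returns the elapsed time of each run, in order.
  def timeRuns[A](runs: Int)(f: => A): Seq[Double] =
    (1 to runs).map(_ => timed(f)._2)
}

object LoadTimingDemo extends App {
  // Placeholder workload; in spark-shell this would be the DeltaLog.forTable call.
  val times = LoadTiming.timeRuns(3) { Thread.sleep(20) }
  times.zipWithIndex.foreach { case (ms, i) => println(f"run ${i + 1}: $ms%.1f ms") }
}
```

In spark-shell, the placeholder would be replaced with `DeltaLog.forTable(spark, deltaLocation)`. A large gap between the first run and the rest would suggest the time goes into the initial snapshot construction rather than into each call.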

Observed results

Expected results

Further details

Environment information

Willingness to contribute

The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?

munendrasn commented 1 year ago

A similar issue, but related to refreshing the table: https://github.com/delta-io/connectors/pull/533