Delta Lake vs Parquet benchmarks for different query engines

delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python

https://delta-io.github.io/delta-rs/

Apache License 2.0

Delta Lake vs Parquet benchmarks for different query engines #2012

Open MrPowers opened 10 months ago

MrPowers commented 10 months ago

As mentioned by @djouallah in this PR, there are some queries where Parquet outperforms Delta Lake for DataFusion.

I mentioned in the thread that the data for a given query can be optimally distributed in a Parquet file but poorly distributed in a Delta table, which might cause these differences.

In any case, I think it would be useful to have benchmarks that show how the same queries perform on a Parquet file vs. a Delta table. The TPC-H queries in this notebook seem like a reasonable starting point.
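A minimal sketch of what such a comparison harness could look like, using only the Python standard library. The names `median_runtime` and `compare_formats` are hypothetical helpers, not part of delta-rs or DataFusion, and the lambdas in the demo are placeholders for real "run this query on Parquet" and "run this query on Delta" functions.

```python
import statistics
import time
from typing import Callable, Dict


def median_runtime(run_query: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock seconds over `repeats` calls of `run_query`."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def compare_formats(
    runners: Dict[str, Callable[[], object]], repeats: int = 5
) -> Dict[str, float]:
    """Time each named runner and return {name: median seconds}."""
    return {name: median_runtime(fn, repeats) for name, fn in runners.items()}


if __name__ == "__main__":
    # Placeholder workloads (assumption): swap in functions that execute the
    # same query against a Parquet file and against a Delta table.
    runners = {
        "parquet": lambda: sum(range(100_000)),
        "delta": lambda: sum(range(200_000)),
    }
    for name, seconds in compare_formats(runners).items():
        print(f"{name}: {seconds:.6f}s")
```

Reporting the median rather than a single run reduces the influence of one-off effects such as cold caches, though a real benchmark would probably also want to record cold and warm runs separately.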

Benchmarks of realistic end-to-end query patterns would be cool too, for example:

Convert a CSV file to Parquet / Delta Lake
Delete some rows
Upsert some data
Run a query

djouallah commented 9 months ago

I now have a possibly more useful use case: if you compare GlareDB, which uses DataFusion and delta-rs, against DataFusion on a dataset generated by the delta-rs Python bindings, there is a non-trivial difference.

ion-elgreco commented 8 months ago

> I now have a possibly more useful use case: if you compare GlareDB, which uses DataFusion and delta-rs, against DataFusion on a dataset generated by the delta-rs Python bindings, there is a non-trivial difference.

@djouallah can you try out polars-deltalake and share if you see improvements there?

djouallah commented 8 months ago

@ion-elgreco last time I checked, Polars did not support the full TPC-H SQL

ion-elgreco commented 8 months ago

@djouallah I mean using my Polars extension which does native reads with Polars engine: https://pypi.org/project/polars-deltalake/

djouallah commented 8 months ago

> @djouallah I mean using my Polars extension which does native reads with Polars engine: https://pypi.org/project/polars-deltalake/

I understand, but the benchmark uses SQL and Polars has limited SQL support, so unfortunately I can't run the test yet :(