delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0
1.98k stars 365 forks

Delta Lake vs Parquet benchmarks for different query engines #2012

Open MrPowers opened 5 months ago

MrPowers commented 5 months ago

As mentioned by @djouallah in this PR, there are some queries where Parquet outperforms Delta Lake for DataFusion.

I mentioned in the thread how the data for a given query can be laid out optimally in a Parquet file but poorly in a Delta table, which might explain these differences.
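To make the layout point concrete, here is a toy sketch in plain Python. It does not use the delta-rs or DataFusion APIs; it only assumes that the engine can skip a file when the file's min/max statistics for the filter column do not overlap the predicate. The helper names (`file_stats`, `files_to_scan`, `chunk`) are made up for illustration.

```python
import random


def file_stats(files):
    """Return the (min, max) of the filter column for each file."""
    return [(min(f), max(f)) for f in files]


def files_to_scan(stats, lo, hi):
    """Count files whose [min, max] range overlaps the predicate lo <= x <= hi."""
    return sum(1 for mn, mx in stats if mx >= lo and mn <= hi)


def chunk(values, n_files):
    """Split a list of values into n_files equally sized 'files'."""
    size = len(values) // n_files
    return [values[i * size:(i + 1) * size] for i in range(n_files)]


values = list(range(1000))

# Layout A: rows sorted on the filter column, then split into 10 files.
clustered = chunk(values, 10)

# Layout B: the same rows, shuffled before being split into 10 files.
rng = random.Random(0)
shuffled_values = values[:]
rng.shuffle(shuffled_values)
scattered = chunk(shuffled_values, 10)

# Same predicate (100 <= x <= 199) against both layouts.
clustered_hits = files_to_scan(file_stats(clustered), 100, 199)
scattered_hits = files_to_scan(file_stats(scattered), 100, 199)
print("files scanned (clustered):", clustered_hits)
print("files scanned (scattered):", scattered_hits)
```

The same rows and the same query end up touching very different numbers of files, so a benchmark that does not control for layout may be measuring the layout rather than the format.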

In any case, I think it would be useful to have some benchmarks that show the performance difference for the same queries on a Parquet file vs a Delta table. The TPC-H queries in this notebook seem like a reasonable starting point.
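A minimal sketch of what such a harness could look like, using only the Python standard library. The two `fake_*_query` functions are placeholders I made up so the example runs on its own; in a real benchmark each would run one TPC-H query through a given engine against the Parquet or Delta copy of the data.

```python
import statistics
import time


def time_query(run_query, repeats=5):
    """Run a zero-argument callable several times; return the median wall time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def compare_formats(runners, repeats=5):
    """runners maps (engine, format) -> callable. Returns {(engine, format): median seconds}."""
    return {key: time_query(fn, repeats) for key, fn in runners.items()}


# Placeholder workloads (assumption): stand-ins for real engine queries.
def fake_parquet_query():
    sum(range(10_000))


def fake_delta_query():
    sum(range(20_000))


runners = {
    ("datafusion", "parquet"): fake_parquet_query,
    ("datafusion", "delta"): fake_delta_query,
}

results = compare_formats(runners, repeats=3)
for (engine, fmt), seconds in sorted(results.items()):
    print(f"{engine:12s} {fmt:8s} {seconds:.6f}s")
```

Taking the median over a few repeats is just one reasonable choice here; it keeps a single slow cold-cache run from dominating the comparison.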

Some benchmarks showing some realistic end-to-end query patterns would be cool too, for example:

djouallah commented 4 months ago

Now I have what may be a more useful use case: if you compare GlareDB, which uses DataFusion and delta-rs, against DataFusion reading a dataset generated by delta-rs Python, there is a non-trivial difference.

[image]

ion-elgreco commented 3 months ago

@djouallah can you try out polars-deltalake and share if you see improvements there?

djouallah commented 3 months ago

@ion-elgreco last time I checked, Polars did not support the full TPC-H SQL.

ion-elgreco commented 3 months ago

@djouallah I mean using my Polars extension, which does native reads with the Polars engine: https://pypi.org/project/polars-deltalake/

djouallah commented 3 months ago

I understand, but the benchmark uses SQL and Polars has limited SQL support, so unfortunately I can't run the test yet :(