Open MrPowers opened 10 months ago
I now have what may be a more useful use case: if you compare GlareDB, which uses DataFusion and delta-rs, against DataFusion reading a dataset generated by the delta-rs Python bindings, there is a non-trivial difference.
@djouallah can you try out polars-deltalake and share if you see improvements there?
@ion-elgreco last time I checked, Polars did not support the full TPC-H SQL
@djouallah I mean using my Polars extension which does native reads with Polars engine: https://pypi.org/project/polars-deltalake/
I understand. The benchmarks use SQL, and Polars has limited SQL support, so unfortunately I can't run the test yet :(
As mentioned by @djouallah in this PR, there are some queries where Parquet outperforms Delta Lake for DataFusion.
I mentioned in the thread how the data for a certain query can be optimally distributed in a Parquet file and poorly distributed in a Delta table, which might cause these differences.
In any case, I think it would be useful to have some benchmarks that show the performance differences of some queries on a Parquet file vs Delta Lake. The TPCH queries in this notebook seem like a reasonable starting point.
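As a starting point, here is a rough sketch of what such a comparison harness could look like. It is only an illustration, not the benchmark from the notebook: the helper names (`time_query`, `compare`, `datafusion_runners`), the table name `t`, and the paths are all hypothetical, and the DataFusion/deltalake wiring assumes the `datafusion` and `deltalake` Python packages, with the Delta table registered through its PyArrow dataset.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_query(run: Callable[[], object], repeats: int = 3) -> float:
    """Return the median wall-clock seconds over `repeats` calls of `run`."""
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(runners: Dict[str, Callable[[], object]], repeats: int = 3) -> Dict[str, float]:
    """Time every named runner and return {name: median seconds}."""
    return {name: time_query(run, repeats) for name, run in runners.items()}


def datafusion_runners(parquet_path: str, delta_path: str, sql: str) -> Dict[str, Callable[[], object]]:
    """Build one runner per storage format; `sql` must query a table named `t`.

    Assumption: both inputs hold the same data. The imports are local so the
    timing helpers above can be used without these packages installed.
    """
    from datafusion import SessionContext
    from deltalake import DeltaTable

    # Plain Parquet, registered directly with DataFusion.
    parquet_ctx = SessionContext()
    parquet_ctx.register_parquet("t", parquet_path)

    # Delta table, exposed to DataFusion as a PyArrow dataset.
    delta_ctx = SessionContext()
    delta_ctx.register_dataset("t", DeltaTable(delta_path).to_pyarrow_dataset())

    return {
        "parquet": lambda: parquet_ctx.sql(sql).collect(),
        "delta": lambda: delta_ctx.sql(sql).collect(),
    }
```

Usage would be something like `compare(datafusion_runners("data/lineitem.parquet", "data/lineitem_delta", "SELECT count(*) FROM t"))`, with placeholder paths. Taking the median over a few repeats is a simple way to dampen cold-cache noise; a real benchmark would probably also want to report the first (cold) run separately.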
Some benchmarks showing some realistic end-to-end query patterns would be cool too, for example: