Qbeast-io / qbeast-spark

Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
https://qbeast.io/qbeast-our-tech/
Apache License 2.0

Metadata time in queries with Qbeast Datasource is higher than expected #320

Closed: osopardo1 closed this issue 1 month ago

osopardo1 commented 5 months ago

While investigating simple queries in the Spark UI, we detected that the Metadata time for the Qbeast datasource is higher than expected.

Here's a comparison of a small (10-element) dataset read with Parquet, Delta, and Qbeast:

Parquet

image

Delta

image

Qbeast

image

While Delta and Parquet spent only 2 ms on Metadata time, Qbeast spent 593 ms. This is for a small dataset, and the situation could get worse, especially in high-append scenarios.
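To cross-check the gap outside the Spark UI, something like the following PySpark sketch could be used. It is only a rough proxy: it times a full `count()` per format by wall clock, which includes the metadata phase but is not the same number as the Metadata time metric shown in the UI. The `time_reads` helper, the `spark` session (assumed to already have the Delta and Qbeast extensions configured), and the table paths are hypothetical and not taken from this issue. The last lines just compute the slowdown from the figures reported above.

```python
import time
from typing import Callable, Dict


def time_ms(action: Callable[[], object], repeats: int = 3) -> float:
    """Best-of-N wall-clock time of `action`, in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best


def slowdown(timings_ms: Dict[str, float], baseline: str) -> Dict[str, float]:
    """Each format's time divided by the baseline format's time."""
    base = timings_ms[baseline]
    return {fmt: t / base for fmt, t in timings_ms.items()}


def time_reads(spark, paths: Dict[str, str]) -> Dict[str, float]:
    """Time a full read of the same data stored in each format.

    Assumptions (not from the issue): `spark` is an existing SparkSession
    and `paths` maps a format name ("parquet", "delta", "qbeast") to the
    location where that copy of the dataset was written.
    """
    return {
        fmt: time_ms(lambda f=fmt, p=path: spark.read.format(f).load(p).count())
        for fmt, path in paths.items()
    }


# Metadata times reported in this issue (Spark UI, milliseconds).
reported = {"parquet": 2.0, "delta": 2.0, "qbeast": 593.0}
ratios = slowdown(reported, baseline="parquet")
print(ratios["qbeast"])  # 296.5
```

With a live session, `slowdown(time_reads(spark, paths), "parquet")` would give the same kind of ratio for measured end-to-end times. Taking the best of several runs is meant to reduce the effect of JVM warm-up on the first read.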

I've checked the Execution Plan and the configuration, and there does not seem to be much difference aside from the Index used.

Further investigation is needed. Will keep the conversation going on this issue.

osopardo1 commented 2 months ago

I think this issue is related to #335

osopardo1 commented 1 month ago

Fixed with #335