Qbeast-io / qbeast-spark

Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
https://qbeast.io/qbeast-our-tech/
Apache License 2.0

Error control + enforce columnStats when indexing non-deterministic or source-changing DataFrames #479

Open osopardo1 opened 2 days ago

osopardo1 commented 2 days ago

As a first solution for #466, we need to require users to provide `columnStats` when indexing tables with the following characteristics:

`columnStats` would supply the data's min/max values before the DataFrame analysis; inferring them instead can produce inconsistent results when the DataFrame is loaded twice for indexing in any of the above use cases.
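To illustrate the inconsistency, here is a minimal plain-Python sketch. It is not qbeast-spark code. `ChangingSource`, `normalize`, and the `value_min` / `value_max` keys are hypothetical names made up for this example. It assumes the source returns more rows on the second load, and that min/max are used to scale values into `[0, 1]`.

```python
class ChangingSource:
    """Hypothetical source whose contents grow between loads."""

    def __init__(self):
        self.loads = 0

    def load(self):
        # Each load sees a wider range: 0..50 first, then 0..100.
        self.loads += 1
        upper = 50 * self.loads
        return [float(v) for v in range(0, upper + 1, 10)]


def normalize(values, lo, hi):
    """Scale values into [0, 1] using the given min/max."""
    return [(v - lo) / (hi - lo) for v in values]


def out_of_range(normalized):
    """Count normalized values that fall outside [0, 1]."""
    return sum(1 for x in normalized if x < 0.0 or x > 1.0)


source = ChangingSource()

# First load: min/max are inferred from the data seen at this point.
first_load = source.load()
inferred_min, inferred_max = min(first_load), max(first_load)

# Second load: the rows that actually get indexed differ from the first load.
second_load = source.load()
bad_inferred = out_of_range(normalize(second_load, inferred_min, inferred_max))

# Stats provided up front do not depend on what any particular load returns.
column_stats = {"value_min": 0.0, "value_max": 100.0}  # assumed, user-provided
bad_fixed = out_of_range(
    normalize(second_load, column_stats["value_min"], column_stats["value_max"])
)

print(bad_inferred, bad_fixed)
```

With inferred stats, some rows from the second load land outside the expected range. With stats fixed in advance, none do, as long as the provided bounds really cover the data.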

osopardo1 commented 2 days ago

Before: analyze to what extent it is possible to know the determinism of a column/query in advance.