Load of Qbeast DataSource should use the same Snapshot

Qbeast-io / qbeast-spark

Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!

Apache License 2.0

After identifying the root cause of Qbeast-io/qbeast-spark#414, we noticed that non-deterministic queries in Spark can corrupt the indexing process.

Indexing with Qbeast has 3 main stages:

1. Analysing the DataFrame
2. Indexing the DataFrame
3. Writing the DataFrame

Although all three stages use the same DataFrame object, the data is not materialized until we execute an Action (show(), collect(), count()...). Each stage forces an execution of the DataFrame when it completes, so the source can be read again at every stage.

If the source data changes between stages, the DataFrame should keep its initial state unless it is explicitly forced to do otherwise.
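
The effect can be reproduced without Spark. The following is a minimal sketch in plain Python, not Spark or Qbeast code: `LazyFrame` and the `source` list are hypothetical stand-ins for a lazily evaluated DataFrame and the files behind it, and the only assumption is that every action re-runs the read.

```python
from typing import Callable, List


class LazyFrame:
    """Hypothetical stand-in for a lazily evaluated DataFrame."""

    def __init__(self, read: Callable[[], List[int]]) -> None:
        # Nothing is read here; only the recipe for reading is stored.
        self._read = read

    def collect(self) -> List[int]:
        # Every action re-runs the read, so it sees the source as it is now.
        return list(self._read())


source = [1, 2, 3]              # stands in for the files backing the table
df = LazyFrame(lambda: source)  # the same object is used by all three stages

analysed = df.collect()         # stage 1: analyse
source.append(4)                # the source changes between stages
indexed = df.collect()          # stage 2: index
source.append(5)
written = df.collect()          # stage 3: write

print(analysed, indexed, written)
# → [1, 2, 3] [1, 2, 3, 4] [1, 2, 3, 4, 5]
```

The three stages operate on three different views of the data even though `df` never changed, which is the inconsistency described above.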

We should:

Understand how and when the QbeastSnapshot is loaded.
Correct the behavior so that, once initialized, a load keeps using the same collection of files.
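
One possible shape for the second point, sketched in plain Python under stated assumptions: resolve the file listing once and reuse it for every later action, with an explicit way to refresh. `PinnedRelation`, `list_files`, `refresh` and the file names are hypothetical, chosen for illustration only; this is not how QbeastSnapshot is actually implemented.

```python
from typing import Callable, List, Optional, Tuple


class PinnedRelation:
    """Hypothetical relation that resolves its file listing at most once."""

    def __init__(self, list_files: Callable[[], List[str]]) -> None:
        self._list_files = list_files
        self._pinned: Optional[Tuple[str, ...]] = None

    def files(self) -> Tuple[str, ...]:
        # The first call pins the listing; later calls reuse the same tuple.
        if self._pinned is None:
            self._pinned = tuple(self._list_files())
        return self._pinned

    def refresh(self) -> None:
        # Explicit opt-in to a newer view of the source.
        self._pinned = None


table_files = ["part-0.parquet", "part-1.parquet"]  # made-up file names
relation = PinnedRelation(lambda: table_files)

analysed = relation.files()                 # stage 1 pins the listing
table_files.append("part-2.parquet")        # a concurrent write lands
indexed = relation.files()                  # stage 2 sees the pinned files
written = relation.files()                  # stage 3 sees the pinned files

print(analysed == indexed == written)       # → True
print(len(written))                         # → 2

relation.refresh()                          # forced to do otherwise
print(len(relation.files()))                # → 3
```

Storing the listing as an immutable tuple keeps later changes to the source from leaking into a load that has already started, while `refresh()` covers the case where a newer view is requested on purpose.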

Load of Qbeast DataSource should use the same Snapshot #466