Qbeast-io / qbeast-spark

Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
https://qbeast.io/qbeast-our-tech/
Apache License 2.0

Load of Qbeast DataSource should use the same Snapshot #466

Open osopardo1 opened 22 hours ago

osopardo1 commented 22 hours ago

After realising the root cause of Qbeast-io/qbeast-spark#414, we noticed that non-deterministic queries on Spark can corrupt the indexing process.

Indexing with Qbeast has 3 main stages:

  1. Analysing the DataFrame
  2. Indexing the DataFrame
  3. Writing the DataFrame

Although the DataFrame is the same object, the data is not materialized until we execute an Action (show(), collect(), count()...). Each time a stage is completed, we force an execution of the DataFrame.

If the source data changes between stages, the DataFrame should keep its initial state unless it is explicitly forced to do otherwise.
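
To make the failure mode concrete, here is a toy model in plain Python rather than Spark. `LazyFrame` is a hypothetical stand-in for a lazily evaluated DataFrame, and the three `collect()` calls stand in for the analyse, index and write stages. It is a sketch of the behaviour described above, not Qbeast code.

```python
class LazyFrame:
    """Hypothetical stand-in for a lazily evaluated DataFrame."""

    def __init__(self, source):
        # Keep a reference to the source, not a copy of its rows.
        self._source = source

    def collect(self):
        # Every action re-reads the source, like re-running the query plan.
        return list(self._source)


source = [1, 2, 3]
df = LazyFrame(source)

analysed = df.collect()  # stage 1: analyse
source.append(4)         # the source changes between stages
indexed = df.collect()   # stage 2: index
source.append(5)
written = df.collect()   # stage 3: write

print(len(analysed), len(indexed), len(written))  # prints "3 4 5"
```

The same `df` object yields three different row sets, so whatever was computed during the analysis stage no longer describes the rows that get written.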

We should:

fpj commented 22 hours ago

I think the description above is not fully capturing the recent discussions because:

  1. If the DataFrame materializes on each individual invocation, then concurrent writes might produce different data across invocations. Using the same snapshot should prevent that from happening and lock the state of the DataFrame.
  2. I'm still unclear on how non-deterministic operations corrupt the index. Perhaps instead of stating it like this, we should say that we observed incorrect behavior with a given query.
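
As a rough illustration of the first point, the sketch below pins a snapshot version at load time. It is plain Python, not Spark or Qbeast code: `VersionedTable` and `PinnedFrame` are hypothetical names, and the table is assumed to be append-only with one immutable snapshot per commit.

```python
class VersionedTable:
    """Hypothetical append-only table with one immutable snapshot per commit."""

    def __init__(self, rows):
        self._snapshots = [tuple(rows)]

    def append(self, rows):
        # A commit creates a new snapshot; older ones are never modified.
        self._snapshots.append(self._snapshots[-1] + tuple(rows))

    def latest_version(self):
        return len(self._snapshots) - 1

    def read(self, version):
        return list(self._snapshots[version])


class PinnedFrame:
    """Lazy frame that resolves the snapshot version once, at load time."""

    def __init__(self, table):
        self._table = table
        self._version = table.latest_version()

    def collect(self):
        # Still re-evaluated on every action, but always against the same snapshot.
        return self._table.read(self._version)


table = VersionedTable([1, 2, 3])
df = PinnedFrame(table)

analysed = df.collect()
table.append([4])        # concurrent write between stages
indexed = df.collect()
table.append([5])
written = df.collect()

fresh = PinnedFrame(table).collect()  # a new load sees the latest snapshot
```

Here all three stages see `[1, 2, 3]`, while a fresh load sees the appended rows. Pinning only addresses concurrent writes; it would not by itself make a query deterministic if the query contains non-deterministic operations, which is the open question in the second point.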