Open osopardo1 opened 22 hours ago
I think the description above does not fully capture the recent discussions because:
1- If the DataFrame is materialized separately on each invocation, concurrent writes can make the data differ between invocations. Reading from the same snapshot should prevent this by locking the state of the DataFrame.
2- I'm still unclear on how non-deterministic operations corrupt the index. Perhaps, instead of stating it like this, we should say that we observed incorrect behavior with a given query.
After identifying the root cause of Qbeast-io/qbeast-spark#414, we noticed that non-deterministic queries in Spark can corrupt the indexing process.
Indexing with Qbeast has 3 main stages:
Although the DataFrame is the same object throughout, its data is not materialized until we execute an action (show(), collect(), count()...). Each time a stage completes, we force an execution of the DataFrame, so each stage may recompute the data from the source.
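To make the re-execution problem concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's or Qbeast's API: `LazyFrame`, `materialize`, and the sampling plan are hypothetical names invented for illustration. It shows that a lazily evaluated, non-deterministic plan can return different rows on each action, and that running the plan once and reusing the result keeps later stages consistent.

```python
import random


class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (hypothetical, not Spark's API)."""

    def __init__(self, plan):
        # plan: zero-argument callable that produces the rows when invoked
        self._plan = plan

    def collect(self):
        # An "action": the whole plan is re-executed on every call.
        return list(self._plan())

    def count(self):
        return len(self.collect())

    def materialize(self):
        # Run the plan once and return a frame backed by that fixed result.
        rows = self.collect()
        return LazyFrame(lambda: rows)


rng = random.Random(0)
source = range(1000)

# Non-deterministic plan: keep each row with probability 0.5.
sampled = LazyFrame(lambda: (x for x in source if rng.random() < 0.5))

# Two "stages" that each trigger an action see different samples.
stage_1 = sampled.collect()
stage_2 = sampled.collect()

# After materializing once, every later stage sees the same rows.
pinned = sampled.materialize()
stage_a = pinned.collect()
stage_b = pinned.collect()
```

In Spark itself, `persist()` or `checkpoint()` are the usual ways to avoid recomputing a DataFrame, though whether either fits the indexing pipeline is not something this issue establishes.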
If the source data changes between stages, the DataFrame should keep its initial state unless it is explicitly forced to do otherwise.
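The requirement above can be illustrated with another toy sketch in plain Python. `VersionedTable` and its methods are hypothetical and only model the idea of pinning a read to a snapshot; they are not the Delta Lake or Qbeast API. A read that does not specify a version sees concurrent writes, while a read pinned to the version captured at the start does not.

```python
class VersionedTable:
    """Toy append-only table that keeps every committed version (hypothetical)."""

    def __init__(self, rows):
        self._versions = [tuple(rows)]

    def append(self, rows):
        # Each write commits a new version; older versions stay readable.
        self._versions.append(self._versions[-1] + tuple(rows))

    @property
    def latest_version(self):
        return len(self._versions) - 1

    def read(self, version=None):
        # Without a version, a read sees whatever is latest at call time.
        if version is None:
            version = self.latest_version
        return self._versions[version]


table = VersionedTable([1, 2, 3])

# Stage 1: read the table and remember which version was seen.
unpinned_before = table.read()
pinned_version = table.latest_version

# A concurrent writer commits between stages.
table.append([4, 5])

# Stage 2: an unpinned read drifts, a pinned read does not.
unpinned_after = table.read()
pinned_after = table.read(pinned_version)
```

Delta Lake exposes a comparable idea through the `versionAsOf` read option; whether the indexing stages should rely on it is an open question here, not a conclusion of the discussion.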
We should: