Open sethwhite-sf opened 11 months ago
This is interesting. To the best of my understanding, this sounds more like a Spark view-handling issue than an Iceberg Spark integration issue: Spark temp views (their resolved plans) get cached in the session catalog as well.
Wondering where exactly the fix should be.
@singhpk234, rdd (on which view1 is created) holds a reference to the logical plan, which in turn references an older version of the Iceberg table. After the cache entry expires, nothing refreshes the stale table referenced by rdd. Iceberg should ideally refresh the tables regardless of whether caching is enabled.
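As a workaround sketch (not a confirmed fix, and using the same session, table, and view names as in the report), re-creating the temp view after the table changes forces Spark to resolve the table again, so the view's plan picks up a fresh org.apache.iceberg.Table object:

```java
// Hypothetical workaround sketch: rebuild the temp view so its logical plan
// resolves the Iceberg table again and references the current snapshot.
Dataset<Row> fresh = spark.read().format("iceberg").load("table1");
fresh.createOrReplaceTempView("view1");
spark.sql("SELECT * FROM view1").show();
```

This avoids the stale reference at the cost of re-registering the view after every write, which is why a fix inside the caching layer would be preferable.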
Apache Iceberg version
1.4.1 (latest release)
Query engine
Spark
Please describe the bug 🐞
We have found that temporary views that reference an iceberg table become stale when catalog caching is enabled: spark.sql.catalog.catalog-name.cache-enabled=true.
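For reference, the relevant catalog caching options can be passed as Spark conf entries (option names as documented by Iceberg; the catalog name and the expiration interval shown here are illustrative):

```shell
# Enable Iceberg catalog caching for a catalog named "catalog-name";
# cache.expiration-interval-ms defaults to 30000 (30 seconds).
spark-shell \
  --conf spark.sql.catalog.catalog-name.cache-enabled=true \
  --conf spark.sql.catalog.catalog-name.cache.expiration-interval-ms=30000
```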
Initially, a view is created:
Dataset<Row> rdd = spark.read().format("iceberg").load("table1");
rdd.createOrReplaceTempView("view1");
The view and the catalog cache reference the same org.apache.iceberg.Table object, and the view reflects any changes the application makes when it is queried:
spark.sql("SELECT * from view1").show(); // query returns latest state of the table
However, once cache expiry occurs (after 30 seconds by default when caching is enabled), subsequent updates to the table, such as
spark.sql("DELETE FROM table1 AS t WHERE t.id IS NULL");
cause a new entry for the table to be created in the cache, and the view no longer sees any of the changes that are made: it becomes stale, because it still holds the original org.apache.iceberg.Table object, which references an Iceberg table snapshot that is no longer current. The view and the cache are no longer in sync.
spark.sql("SELECT * from view1").show(); // No longer returns latest state of the table
The unit test below illustrates the problem. The test fails when the default catalog caching is enabled.
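The reporter's original unit test is not included in this capture. A minimal reproduction along the same lines might look like the sketch below (hypothetical, not the original test: it assumes the Iceberg Spark runtime and SQL extensions are on the classpath, uses a local Hadoop catalog named `local`, and shortens the cache expiration interval so the run does not need to wait 30 seconds):

```java
// Hypothetical reproduction sketch for the stale-temp-view behavior.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StaleTempViewRepro {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .master("local[1]")
        .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.local.type", "hadoop")
        .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
        .config("spark.sql.catalog.local.cache-enabled", "true")
        // Shorten the expiry so the test does not have to wait 30 seconds.
        .config("spark.sql.catalog.local.cache.expiration-interval-ms", "1000")
        .getOrCreate();

    spark.sql("CREATE TABLE local.db.table1 (id INT) USING iceberg");
    spark.sql("INSERT INTO local.db.table1 VALUES (1), (NULL)");

    Dataset<Row> rdd = spark.read().format("iceberg").load("local.db.table1");
    rdd.createOrReplaceTempView("view1");

    // Let the catalog cache entry for table1 expire.
    Thread.sleep(2000);

    spark.sql("DELETE FROM local.db.table1 AS t WHERE t.id IS NULL");

    // One row remains in the table after the delete; with caching enabled
    // the stale view may still report the pre-delete row count.
    long tableCount = spark.sql("SELECT * FROM local.db.table1").count();
    long viewCount = spark.sql("SELECT * FROM view1").count();
    System.out.println("table1 rows: " + tableCount + ", view1 rows: " + viewCount);
  }
}
```

With the bug present, `viewCount` would disagree with `tableCount` after the expiry window, which is the assertion the reporter's failing test presumably makes.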