apache / kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
https://kyuubi.apache.org/
Apache License 2.0
2.05k stars 900 forks

[Improvement] Table Relation Cache Feature Should Be Configurable in Kyuubi Server #2857

Open packyan opened 2 years ago

packyan commented 2 years ago

What would you like to be improved?

My scenario is to use Kyuubi as a replacement for HiveServer2. When Kyuubi and Hive are used side by side, the same table may be modified by two different execution engines at the same time. Because Spark SQL caches table metadata, SparkSQLEngine does not pick up changes made to the table by the other engine in a timely manner.

For example, a Kyuubi user first queries a table with the Spark SQL engine, then a Hive user inserts some records into it. When the Kyuubi user queries the table again, the result is unchanged. In another example, a Kyuubi user queries a table, then a Hive user truncates it. When the Kyuubi user queries the table again, Spark SQL throws an exception like the following:

It is possible the underlying files have been updated. You can explicitly invalidate
the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by
recreating the Dataset/DataFrame involved.
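
The stale read described above can be modelled with a small toy cache. This is only an illustrative sketch, not Spark code: the `ToyCatalog` class and its methods are hypothetical names, and it assumes the engine keeps a per-table snapshot of the file listing until the entry is refreshed or the cache size is zero.

```python
from collections import OrderedDict


class ToyCatalog:
    """Toy stand-in for an engine-side table relation cache (not Spark code)."""

    def __init__(self, storage, cache_size):
        self.storage = storage        # shared "file system": table name -> list of files
        self.cache_size = cache_size  # 0 disables caching entirely
        self.cache = OrderedDict()    # table name -> cached snapshot of the file list

    def lookup(self, table):
        # Serve from the cache when possible, even if storage changed since.
        if table in self.cache:
            self.cache.move_to_end(table)
            return self.cache[table]
        files = list(self.storage[table])  # snapshot the current listing
        if self.cache_size > 0:
            self.cache[table] = files
            if len(self.cache) > self.cache_size:
                self.cache.popitem(last=False)  # evict the least recently used entry
        return files

    def refresh(self, table):
        # Rough analogue of running REFRESH TABLE: drop the cached entry.
        self.cache.pop(table, None)


storage = {"t": ["part-0"]}
cached = ToyCatalog(storage, cache_size=1000)
uncached = ToyCatalog(storage, cache_size=0)

cached.lookup("t")
uncached.lookup("t")
storage["t"].append("part-1")  # another engine writes to the same table

stale = cached.lookup("t")     # still the old snapshot
fresh = uncached.lookup("t")   # re-reads storage on every lookup
cached.refresh("t")
after_refresh = cached.lookup("t")
```

In this model, a cache size of zero trades repeated listing work for always-current results, which is the trade-off behind the workarounds discussed in this issue.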

How should we improve?

Although there are many ways to solve this problem, such as refreshing the table before querying it or setting spark.sql.filesourceTableRelationCacheSize to zero when opening a Kyuubi session, these methods are not user-friendly. We should provide a configuration to toggle the table relation cache feature. In many scenarios, a Kyuubi admin may choose to turn off the Spark SQL table relation cache to reduce user complaints.
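
One way to picture the proposed toggle is as a thin translation layer from a Kyuubi-side setting to the existing Spark setting. The sketch below is only an illustration of that idea: the key `kyuubi.engine.spark.tableRelationCache.enabled` and the function `to_spark_conf` are hypothetical and do not exist in Kyuubi; only `spark.sql.filesourceTableRelationCacheSize` comes from this issue.

```python
SPARK_CACHE_SIZE_KEY = "spark.sql.filesourceTableRelationCacheSize"
# Hypothetical Kyuubi-side key, invented here purely for illustration.
TOGGLE_KEY = "kyuubi.engine.spark.tableRelationCache.enabled"


def to_spark_conf(kyuubi_conf):
    """Derive the Spark settings to pass to the engine from Kyuubi settings."""
    # Pass through anything that is already a Spark setting.
    spark_conf = {k: v for k, v in kyuubi_conf.items() if k.startswith("spark.")}
    toggle = kyuubi_conf.get(TOGGLE_KEY)
    # Only an explicit "false" disables the cache; otherwise Spark's own
    # default (or an explicit spark.* value) is left untouched.
    if toggle is not None and toggle.strip().lower() == "false":
        spark_conf[SPARK_CACHE_SIZE_KEY] = "0"
    return spark_conf


disabled = to_spark_conf({TOGGLE_KEY: "false", "spark.master": "yarn"})
untouched = to_spark_conf({"spark.master": "yarn"})
```

With this design, an admin who does not know the Spark property name only flips one Kyuubi-side switch, while admins who already set the Spark property directly are unaffected unless the toggle is explicitly set to false.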

cxzl25 commented 2 years ago

Kyuubi supports configuring Spark parameters in kyuubi-defaults.conf, for example:

spark.sql.filesourceTableRelationCacheSize 0

This way it takes effect on the Spark engine.

You can also add the configuration to spark-defaults.conf.

packyan commented 2 years ago

Yes, this is a solution, but what I mean is that we could turn the Spark configurations that often need to be modified into Kyuubi configurations, so that Kyuubi administrators who are not very familiar with the Spark engine can configure them more easily.

Take #1018 as an example: that property can also be configured through spark-defaults.conf, but it is now hard-coded into the Spark SQL engine. So I think Spark configurations that are commonly used or modified in Kyuubi scenarios should be managed by corresponding configurations on the Kyuubi side.

cc @yaooqinn @cxzl25

yaooqinn commented 2 years ago

I am OK with changing Spark defaults if it is reasonable for our use cases.