apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[spark] Support table options via SQL conf for Spark Engine #4393

Closed xiangyuf closed 2 weeks ago

xiangyuf commented 3 weeks ago

Purpose

Linked issue: close #4371

In some cases, users may want to use Spark time travel by setting a property such as `set spark.paimon.scan.tag-name=tag_3`. However, this property takes effect globally, which is a problem if the Spark job reads multiple tables at the same time.

It would be better if we could support table options via SQL conf for the Spark engine, so that users can specify different time travel options for different tables.
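A sketch of what this might look like (hypothetical database and table names; the exact key format is discussed and settled later in this thread):

```sql
-- Hypothetical per-table time travel settings via Spark SQL conf.
-- Each table reads from its own tag instead of one global setting.
SET spark.paimon.db1.orders.scan.tag-name=tag_3;
SET spark.paimon.db1.customers.scan.tag-name=tag_5;

SELECT * FROM db1.orders o JOIN db1.customers c ON o.cid = c.id;
```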

Tests

API and Format

Documentation

xiangyuf commented 3 weeks ago

@YannByron @Aitozi Hi, would you kindly review this?

Aitozi commented 3 weeks ago

Do we need to support tables with the same name in different db/catalog? Just like Flink's global option does. https://github.com/apache/paimon/pull/2104

JingsongLi commented 3 weeks ago

Do we need to support tables with the same name in different db/catalog? Just like Flink's global option does. #2104

I think we should find a unified way that covers both Flink and Spark.

xiangyuf commented 3 weeks ago

Do we need to support tables with the same name in different db/catalog? Just like Flink's global option does. #2104

I think we should find a unified way that covers both Flink and Spark.

@Aitozi @JingsongLi Thanks for the reply. +1 for unifying this.

xiangyuf commented 3 weeks ago

@JingsongLi @Aitozi Hi, I've unified Flink and Spark to support both dynamic table options and global options.

Global options format:
Flink: `${config_key}`
Spark: `spark.paimon.${config_key}`

Table options format:
Flink: `paimon.${catalogName}.${dbName}.${tableName}.${config_key}`
Spark: `spark.paimon.${dbName}.${tableName}.${config_key}`

Dynamic table options will override global options if there are conflicts.

WDYT?
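The override rule described above can be sketched as follows. This is a minimal illustration of the lookup order, not Paimon's actual implementation; the class and method names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: a key scoped to one table (spark.paimon.<db>.<table>.<key>)
// takes precedence over the global form (spark.paimon.<key>).
// Illustrative only; not Paimon's real resolution code.
public class OptionResolution {
    static String resolve(Map<String, String> conf, String db, String table, String key) {
        String tableScoped = "spark.paimon." + db + "." + table + "." + key;
        String global = "spark.paimon." + key;
        if (conf.containsKey(tableScoped)) {
            return conf.get(tableScoped); // table option overrides global
        }
        return conf.get(global); // may be null if neither is set
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("spark.paimon.scan.tag-name", "tag_1");        // global
        conf.put("spark.paimon.db1.t1.scan.tag-name", "tag_3"); // table-scoped

        System.out.println(resolve(conf, "db1", "t1", "scan.tag-name")); // tag_3
        System.out.println(resolve(conf, "db1", "t2", "scan.tag-name")); // tag_1
    }
}
```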

xiangyuf commented 2 weeks ago

@Aitozi I've updated the dynamic global options format for Flink to `${config_key}` instead of `paimon.${config_key}`.

Aitozi commented 2 weeks ago

@Aitozi I've updated the dynamic global options format for Flink to `${config_key}` instead of `paimon.${config_key}`.

Got it, LGTM.

Zouxxyy commented 2 weeks ago

@JingsongLi @Aitozi Hi, I've unified Flink and Spark to support both dynamic table options and global options.

Global options format:
Flink: `${config_key}`
Spark: `spark.paimon.${config_key}`

Table options format:
Flink: `paimon.${catalogName}.${dbName}.${tableName}.${config_key}`
Spark: `spark.paimon.${dbName}.${tableName}.${config_key}`

Dynamic table options will override global options if there are conflicts.

WDYT?

Why does Flink contain `${catalogName}`, but Spark does not?

xiangyuf commented 2 weeks ago

@JingsongLi @Aitozi Hi, I've unified Flink and Spark to support both dynamic table options and global options.

Global options format:
Flink: `${config_key}`
Spark: `spark.paimon.${config_key}`

Table options format:
Flink: `paimon.${catalogName}.${dbName}.${tableName}.${config_key}`
Spark: `spark.paimon.${dbName}.${tableName}.${config_key}`

Dynamic table options will override global options if there are conflicts.

WDYT?

Why does Flink contain `${catalogName}`, but Spark does not?

@Zouxxyy Updated the Spark table option format to: `spark.paimon.${catalogName}.${dbName}.${tableName}.${config_key}`
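With the final format, a per-table setting would include the catalog name as well. A sketch (hypothetical catalog, database, and table names):

```sql
-- Hypothetical example of the final table-scoped key format,
-- which now includes the catalog name.
SET spark.paimon.my_catalog.db1.orders.scan.tag-name=tag_3;
```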

xiangyuf commented 2 weeks ago

@Zouxxyy @JingsongLi CI has passed, please take a look.