apache / kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
https://kyuubi.apache.org/
Apache License 2.0

[Bug][AuthZ] Kyuubi has no permission to access the Iceberg metadata table after integrating Ranger #3924

Open MLikeWater opened 1 year ago

MLikeWater commented 1 year ago

Describe the bug

Environment

Spark version: 3.2.2
Kyuubi version: apache-kyuubi-1.7.0-SNAPSHOT-bin (master)

./build/dist --tgz --spark-provided --flink-provided -Pspark-3.2

Iceberg version: 0.14.1

wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.2_2.12/0.14.1/iceberg-spark-runtime-3.2_2.12-0.14.1.jar

Perform SQL operations

use testdb;
CREATE TABLE testdb.iceberg_tbl (id bigint, data string) USING iceberg;
INSERT INTO testdb.iceberg_tbl VALUES (1, 'a'), (2, 'b'), (3, 'c');
select * from testdb.iceberg_tbl;
+-----+-------+
| id  | data  |
+-----+-------+
| 1   | a     |
| 2   | b     |
| 3   | c     |
+-----+-------+

SELECT * FROM testdb.iceberg_tbl.history;

22/12/07 17:16:37 ERROR ExecuteStatement: Error operating ExecuteStatement: org.apache.kyuubi.plugin.spark.authz.AccessControlException: Permission denied: user [test_user] does not have [select] privilege on [testdb.iceberg_tbl/history/made_current_at]
        at org.apache.kyuubi.plugin.spark.authz.ranger.SparkRangerAdminPlugin$.verify(SparkRangerAdminPlugin.scala:128)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization$.$anonfun$checkPrivileges$5(RuleAuthorization.scala:94)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization$.$anonfun$checkPrivileges$5$adapted(RuleAuthorization.scala:93)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization$.checkPrivileges(RuleAuthorization.scala:93)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization.apply(RuleAuthorization.scala:36)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization.apply(RuleAuthorization.scala:33)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:211)
        at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
        at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
        at scala.collection.immutable.List.foldLeft(List.scala:91)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
        at scala.collection.immutable.List.foreach(List.scala:431)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
        at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:179)
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:125)
        at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:183)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:183)
        at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:121)
        at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:117)
        at org.apache.spark.sql.execution.QueryExecution.assertOptimized(QueryExecution.scala:135)
        at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:153)
        at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:150)
        at org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:201)
        at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:246)
        at org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:215)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:98)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
        at org.apache.spark.sql.Dataset.toLocalIterator(Dataset.scala:3000)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement$$anon$2.iterator(ExecuteStatement.scala:107)
        at org.apache.kyuubi.operation.IterableFetchIterator.<init>(FetchIterator.scala:78)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement.$anonfun$executeStatement$1(ExecuteStatement.scala:106)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.kyuubi.engine.spark.operation.SparkOperation.withLocalProperties(SparkOperation.scala:98)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement.org$apache$kyuubi$engine$spark$operation$ExecuteStatement$$executeStatement(ExecuteStatement.scala:90)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement$$anon$3.run(ExecuteStatement.scala:149)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

For an Iceberg table, querying metadata information normally works, for example:

# history
0: jdbc:hive2://xx.xx.xx.xx:10011/default> SELECT * FROM shdw.iceberg_tbl.history;
+--------------------------+----------------------+------------+----------------------+
|     made_current_at      |     snapshot_id      | parent_id  | is_current_ancestor  |
+--------------------------+----------------------+------------+----------------------+
| 2022-05-09 10:58:35.835  | 6955843267870447517  | NULL       | true                 |
+--------------------------+----------------------+------------+----------------------+

# snapshots
0: jdbc:hive2://xx.xx.xx.xx:10011/default> SELECT * FROM shdw.iceberg_tbl.snapshots;
+--------------------------+----------------------+------------+------------+----------------------------------------------------+----------------------------------------------------+
|       committed_at       |     snapshot_id      | parent_id  | operation  |                   manifest_list                    |                      summary                       |
+--------------------------+----------------------+------------+------------+----------------------------------------------------+----------------------------------------------------+
| 2022-05-09 10:58:35.835  | 6955843267870447517  | NULL       | append     | hdfs://cluster1/tgwarehouse/shdw.db/iceberg_tbl/metadata/snap-6955843267870447517-1-e8206624-fbc3-4cf5-b2cb-2db672393253.avro | {"added-data-files":"3","added-files-size":"1929","added-records":"3","changed-partition-count":"1","spark.app.id":"spark-application-1652065040852","total-data-files":"3","total-delete-files":"0","total-equality-deletes":"0","total-files-size":"1929","total-position-deletes":"0","total-records":"3"} |
+--------------------------+----------------------+------------+------------+----------------------------------------------------+----------------------------------------------------+

# history join snapshots
0: jdbc:hive2://xx.xx.xx.xx:10011/default> select
    h.made_current_at,
    s.operation,
    h.snapshot_id,
    h.is_current_ancestor,
    s.summary['spark.app.id']
from shdw.iceberg_tbl.history h
join shdw.iceberg_tbl.snapshots s
  on h.snapshot_id = s.snapshot_id
order by made_current_at
+--------------------------+------------+----------------------+----------------------+----------------------------------+
|     made_current_at      | operation  |     snapshot_id      | is_current_ancestor  |      summary[spark.app.id]       |
+--------------------------+------------+----------------------+----------------------+----------------------------------+
| 2022-05-09 10:58:35.835  | append     | 6955843267870447517  | true                 | spark-application-1652065040852  |
+--------------------------+------------+----------------------+----------------------+----------------------------------+

Affects Version(s)

1.7.0 (master branch)

Kyuubi Server Log Output

No response

Kyuubi Engine Log Output

22/12/07 16:53:57 ERROR ExecuteStatement: Error operating ExecuteStatement: org.apache.kyuubi.plugin.spark.authz.AccessControlException: Permission denied: user [test_user] does not have [select] privilege on [testdb.foo/history/made_current_at]
        at org.apache.kyuubi.plugin.spark.authz.ranger.SparkRangerAdminPlugin$.verify(SparkRangerAdminPlugin.scala:128)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization$.$anonfun$checkPrivileges$5(RuleAuthorization.scala:94)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization$.$anonfun$checkPrivileges$5$adapted(RuleAuthorization.scala:93)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization$.checkPrivileges(RuleAuthorization.scala:93)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization.apply(RuleAuthorization.scala:36)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization.apply(RuleAuthorization.scala:33)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:211)
        at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
        at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
        at scala.collection.immutable.List.foldLeft(List.scala:91)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
        at scala.collection.immutable.List.foreach(List.scala:431)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
        at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:179)
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:125)
        at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:183)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:183)
        at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:121)
        at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:117)
        at org.apache.spark.sql.execution.QueryExecution.assertOptimized(QueryExecution.scala:135)
        at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:153)
        at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:150)
        at org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:201)
        at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:246)
        at org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:215)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:98)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
        at org.apache.spark.sql.Dataset.toLocalIterator(Dataset.scala:3000)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement$$anon$2.iterator(ExecuteStatement.scala:107)
        at org.apache.kyuubi.operation.IterableFetchIterator.<init>(FetchIterator.scala:78)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement.$anonfun$executeStatement$1(ExecuteStatement.scala:106)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.kyuubi.engine.spark.operation.SparkOperation.withLocalProperties(SparkOperation.scala:98)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement.org$apache$kyuubi$engine$spark$operation$ExecuteStatement$$executeStatement(ExecuteStatement.scala:90)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement$$anon$3.run(ExecuteStatement.scala:149)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Kyuubi Server Configurations

spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension,org.apache.kyuubi.plugin.spark.authz.ranger.RangerSparkExtension,org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type=hive

Kyuubi Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

bowenliang123 commented 1 year ago

cc @bowenliang123 @yaooqinn

bowenliang123 commented 1 year ago
[screenshot: debugger output showing the table identifier resolved as Some(iceberg_ns.owner_variable.history)]

I don't have a clue how to exclude metadata tables like history/snapshots from the table identifier. As shown above, the table identifier from select * from iceberg_ns.owner_variable.history is Some(iceberg_ns.owner_variable.history). Is there a possible way to check whether the table is in an Iceberg catalog and then skip the metadata tables?
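A minimal sketch of a name-based guard, assuming the identifier arrives as a multipart Seq[String] and using metadata table names from the Iceberg docs (the set is version-dependent, so treat it as an assumption):

object IcebergMetadataTables {
  // Metadata table names as documented around Iceberg 0.14; this set is an
  // assumption and may need updating for other Iceberg versions.
  val names: Set[String] = Set(
    "history", "snapshots", "files", "manifests", "partitions",
    "all_data_files", "all_manifests")

  // True for a three-part (or longer) identifier whose last part is a known
  // metadata table name, e.g. Seq("testdb", "iceberg_tbl", "history").
  def isMetadataTable(parts: Seq[String]): Boolean =
    parts.size >= 3 && names.contains(parts.last.toLowerCase)
}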

pan3793 commented 1 year ago

The metadata tables are enumerable; maybe we can hardcode a mapping of the metadata tables' permission check onto the data table?
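A hedged sketch of that mapping, reusing the hypothetical IcebergMetadataTables helper above: strip the metadata suffix so the privilege check for testdb.iceberg_tbl.history lands on testdb.iceberg_tbl instead:

// Rewrite a metadata-table identifier back to its data table before the
// Ranger check; identifiers that are not metadata tables pass through.
def remapForAuthz(parts: Seq[String]): Seq[String] =
  if (IcebergMetadataTables.isMetadataTable(parts)) parts.dropRight(1)
  else parts

// remapForAuthz(Seq("testdb", "iceberg_tbl", "history"))
//   returns Seq("testdb", "iceberg_tbl")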

bowenliang123 commented 1 year ago

The metadata tables are enumerable; maybe we can hardcode a mapping of the metadata tables' permission check onto the data table?

Yes, but first, how do we check that the real table is an Iceberg one?

MLikeWater commented 1 year ago

@pan3793 @bowenliang123 Thanks for your support. Different data lake technologies may have different metadata tables. It is possible to judge whether a table is an Iceberg or Hudi table from the structure of the created table (see the sketch after the output below):

use testdb;
show create table iceberg_tbl;
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE spark_catalog.testdb.iceberg_tbl (
  `id` BIGINT,
  `data` STRING)
USING iceberg
LOCATION 'hdfs://cluster1/tgwarehouse/testdb.db/iceberg_tbl'
TBLPROPERTIES(
  'current-snapshot-id' = '4900628243476923676',
  'format' = 'iceberg/parquet',
  'format-version' = '1')
 |
+----------------------------------------------------+
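A hedged alternative to parsing SHOW CREATE TABLE output is to ask the session catalog for the table's provider; whether the provider is recorded as iceberg depends on how the table was created, so this is an assumption rather than a guaranteed check:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier

// Look up the provider ("iceberg", "hudi", "hive", ...) recorded in the
// catalog entry of a table, if any.
def tableProvider(spark: SparkSession, db: String, table: String): Option[String] =
  spark.sessionState.catalog
    .getTableMetadata(TableIdentifier(table, Some(db)))
    .provider.map(_.toLowerCase)

// tableProvider(spark, "testdb", "iceberg_tbl").contains("iceberg")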

yaooqinn commented 1 year ago

Why not just grant the select privilege to the user who accesses testdb.iceberg_tbl.history?

yaooqinn commented 1 year ago

Is this case equivalent to one where you visit a Hive table while you don't have permission to access the HMS table or record that stores its metadata?

In other words, if we have the ALTER privilege on the raw table and perform an ALTER operation on it, the metadata changes accordingly. That does not mean we need the ALTER privilege on the metadata directly, which would amount to an ability to falsify critical information.

MLikeWater commented 1 year ago

Why not just grant the select privilege to the user who accesses testdb.iceberg_tbl.history?

@yaooqinn The Iceberg metadata tables, such as history or snapshots, are not stored in the Hive metastore, so they cannot be authorized by Ranger.

bowenliang123 commented 1 year ago

Why not just grant the select privilege to the user who accesses testdb.iceberg_tbl.history?

This could be a workaround. But these tables are more like meta tables than ordinary tables. For querying purposes, these derived tables of a source table could be treated as part of the table itself, just like its columns.

bowenliang123 commented 1 year ago
[screenshot: debugger output showing the scanned relation backed by Iceberg's SparkTable/HistoryTable classes]

With further investigation, I think we could tell whether it's a HistoryTable of an Iceberg table to resolve this. SparkTable and HistoryTable are classes from the Iceberg Spark plugin.
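A minimal sketch of that check, assuming the class names shipped in the Iceberg Spark runtime (SparkTable wrapping a BaseMetadataTable subclass such as HistoryTable) and using reflection to avoid a compile-time dependency on Iceberg:

import org.apache.spark.sql.connector.catalog.Table

// True when the DSv2 table is Iceberg's SparkTable wrapping a metadata
// table (HistoryTable, SnapshotsTable, ... all extend BaseMetadataTable).
// The class and method names are assumptions and may vary across versions.
def isIcebergMetadataTable(table: Table): Boolean =
  table.getClass.getName == "org.apache.iceberg.spark.source.SparkTable" &&
    (try {
      val inner = table.getClass.getMethod("table").invoke(table)
      Class.forName("org.apache.iceberg.BaseMetadataTable").isInstance(inner)
    } catch {
      case _: ReflectiveOperationException => false
    })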

yaooqinn commented 1 year ago

For querying purposes, these derived tables of a source table could be treated as part of the table itself, just like its columns.

Yes, this happens when you query the raw table, just like the role that metadata plays when you query a Hive table, or the indexes, snapshots, etc. that other databases may have.

MLikeWater commented 1 year ago

Personally, for the Iceberg and Hudi storage formats, permission checks on table metadata should be simplified: the permission for a table's metadata tables should depend on the permission of the table itself. If a user has access to the table, they should have access to its metadata. In addition, Ranger does not support the metadata tables of these data lake storage technologies.

pan3793 commented 11 months ago

What's the behavior of Trino/Snowflake (or other popular products)?

liaoyt commented 6 months ago

Personally, for the Iceberg and Hudi storage formats, permission checks on table metadata should be simplified: the permission for a table's metadata tables should depend on the permission of the table itself. If a user has access to the table, they should have access to its metadata. In addition, Ranger does not support the metadata tables of these data lake storage technologies.

Agreed, and we are facing this issue too. Maybe we can set up a configuration to decide whether to map the metadata tables' permission check to the data table, i.e. introduce this as a feature instead of fixing a bug. cc @yaooqinn @pan3793 @bowenliang123
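As a rough illustration of that proposal (the config key and default below are hypothetical, not an existing Kyuubi setting):

import org.apache.spark.sql.SparkSession

// Hypothetical switch: when enabled, authorize an Iceberg metadata table
// by checking the privilege on its data table instead.
val mapMetadataTablesKey = "spark.kyuubi.authz.iceberg.mapMetadataTableToDataTable"
val metadataTableNames = Set("history", "snapshots", "files", "manifests", "partitions")

def identifierForAuthz(spark: SparkSession, parts: Seq[String]): Seq[String] = {
  val enabled = spark.conf.get(mapMetadataTablesKey, "false").toBoolean
  if (enabled && parts.size >= 3 && metadataTableNames.contains(parts.last.toLowerCase))
    parts.dropRight(1) // check the data table, e.g. testdb.iceberg_tbl
  else
    parts
}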