apache / kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
https://kyuubi.apache.org/
Apache License 2.0

[Umbrella] RangerSparkExtension support {OWNER} variable defined in Ranger Policy #3607

Closed zhouyifan279 closed 1 year ago

zhouyifan279 commented 2 years ago

Describe the proposal

Currently, if a user does not have insert permission on all tables in a database, an AccessControlException is thrown when the user inserts into a newly created table.

RangerBasePlugin uses the {OWNER} variable to remove this limitation:

On the Ranger Admin side, set the {OWNER} variable in the Users field of the Ranger policy.

On the Ranger Plugin side, because of the above policy, the current user gets the specified permissions on any table they create (own). Ranger Plugin handles the {OWNER} variable in org.apache.ranger.plugin.policyevaluator.RangerDefaultPolicyItemEvaluator#matchUserGroupAndOwner.
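To illustrate the idea, here is a minimal sketch of how an {OWNER} entry in a policy's Users field could be matched against the owner of the accessed resource. It is not Ranger's actual code: the class `OwnerMacroSketch` and the method `matchesUserOrOwner` are hypothetical names, and group matching is left out for brevity.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified stand-in for a policy item evaluator; not the real Ranger class.
class OwnerMacroSketch {
    // The macro as it is written in the policy's Users field.
    static final String OWNER_MACRO = "{OWNER}";

    /**
     * Returns true when the policy item applies to the requesting user, either
     * because the user is listed by name, or because the item lists {OWNER}
     * and the user owns the resource being accessed.
     */
    static boolean matchesUserOrOwner(Set<String> policyUsers,
                                      String requestUser,
                                      String resourceOwner) {
        if (policyUsers.contains(requestUser)) {
            return true;
        }
        // Without a known owner, the {OWNER} entry can never match.
        return policyUsers.contains(OWNER_MACRO)
                && resourceOwner != null
                && resourceOwner.equals(requestUser);
    }

    public static void main(String[] args) {
        Set<String> users = new HashSet<>(Arrays.asList(OWNER_MACRO));
        // alice owns the table, bob does not.
        System.out.println(matchesUserOrOwner(users, "alice", "alice"));
        System.out.println(matchesUserOrOwner(users, "bob", "alice"));
    }
}
```

The sketch also shows why the plugin has to pass the resource owner along with the access request: if the owner is missing, a policy that only lists {OWNER} grants nothing.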

Some work needs to be done to support this feature in Kyuubi RangerSparkExtension.

bowenliang123 commented 2 years ago

This would be helpful for applying {OWNER} to policies.

But there are two problems to consider:

  1. Does it look up the owner of every table, even those with no {OWNER} rules on them? That would cost significant CPU and round-trip time (RTT) to fetch the information, and the additional cache would increase the memory footprint.
  2. What is the proper caching and eviction strategy for table owners? A TTL or a maximum cache size raises concerns about cache misses: evicted entries would have to be fetched again, which adds to the load described in 1.

zhouyifan279 commented 2 years ago

@bowenliang123, thanks for your comment. Here are my thoughts on your questions:

  1. Does it look up the owner of every table, even those with no {OWNER} rules on them? That would cost significant CPU and round-trip time (RTT) to fetch the information, and the additional cache would increase the memory footprint.

For SQL queries (SELECT and DML), we can always get the table owner from CatalogTable#owner or org.apache.spark.sql.connector.catalog.Table#properties.get("owner"), so no extra fetch is introduced. For most SQL commands (DDL), table metadata is not fetched during SQL compilation, so we need to fetch it ourselves. In most cases only one table's metadata is fetched, so there should not be much CPU/RTT overhead.
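A rough sketch of that lookup order, assuming the owner is taken from the plan when it is already there and fetched from the catalog only otherwise. The class `TableOwnerSketch` and its methods are hypothetical, and plain strings and maps stand in for CatalogTable and the V2 Table properties so that the example stays self-contained.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper; the inputs are simplified stand-ins, not Spark classes.
class TableOwnerSketch {
    /** V1 path: the owner is a plain field on the already-resolved table metadata. */
    static Optional<String> ownerFromV1(String catalogTableOwner) {
        return Optional.ofNullable(catalogTableOwner).filter(o -> !o.isEmpty());
    }

    /** V2 path: the owner is read from the "owner" entry of the table properties. */
    static Optional<String> ownerFromV2(Map<String, String> tableProperties) {
        return Optional.ofNullable(tableProperties.get("owner")).filter(o -> !o.isEmpty());
    }

    /**
     * Uses the owner carried by the plan when present (queries); otherwise
     * calls the supplied catalog lookup exactly once (the DDL case).
     */
    static Optional<String> resolveOwner(Optional<String> ownerInPlan,
                                         Supplier<Optional<String>> fetchFromCatalog) {
        if (ownerInPlan.isPresent()) {
            return ownerInPlan;
        }
        return fetchFromCatalog.get();
    }
}
```

Because the catalog lookup is passed as a `Supplier`, it is never invoked for queries whose plan already carries the owner.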

2. What is the proper caching and eviction strategy for table owners? A TTL or a maximum cache size raises concerns about cache misses: evicted entries would have to be fetched again, which adds to the load described in 1.

Originally, I intended to cache table metadata because I wanted to fetch the table metadata of DataSourceV2Relation. After deeper investigation, I found that DataSourceV2Relation already carries the table metadata in its table field, so a cache is no longer needed.

bowenliang123 commented 2 years ago

Thanks for the investigation and explanation. Since the owner name is carried in CatalogTable for V1 and in the table field of DataSourceV2Relation for V2, it's fine to use them without any extra fetch. And since we don't have to cache them at runtime for now, that's good enough for a concrete implementation. @zhouyifan279