[Umbrella] RangerSparkExtension support {OWNER} variable defined in Ranger Policy

zhouyifan279 commented 2 years ago

Code of Conduct

[X] I agree to follow this project's Code of Conduct

Search before asking

[X] I have searched in the issues and found no similar issues.

Describe the proposal

Currently, if a user does not have insert permission on all tables in a database, an AccessControlException is thrown when the user inserts into a newly created table.

RangerBasePlugin uses the {OWNER} variable to remove this limitation:

On the Ranger Admin side, set the {OWNER} variable in the Users field of the Ranger policy.

On the Ranger plugin side, this policy grants the current user the specified permissions on any table they create (own). The Ranger plugin handles the {OWNER} variable in org.apache.ranger.plugin.policyevaluator.RangerDefaultPolicyItemEvaluator#matchUserGroupAndOwner.
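The matching rule described above can be sketched roughly as follows. This is a minimal illustration, not Ranger's actual code: the class `OwnerMacroSketch`, the method `matchesUser`, and its parameters are hypothetical names, and group and role matching are left out.

```java
import java.util.Set;

// Hypothetical sketch of owner-aware user matching; not Ranger's real classes.
final class OwnerMacroSketch {
    // The variable placed in the policy's Users field.
    static final String OWNER_MACRO = "{OWNER}";

    /**
     * Returns true when a policy item applies to the requesting user, either
     * because the user is listed explicitly, or because the item lists {OWNER}
     * and the requesting user owns the resource.
     */
    static boolean matchesUser(Set<String> policyUsers, String requestUser, String resourceOwner) {
        if (requestUser == null) {
            return false; // no user on the request, nothing can match
        }
        if (policyUsers.contains(requestUser)) {
            return true; // explicit user match
        }
        // Owner match: only possible when the owner of the resource is known.
        return policyUsers.contains(OWNER_MACRO) && requestUser.equals(resourceOwner);
    }

    public static void main(String[] args) {
        Set<String> users = Set.of(OWNER_MACRO);
        System.out.println(matchesUser(users, "alice", "alice")); // true
        System.out.println(matchesUser(users, "bob", "alice"));   // false
        System.out.println(matchesUser(users, "alice", null));    // false
    }
}
```

The last call shows why the plugin has to supply the table owner with each access request: without it, a policy that relies on {OWNER} can never match.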

Some work needs to be done to support this feature in the Kyuubi RangerSparkExtension.

Task list

[x] #3608
[ ] ~~#Get table owner by TableIdentifier~~
[x] #3666
[x] #3672
[ ] ~~#Cache table info to reduce Catalog method invocations~~
[x] #3675

Are you willing to submit PR?

[X] Yes. I can submit a PR independently to improve.
[ ] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
[ ] No. I cannot submit a PR at this time.

bowenliang123 commented 2 years ago

This would be helpful for applying {OWNER} to policies.

But there are two problems to consider:

1. Does it look up the owner for all tables, even those with no {OWNER} rules on them? This would cost significant CPU/RTT time to fetch the information, and the additional cache would add to the memory footprint.
2. What is the proper caching and eviction strategy for table owners? A TTL or a maximum cache size raises concerns about cache misses: missed entries would be fetched again, which adds to the load described in 1.

zhouyifan279 commented 2 years ago

@bowenliang123 , thanks for your comment. Here are my thoughts about your questions:

1. Does it look up the owner for all tables, even those with no {OWNER} rules on them? This would cost significant CPU/RTT time to fetch the information, and the additional cache would add to the memory footprint.

For SQL queries (SELECT and DML), we can always get the table owner from CatalogTable#owner or org.apache.spark.sql.connector.catalog.Table#properties.get("owner"), so no extra fetch is introduced. For most SQL commands (DDL), table metadata is not fetched during SQL compilation, so we need to fetch it. In most cases only one table's metadata is fetched, so there should not be much CPU/RTT overhead.
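As a rough sketch of the two lookups mentioned above, the owner can be read from metadata that is already in hand. The types `V1Table` and `V2Table` below are hypothetical stand-ins for Spark's CatalogTable and org.apache.spark.sql.connector.catalog.Table, so that the example runs without Spark on the classpath; `ownerOf` is likewise an illustrative name, not a Kyuubi method.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-ins for Spark's table metadata types; not the real API.
final class TableOwnerSketch {
    // Models a V1 table, which exposes the owner directly (like CatalogTable#owner).
    static final class V1Table {
        final String owner;
        V1Table(String owner) { this.owner = owner; }
    }

    // Models a V2 table, whose owner sits under the "owner" key of its properties.
    static final class V2Table {
        final Map<String, String> properties;
        V2Table(Map<String, String> properties) { this.properties = properties; }
    }

    /** Reads the owner from metadata already attached to the plan; no catalog call. */
    static Optional<String> ownerOf(Object table) {
        String owner = null;
        if (table instanceof V1Table) {
            owner = ((V1Table) table).owner;
        } else if (table instanceof V2Table) {
            owner = ((V2Table) table).properties.get("owner");
        }
        // Treat a missing or empty owner as unknown.
        return Optional.ofNullable(owner).filter(o -> !o.isEmpty());
    }

    public static void main(String[] args) {
        System.out.println(ownerOf(new V1Table("alice")));                  // Optional[alice]
        System.out.println(ownerOf(new V2Table(Map.of("owner", "bob"))));   // Optional[bob]
        System.out.println(ownerOf(new V2Table(Map.of())));                 // Optional.empty
    }
}
```

Returning an empty Optional for an unknown owner keeps the caller from treating a blank string as a real user name.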

2. What is the proper caching and eviction strategy for table owners? A TTL or a maximum cache size raises concerns about cache misses: missed entries would be fetched again, which adds to the load described in 1.

Originally, I intended to cache table metadata because I wanted to fetch the table metadata of DataSourceV2Relation. After deeper investigation, I found that DataSourceV2Relation carries table metadata in its table field, so a cache is no longer needed.

bowenliang123 commented 2 years ago

Thanks for the investigation and explanation. Since the owner name is carried in CatalogTable in V1 and in the table field of DataSourceV2Relation in V2, it's alright to use them without any extra fetching. And since we don't have to cache them at runtime for now, that's good enough for a concrete implementation. @zhouyifan279
