apache / amoro

Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.
https://amoro.apache.org/
Apache License 2.0

[Improvement][Trino]: Trino support hadoop proxy user #928

Closed shidayang closed 1 year ago

shidayang commented 1 year ago


What would you like to be improved?

The user originally had a permission system based on Ranger. The Kerberos user currently configured for Trino is a superuser with all permissions, which invalidates that permission system. We need to support using the superuser to proxy Trino users when accessing HDFS, so that the original permission system can still be applied.

How should we improve?

No response


zhoujinsong commented 1 year ago

@shidayang Can you describe in more detail what goal would you like to achieve? I cannot understand this target:

We need to support using the superuser to proxy Trino users when accessing HDFS, so that the original permission system can still be applied

xieyi888 commented 1 year ago

The catalog user in AMS is usually an HDFS superuser, with full authority over the HDFS paths corresponding to Arctic tables. Without this feature, Trino queries from all users access HDFS as this superuser and succeed, which makes Ranger-HDFS policies ineffective for Trino users.

xieyi888 commented 1 year ago

For example, the Arctic catalog catalog_A has a Kerberos principal configured in AMS, and the Arctic table catalog_A.DBA.test_tbl has the corresponding HDFS path /user/userA/hive_db/DBA.db/test_tbl/, which is not accessible to userB under Ranger-HDFS.

Currently, when userB queries the Arctic table in Trino, the query accesses HDFS as userA and succeeds.

With this feature, userA proxies userB when accessing HDFS, and Ranger-HDFS intercepts the query from userB as below (da_market is userB, analysis_test is userA):

Caused by: org.apache.hadoop.ipc.RemoteException: Permission denied: user=da_market, access=EXECUTE, inode="/user/analysis_test/hive_db/****.db/****"
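For this proxying to be allowed at all, the HDFS NameNode must list the superuser as a proxy user. A minimal core-site.xml sketch, assuming analysis_test is the catalog superuser (the wildcard values are illustrative and should be restricted in production):

```xml
<!-- core-site.xml on the NameNode: allow analysis_test to impersonate others. -->
<property>
  <name>hadoop.proxyuser.analysis_test.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.analysis_test.users</name>
  <value>*</value>
</property>
```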
zhoujinsong commented 1 year ago

As far as I can see, you want to be able to use the user's account rather than the one in the catalog configuration when doing permission verification in Trino.

Both Spark and Flink have the same requirement. The difference is that they usually use the user they configured when launching.

However, in an MPP system such as Trino, each query may use a different user account. So proxy users may be more appropriate.

Do I understand you correctly?
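The proxy-user approach described above can be sketched with Hadoop's UserGroupInformation API. This is a sketch only, not Amoro's implementation; the principal, keytab path, and listed path are illustrative, and the NameNode must be configured to let the superuser impersonate others:

```java
import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

// Sketch of the Hadoop proxy-user pattern: the catalog superuser logs in
// with Kerberos once, then impersonates the Trino session user per query.
public final class ProxyUserSketch {
    static FileStatus[] listAsQueryUser(String queryUser) throws Exception {
        // The catalog's Kerberos superuser authenticates with its keytab
        // (principal and keytab path are illustrative values).
        UserGroupInformation superUgi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
                "analysis_test@EXAMPLE.COM", "/etc/security/analysis_test.keytab");
        // The superuser then proxies the Trino session user.
        UserGroupInformation proxyUgi = UserGroupInformation.createProxyUser(queryUser, superUgi);
        // HDFS (and Ranger) now authorize queryUser, not the superuser.
        return proxyUgi.doAs((PrivilegedExceptionAction<FileStatus[]>) () -> {
            FileSystem fs = FileSystem.get(new Configuration());
            return fs.listStatus(new Path("/user/analysis_test/hive_db"));
        });
    }
}
```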

zhoujinsong commented 1 year ago

In addition, Flink and Spark do this by passing the user they wish to use into the catalog properties when creating the catalog. I wonder whether Trino reloads the catalog or tables when executing SQL for different users. Can we pass the certified user through catalog properties too?

xieyi888 commented 1 year ago

As far as I can see, you want to be able to use the user's account rather than the one in the catalog configuration when doing permission verification in Trino.

Both Spark and Flink have the same requirement. The difference is that they usually use the user they configured when launching.

However, in an MPP system such as Trino, each query may use a different user account. So proxy users may be more appropriate.

Do I understand you correctly?

Yes, each Trino query may come from a different user. We want the configured superuser to proxy these Trino users, so that HDFS can perform permission validation for each user under the existing Ranger system.
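The point above reduces to a tiny model (illustrative only, not Amoro's actual code): with proxying off, every query reaches HDFS as the catalog superuser; with it on, the session user is what Ranger evaluates.

```java
// Illustrative model only: which identity HDFS/Ranger evaluates
// for a Trino query against an Arctic table.
public final class EffectiveHdfsUser {
    // catalogUser: the superuser from the catalog's Kerberos configuration.
    // sessionUser: the user who submitted the Trino query.
    static String effectiveUser(String catalogUser, String sessionUser, boolean proxyEnabled) {
        return proxyEnabled ? sessionUser : catalogUser;
    }

    public static void main(String[] args) {
        // Without proxying, Ranger sees only the superuser for every query.
        System.out.println(effectiveUser("analysis_test", "da_market", false));
        // With proxying, Ranger sees the real end user and can enforce its policies.
        System.out.println(effectiveUser("analysis_test", "da_market", true));
    }
}
```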

shidayang commented 1 year ago

In addition, Flink and Spark do this by passing the user they wish to use into the catalog properties when creating the catalog. I wonder whether Trino reloads the catalog or tables when executing SQL for different users. Can we pass the certified user through catalog properties too?

Which user to use is determined by Trino's account system.