apache/polaris

Apache Polaris, the interoperable, open source catalog for Apache Iceberg
https://polaris.apache.org/

Fix register_table to properly initialize FileIO and refactor overall FileIO initialization to better reveal bugs #208

Closed. dennishuo closed this 2 months ago.

dennishuo commented 2 months ago

Description

The previous default initialization of a catalogFileIO that doesn't take any kind of subscoping into consideration is fundamentally a bug if it ever kicks in for real isolated-server scenarios, where "application defaults" aren't appropriate for any storage interactions. Some unit-test scenarios relied on this "default" initialization, but that masked bugs: methods like registerTable really do need to go through the full subscoping flow, yet they silently "succeeded" in test cases only because application-default credentials leaked into the test environment.

The refactoring of FileIO initialization in BasePolarisCatalog now makes it explicit when we expect to use defaults and when we require going through the subscoping flow.
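
To make the intended contract concrete, here's a minimal, hypothetical sketch of the decision the refactoring makes explicit. This is not the actual BasePolarisCatalog code: the class name, the CredentialVendor interface, and fields like skipSubscoping are made-up placeholders for illustration; only FileIO, CatalogUtil, and TableIdentifier are real Iceberg types.

    // Hypothetical sketch only -- not the actual BasePolarisCatalog code.
    import java.util.Map;
    import java.util.Set;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.CatalogUtil;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.io.FileIO;

    class FileIoSelectionSketch {

      // Placeholder for the subscoped-credential flow (STS tokens, SAS tokens, downscoped creds, ...).
      interface CredentialVendor {
        Map<String, String> getSubscopedCredentials(TableIdentifier identifier, Set<String> locations);
      }

      private final boolean skipSubscoping;   // mirrors SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION
      private final String ioImplClassName;   // e.g. "org.apache.iceberg.io.ResolvingFileIO"
      private final Configuration hadoopConf;
      private final CredentialVendor credentialVendor;

      FileIoSelectionSketch(boolean skipSubscoping, String ioImplClassName,
                            Configuration hadoopConf, CredentialVendor credentialVendor) {
        this.skipSubscoping = skipSubscoping;
        this.ioImplClassName = ioImplClassName;
        this.hadoopConf = hadoopConf;
        this.credentialVendor = credentialVendor;
      }

      FileIO loadFileIO(TableIdentifier identifier, Set<String> locations) {
        if (skipSubscoping) {
          // Server explicitly allows application-default credentials / table-config inheritance.
          return CatalogUtil.loadFileIO(ioImplClassName, Map.of(), hadoopConf);
        }
        // Otherwise the subscoping flow is mandatory; if it can't produce credentials, the call
        // fails loudly instead of silently falling back to application defaults (the old bug).
        Map<String, String> subscoped = credentialVendor.getSubscopedCredentials(identifier, locations);
        return CatalogUtil.loadFileIO(ioImplClassName, subscoped, hadoopConf);
      }
    }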

Conveniently, this also provides a clean way to configure a server to skip all subscoping flows entirely and always let application defaults and/or table-config inheritance define the credentials used to interact with storage (e.g., it provides a way to achieve https://github.com/apache/polaris/pull/77/files just by setting SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION=true).
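
For example, assuming the flag is wired through the server's feature configuration in polaris-server.yml like the existing feature flags (a sketch, not a verbatim excerpt of the shipped config):

    featureConfiguration:
      # Hypothetical placement: skip credential subscoping entirely and rely on
      # application-default credentials / table-config inheritance for storage access.
      SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION: true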

Overview:

Fixes https://github.com/apache/polaris/issues/156 and https://github.com/apache/polaris/pull/77

dennishuo commented 2 months ago

@cgpoh, if you have time, could you please verify whether SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION: true satisfies your use case as an alternative to https://github.com/apache/polaris/pull/77?

@collado-mike, you might be interested in validating whether my changes to dropTable(purge) make sense; as far as I can tell, since TaskFileIOSupplier is what ultimately provides the purge credentials, the inline credential creation was extraneous.

cgpoh commented 2 months ago

@dennishuo, thanks! I will try to verify once it's merged.

cgpoh commented 2 months ago

Hi @dennishuo, I just tested, and the fix doesn't satisfy my use case. For my use case, the Spark job should not have to provide Azure credentials as environment variables in order to execute properly. I don't want our Azure credentials to be exposed through the Spark job, and hence I prefer the credential vending mode, where the Polaris Catalog is the only place that knows about the Azure credentials and we can apply an RBAC policy to the Spark job via spark.sql.catalog.catalog_name.credential.
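
(For reference, the client side of that credential vending mode is typically configured along these lines; the catalog name, URI, warehouse, and client credentials below are placeholders, and the access-delegation header assumes the standard Iceberg REST mechanism.)

    # Spark SQL catalog pointing at a Polaris REST catalog, with storage credentials vended by Polaris
    spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.polaris.type=rest
    spark.sql.catalog.polaris.uri=http://localhost:8181/api/catalog
    spark.sql.catalog.polaris.warehouse=my_catalog
    spark.sql.catalog.polaris.credential=<client-id>:<client-secret>
    # Ask the catalog to vend storage credentials instead of configuring them in Spark
    spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=vended-credentials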

dennishuo commented 2 months ago

@cgpoh Thanks for the extra info! To confirm: you still want Polaris to provide Spark with the ability to access Azure storage, but you can't use a SAS_TOKEN, right? Did you want Polaris to return something like a "shared key" to Spark instead? Were you also going to add this proposed new credential-vending approach to https://github.com/apache/polaris/pull/77?

cgpoh commented 2 months ago

@dennishuo, yes. My company disables the permission for us to use SAS_TOKEN, and thus we need to fall back to using environment variables for ADLS. Sorry, what is the proposed new credential-vending approach?