apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Implement Human OAuth2 Flows for OAuth2Manager #10677

Open c-thiel opened 1 month ago

c-thiel commented 1 month ago

Feature Request / Improvement

There have been several very valuable discussions on AuthN on the Iceberg mailing list, initiated by the Nessie team, resulting in https://github.com/apache/iceberg/pull/10603 and one of its follow-ups, https://github.com/apache/iceberg/pull/10621 . Once the refactoring in https://github.com/apache/iceberg/pull/10621 is complete, it would be great to extend the OAuth2Manager with additional flows, especially flows for human users.

Currently, on the client side, only a generic Bearer token and the client-credential flow are supported. While the client-credential flow is great for unsupervised (scheduled) tasks, it is unsuitable for identifying human users.

Iceberg clients lack a widely supported authentication flow for human users. The only option today is to run some flow outside of the Iceberg / Spark runtime, obtain a token, and pass it "as-is" to the client. This is a problem because the token won't be refreshed and there is no way to pass a refreshed token to the client during processing. This effectively limits the processing length to the lifetime of a token, which is typically a few hours.

To circumvent this limitation, Catalog implementations are becoming IdPs themselves, issuing local API keys (tokens) with long lifetimes. This is a very bad trend and a serious security issue, as personal tokens in Catalogs might outlive their users in the IdP.

There are two widely supported flows that are suitable for human users: the Authorization Code Flow and the Device Code Flow. By nature, both are interactive.

I see two possible levels of integration:

  1. The Authorization / Device Code Flow happens outside of Iceberg. The access and refresh tokens are passed to Iceberg, where a new session is initiated from them and the token is refreshed internally. (The refresh logic is identical to the client-credential flow and thus already implemented.)
  2. The flow is implemented natively, which comes with the new challenge of user interaction by Iceberg - i.e. logging the device code to stdout or opening a browser.
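
To make option 1 more concrete, here is a minimal sketch of a session that is seeded with an externally obtained access and refresh token and then refreshes itself using the standard `refresh_token` grant (RFC 6749, section 6). This is not the OAuth2Manager API: `TokenSession`, `TokenPost`, the 60 second safety margin, and the 3600 second fallback lifetime are all hypothetical and only for illustration. The HTTP call is injected as a function so the sketch does not assume any particular client library.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical transport: receives the form fields of a token request and
# returns the parsed JSON body of the token endpoint's response.
TokenPost = Callable[[Dict[str, str]], Dict[str, object]]


@dataclass
class TokenSession:
    """Illustrative session seeded with tokens obtained outside of Iceberg."""

    access_token: str
    refresh_token: str
    expires_at: float  # absolute expiry, in seconds on the caller's clock

    def needs_refresh(self, now: float, margin: float = 60.0) -> bool:
        # Refresh slightly before expiry; the margin is an assumption.
        return now >= self.expires_at - margin

    def refresh(self, post: TokenPost, client_id: str, now: float) -> None:
        # Standard refresh_token grant (RFC 6749, section 6).
        resp = post(
            {
                "grant_type": "refresh_token",
                "refresh_token": self.refresh_token,
                "client_id": client_id,
            }
        )
        self.access_token = str(resp["access_token"])
        # The server may rotate the refresh token; keep the old one otherwise.
        self.refresh_token = str(resp.get("refresh_token", self.refresh_token))
        # expires_in is optional in the response; 3600 is an assumed fallback.
        self.expires_at = now + float(resp.get("expires_in", 3600))
```

A client would check `needs_refresh(time.time())` before each request and call `refresh(...)` when it returns true, so a long-running job is no longer bound to the lifetime of the first access token.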

As I am not a Java expert, I would leave the concrete decision on whether 2) is feasible in a reasonable amount of time to someone more qualified than me. Once we have reached a consensus here, I am happy to implement an identical solution for pyiceberg & iceberg-rust.

Query engine

None

snazy commented 1 month ago

Noting that @adutra and @jackye1995 are working on it.

adutra commented 1 month ago

I even have a prototype working. I am only waiting on the decisions around design and API. See #10621.

dimas-b commented 1 month ago

Interactive console-based sessions are certainly possible in shell applications (e.g. Spark shells). The Nessie API client (not Iceberg REST) supports that and it seems to work well.

The Device Flow is also usable in "headless" environments as long as the user is able to get the device code from the Iceberg session logs or STDOUT. The Browser can run on the user's local host in that case.
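
For illustration, a rough sketch of what such a headless Device Flow could look like, following RFC 8628 (device authorization request, then polling the token endpoint). This is not the Nessie or Iceberg implementation: the function name `device_flow`, its parameters, and the injected `post`, `sleep` and `notify` callables are hypothetical, and the endpoint URLs are whatever the IdP exposes.

```python
import time
from typing import Callable, Dict

# Hypothetical transport: POSTs form fields to a URL and returns parsed JSON.
Post = Callable[[str, Dict[str, str]], Dict[str, object]]

DEVICE_GRANT = "urn:ietf:params:oauth:grant-type:device_code"


def device_flow(
    post: Post,
    device_endpoint: str,
    token_endpoint: str,
    client_id: str,
    scope: str = "",
    sleep: Callable[[float], None] = time.sleep,
    notify: Callable[[str], None] = print,
) -> Dict[str, object]:
    """Run an RFC 8628 device flow and return the token response."""
    form = {"client_id": client_id}
    if scope:
        form["scope"] = scope
    auth = post(device_endpoint, form)

    # In a headless session this line ends up on stdout or in the logs;
    # the user opens the URL in a browser on their own machine.
    notify(f"Open {auth['verification_uri']} and enter code {auth['user_code']}")

    interval = float(auth.get("interval", 5))  # RFC 8628 default is 5 seconds
    expires_in = float(auth["expires_in"])
    waited = 0.0
    while waited < expires_in:
        sleep(interval)
        waited += interval
        resp = post(
            token_endpoint,
            {
                "grant_type": DEVICE_GRANT,
                "device_code": str(auth["device_code"]),
                "client_id": client_id,
            },
        )
        error = resp.get("error")
        if error is None:
            return resp  # contains access_token and usually refresh_token
        if error == "slow_down":
            interval += 5  # RFC 8628: back off by 5 seconds
        elif error != "authorization_pending":
            raise RuntimeError(f"device flow failed: {error}")
    raise TimeoutError("device code expired before the user approved it")
```

Because only `notify` touches the user, the same loop would work in a Spark shell (printing to the console) and in a headless job (writing to the session log), which is the property described above.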