apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] GCS FileSystem - support federated identity #34595

Closed martin-traverse closed 1 year ago

martin-traverse commented 1 year ago

Describe the enhancement requested

Hello,

It seems that federated / external identities are not supported in the Arrow GcsFileSystem implementation. It would be great to support them. I'm using federated identity in CI, with GitHub as the IdP, as per the instructions, and it works great for the Google CLI tools.

I'm not sure how this is implemented. Is there a standard library from Google that can just be linked / updated? The federated auth mechanism creates a regular Google application credentials file, with the "type" set to "external_account". Is there an easy(ish) way to bring this in, if the Google libs already handle the different standard credential types? Or is that wishful thinking on my part?
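To make the credential-type point concrete, here is a minimal sketch that reads a credentials JSON file and reports its "type" field. The helper names `credential_type` and `is_federated` are hypothetical and are not part of Arrow or any Google library. The only detail taken from this report is that federated credentials carry the type "external_account".

```python
import json
from pathlib import Path


def credential_type(path):
    """Return the "type" field of a Google credentials JSON file, or None if absent."""
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    return data.get("type")


def is_federated(path):
    """True when the file describes a federated (external account) identity."""
    # "external_account" is the value quoted in the error message in this report.
    return credential_type(path) == "external_account"
```

Running a check like this against the file named by the GOOGLE_APPLICATION_CREDENTIALS environment variable, before opening the filesystem, would show whether the failure reported here is likely to occur.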

I'm interested in the Python component, but I guess this would also come to the other languages that use the same underlying code if it were added.

Here is the stack trace from Arrow:

  File "pyarrow/_fs.pyx", line 571, in pyarrow._fs.FileSystem.get_file_info
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: google::cloud::Status(INVALID_ARGUMENT: Permanent error GetObjectMetadata: Could not create a OAuth2 access token to authenticate the request. The request was not sent, as such an access token is required to complete the request successfully. Learn more about Google Cloud authentication at https://cloud.google.com/docs/authentication. The underlying error message was: Unsupported credential type (external_account) when reading Application Default Credentials file from [/path/to/credentials.json].). Detail: [errno 22] Invalid argument

Component(s)

Python

westonpace commented 1 year ago

CC @coryan perhaps?

coryan commented 1 year ago

Yup, I am probably the right person to ask about this. Workload identity federation is supported starting with v2.6.0:

https://github.com/googleapis/google-cloud-cpp/releases/tag/v2.6.0

We actually use the integration with GitHub Actions in our own testing:

https://github.com/googleapis/google-cloud-cpp/actions/workflows/external-account-integration.yml

I am not sure how one goes about updating pyarrow to require google-cloud-cpp >= v2.6.0, I assume it depends on how it is getting installed?

HTH

westonpace commented 1 year ago

> I am not sure how one goes about updating pyarrow to require google-cloud-cpp >= v2.6.0, I assume it depends on how it is getting installed?

In theory we just need to update this line to cover most users (e.g. pyarrow): https://github.com/apache/arrow/blob/f10f5cfd1376fb0e602334588b3f3624d41dee7d/cpp/thirdparty/versions.txt#L52

In practice things typically aren't that simple (e.g. breaking changes or changes in the way the gcs library is built).
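As a rough illustration of the version requirement, the sketch below parses a `NAME=vX.Y.Z` line out of text shaped like versions.txt and compares it against v2.6.0, the release named above as the first with workload identity federation. The variable name `ARROW_GOOGLE_CLOUD_CPP_BUILD_VERSION`, the line format, and all helper names are assumptions made for illustration and may not match the actual file.

```python
import re

# First google-cloud-cpp release with workload identity federation, per this thread.
MIN_FEDERATION_VERSION = (2, 6, 0)

# Assumed variable name; check versions.txt for the actual spelling.
ASSUMED_VARIABLE = "ARROW_GOOGLE_CLOUD_CPP_BUILD_VERSION"


def parse_version(text):
    """Turn "v2.6.0" or "2.6.0" into a tuple of ints such as (2, 6, 0)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def pinned_version(versions_txt, variable=ASSUMED_VARIABLE):
    """Find `variable=<version>` in the given text; return None if it is missing."""
    for line in versions_txt.splitlines():
        name, separator, value = line.partition("=")
        if separator and name.strip() == variable:
            return parse_version(value)
    return None


def supports_federation(version):
    """True when the version tuple is at least v2.6.0."""
    return version >= MIN_FEDERATION_VERSION
```

Tuples of integers are compared element by element, which avoids the trap of comparing version strings lexically (where "10" sorts before "9").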

coryan commented 1 year ago

I can give this a try later this month. FWIW, the only backwards-incompatible change between 1.42.0 and 2.x is the requirement for C++14. Since Arrow already requires C++17, that should be trivial. I do not recall any changes to the build requirements, but that is always a risk.

I assume there is no way for @martin-traverse to upgrade before then?

martin-traverse commented 1 year ago

This is great, thanks so much for the quick turn-around! I look forward to simplifying our workflows when Arrow 12 is released :-)