apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.16k stars 3.45k forks source link

[C++] [Python] Add functionality of `STSProfileCredentialsProvider` to default credentials chain for `S3FileSystem` #41794

Open fjetter opened 3 months ago

fjetter commented 3 months ago

Describe the enhancement requested

Given a typical AWS credentials setup that defines IAM roles like the following

# ~/.aws/config
[default]
region=us-east-2
role_arn=arn:aws:iam::123456789012:role/RoleName
source_profile=default

# ~/.aws/credentials
[default]
aws_access_key_id=XXXXXXXXXXX
aws_secret_access_key=YYYYYYYYYYYY

almost all AWS sdks are interpreting this correctly as an assume-role method that generates a temporary STS token pair.

For example, using python this looks like

import boto3
b3sess = boto3.Session()
creds = b3sess.get_credentials()
{
    "method": creds.method,
    "secret": creds.secret_key[:5] + "...",
    "token": creds.token[:5] + "...",
}

{'method': 'assume-role', 'secret': 'jALbI...', 'token': 'IQoJb...'}

The C++ sdk is deviating from how the default credentials chain is implemented and is not supporting this kind of configuration but instead uses the plain access key + secret key pair that is found in the configuration which does not necessarily provide sufficient permissions.

Dask adopted the S3FileSystem as a more performant alternative to the existing default fsspec filesystem for its parquet reader but this lack of support in the C++ sdk is a bit of a nasty blocker for further adoption. We ended up writing a workaround for our benchmarking by using boto to read the credentials and initialize the S3FileSystem manually but this has a couple of flaws. For starters, this is pretty unergonomic and nontrivial but more importantly this prohibits the refresh of the token after expiration (max duration is 1hr)

There's been some discussion on the aws-sdk-cpp repo about this with a suggestion to implement an amended credentials chain, see here that includes the STSProfileCredentialsProvider but it's also pointed out that this is flawed as well.

Also related

I know this is ultimately a aws-sdk-cpp problem but end users of the arrow S3FileSystem do not have this transparency and expect things to "just work", particularly when consuming the python API and they are used from how boto and other libraries are parsing credentials.

cc @pitrou since you've been poking in this area recently

Component(s)

C++, Python

pitrou commented 3 months ago

I know this is ultimately a aws-sdk-cpp problem but end users of the arrow S3FileSystem do not have this transparency and expect things to "just work", particularly when consuming the python API and they are used from how boto and other libraries are parsing credentials.

Which concrete resolution would you recommend we apply in our S3FileSystem implementation?

fjetter commented 3 months ago

Which concrete resolution would you recommend we apply in our S3FileSystem implementation?

I have to admit that I haven't understood exactly what the flaws of the new toolchain is. I have a prototype locally and can confirm that the simple new toolchain as suggested here https://github.com/aws/aws-sdk-cpp/issues/150#issuecomment-538548438 is a little buggy and I couldn't get it to working (I got it compiling / running but the way they parse the config is somehow different / broken).

I also saw that in dynamodb-shell they implemented a fix for this and they are also describing how the simple chain wasn't sufficient but the PR is a little complex and I haven't had the time to look through it, see https://github.com/awslabs/dynamodb-shell/pull/13

pitrou commented 3 months ago

I will admit to understanding nothing about this, but we do expose role-related parameters in S3FileSystem (e.G. role_arn). Does that not work for this?

fjetter commented 3 months ago

If one provides a role_arn and a session_name it does work as expected because it is not using the default credentials provider chain and is in fact not even loading the aws config.

The issue is only ocuring if one is initializing the filesystem with its default constructor which I assume is the most frequently used way to use the filesystem. Many users will not want to remember or copy the ARN

Brainstorming possibilities to address this...