BTheunissen opened this issue 2 months ago
Thanks for raising this issue @BTheunissen
> `botocore_session` is helpful to make available as an override in order to support automatically refreshable credentials for long-running jobs. ... This allows the user to get past the 1-hour IAM role-chaining limit, which is very useful for reading extremely large tables.
The ability to refresh AWS credentials is important for long-running jobs. Let's open a ticket to track this feature.
See this comment for the reason to deprecate `botocore_session`.

I wonder if there's another way to implement automatically refreshable credentials without using `botocore_session`.
@kevinjqliu Fair enough: the reason for deprecation is that catalog settings are generally exposed as `Dict[str, str]`, and passing in a `botocore.Session` object breaks that convention.

I'd be fine with removing it once a ticket to track credential refresh is written up. I'd take a crack at implementing it, but honestly the workarounds I've had to do to support it, for both the Python boto clients and the underlying filesystem implementations, are pretty hacky. There are existing issues open on the same topic against the Arrow project, since AWS's guidance on properly supporting refreshable credentials is very spotty.
@BTheunissen +1, opened #1129 to track this feature. A hacky implementation is fine for now; this feature is generally nice to have for the project.
@BTheunissen, I'm in the same situation as you, trying to use Pyiceberg with automatically refreshable AWS credentials. Would you be able to share how you made this work with the current version of Pyiceberg? The glue catalog picks up the session correctly, but it doesn't use it for accessing S3.
> The glue catalog picks up the session correctly, but it doesn't use it for accessing S3.

You can either set Glue and S3 credentials separately, or use the unified AWS credential configs: https://py.iceberg.apache.org/configuration/#unified-aws-credentials
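For reference, the unified credential approach is a plain string-to-string properties mapping, so it fits the `Dict[str, str]` convention mentioned above. A minimal sketch (key names per the linked docs; the values here are placeholders):

```python
# Unified AWS credential properties apply to both the Glue catalog client
# and the S3 filesystem, so the credentials only need to be set once.
catalog_properties = {
    "type": "glue",
    "client.region": "us-east-1",
    "client.access-key-id": "AKIDEXAMPLE",
    "client.secret-access-key": "example-secret-key",
    "client.session-token": "example-session-token",
}

# Hypothetical usage (requires pyiceberg):
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("glue", **catalog_properties)
```

Note that static string properties like these do not by themselves solve the refresh problem this issue is about; they only unify where the credentials are configured.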
I'm setting `botocore_session` (which is now deprecated), but S3 doesn't use it. The OP mentions that he had to make some pretty hacky workarounds to make the filesystem implementations pick up the `botocore_session`; I am hoping those workarounds could be shared here.
@cshenrik Sorry about the lateness. I actually did a small internal fork of the library and added the following logic to `io/pyarrow`:
```python
def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem:
    if scheme in {"s3", "s3a", "s3n"}:
        from pyarrow.fs import S3FileSystem

        client_kwargs: Dict[str, Any] = {
            "endpoint_override": self.properties.get(S3_ENDPOINT),
            "access_key": get_first_property_value(self.properties, S3_ACCESS_KEY_ID, AWS_ACCESS_KEY_ID),
            "secret_key": get_first_property_value(self.properties, S3_SECRET_ACCESS_KEY, AWS_SECRET_ACCESS_KEY),
            "session_token": get_first_property_value(self.properties, S3_SESSION_TOKEN, AWS_SESSION_TOKEN),
            "region": get_first_property_value(self.properties, S3_REGION, AWS_REGION),
        }
        if proxy_uri := self.properties.get(S3_PROXY_URI):
            client_kwargs["proxy_options"] = proxy_uri
        if connect_timeout := self.properties.get(S3_CONNECT_TIMEOUT):
            client_kwargs["connect_timeout"] = float(connect_timeout)
        if role_arn := self.properties.get(AWS_ROLE_ARN):
            client_kwargs["role_arn"] = role_arn
        if session_name := self.properties.get(AWS_SESSION_NAME):
            client_kwargs["session_name"] = session_name

        return S3FileSystem(**client_kwargs)
```
Passing the `role_arn` and `session_name` lets the S3 filesystem automatically refresh the credentials of the AWS C++ client used by the PyArrow filesystem. Pretty tedious, but working so far!
Thanks for sharing that, @BTheunissen.
I have to call a bespoke webservice for retrieving AWS credentials, so I can't use that implementation directly, but it's still good to see what others did.
> Passing the `role_arn` and `session_name` to `pyarrow.fs.S3FileSystem` will let the S3 filesystem automatically refresh the credentials of the AWS C++ client used by the PyArrow filesystem, pretty tedious but working so far!
@BTheunissen do you know if passing the role_arn will automatically refresh S3 credentials for long running jobs?
The pyarrow doc just mentions:

> AWS Role ARN. If provided instead of access_key and secret_key, temporary credentials will be fetched by assuming this role.
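Per the fork described above, the role-assumption and refresh happen inside pyarrow's AWS C++ client when `role_arn` is supplied instead of static keys. A minimal sketch of constructing the filesystem directly (the ARN and session name are illustrative placeholders; `s3fs_kwargs` is a hypothetical helper, not a pyarrow API):

```python
from typing import Any, Dict


def s3fs_kwargs(role_arn: str, session_name: str, region: str) -> Dict[str, Any]:
    # Keyword arguments for pyarrow.fs.S3FileSystem: with role_arn set,
    # access_key/secret_key are omitted and the SDK assumes the role,
    # fetching temporary credentials on the caller's behalf.
    return {"role_arn": role_arn, "session_name": session_name, "region": region}


def make_s3_filesystem(role_arn: str, session_name: str, region: str):
    from pyarrow.fs import S3FileSystem  # deferred import; requires pyarrow

    return S3FileSystem(**s3fs_kwargs(role_arn, session_name, region))
```

Whether the C++ provider transparently re-assumes the role when the temporary credentials near expiry is exactly the question above; the earlier comment reports that it does in practice.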
Feature Request / Improvement
The AWS parameter `botocore_session` has been flagged as deprecated as of #922, and is due to be removed at Milestone 0.8. I'd like to request that this parameter not be deprecated, and I'd be happy to open a PR to bring the credential name in line with the rest of the updated client configuration.
`botocore_session` is helpful to expose as an override in order to support automatically refreshable credentials for long-running jobs. For example, in my project I have the following boto3 utility code:
Which can be used as follows:
This allows the user to get past the 1-hour IAM role-chaining limit, which is very useful for reading extremely large tables.
I'd also like to contribute some of this code upstream at some point to support refreshable botocore sessions in both the AWS Glue/DynamoDB clients, as well as the underlying S3 file system code.