apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

Do not deprecate Botocore Session in upcoming release (0.8) #1104

Open BTheunissen opened 2 months ago

BTheunissen commented 2 months ago

Feature Request / Improvement

The AWS parameter botocore_session was flagged as deprecated in #922 and is due to be removed in milestone 0.8.

I'd like to request that this parameter not be deprecated, and I'd be happy to open a PR to bring the credential name in line with the rest of the updated client configuration. Exposing botocore_session as an override is helpful for supporting automatically refreshing credentials in long-running jobs.

For example in my project I have the following boto3 utility code:

from boto3 import Session
from botocore.credentials import (
    AssumeRoleCredentialFetcher,
    Credentials,
    DeferredRefreshableCredentials,
)
from botocore.session import Session as BotoSession

def get_refreshable_botocore_session(
    source_credentials: Credentials | None,
    assume_role_arn: str,
    role_session_name: str | None = None,
) -> BotoSession:
    """Get a refreshable botocore session for assuming a role."""
    if source_credentials is not None:
        boto3_session = Session(
            aws_access_key_id=source_credentials.access_key,
            aws_secret_access_key=source_credentials.secret_key,
            aws_session_token=source_credentials.token,
        )
    else:
        boto3_session = Session()

    # Optional arguments passed through to the underlying sts:AssumeRole call.
    extra_args = {}
    if role_session_name:
        extra_args["RoleSessionName"] = role_session_name
    fetcher = AssumeRoleCredentialFetcher(
        client_creator=boto3_session.client,
        source_credentials=source_credentials,
        role_arn=assume_role_arn,
        extra_args=extra_args,
    )
    refreshable_credentials = DeferredRefreshableCredentials(
        method="assume-role",
        refresh_using=fetcher.fetch_credentials,
    )
    botocore_session = BotoSession()
    botocore_session._credentials = refreshable_credentials  # noqa: SLF001
    return botocore_session

Which can be used as follows:

credentials = Credentials(
    access_key=client_access_key_id,
    secret_key=client_secret_access_key,
    token=client_session_token,
)
botocore_session = get_refreshable_botocore_session(
    source_credentials=credentials,
    assume_role_arn=self.config["client_iam_role_arn"],
)
catalog_properties["botocore_session"] = botocore_session
load_catalog(**catalog_properties)

This allows the user to run past the 1-hour IAM role-chaining session limit, which is very useful when reading extremely large tables.

I'd also like to contribute some of this code upstream at some point to support refreshable botocore sessions in both the AWS Glue/DynamoDB clients, as well as the underlying S3 file system code.

kevinjqliu commented 2 months ago

Thanks for raising this issue @BTheunissen

botocore_session is helpful to make available to override in order to support automatically refreshable credentials for long-running jobs. ... This allows the user to elapse over the IAM role-chaining limitation of 1 hour, very useful for reading extremely large tables.

The ability to refresh AWS credentials is important for long-running jobs. Let's open a ticket to track this feature.

See this comment for the reason to deprecate botocore_session. I wonder if there's another way to implement automatically refreshable credentials without using botocore_session.

BTheunissen commented 2 months ago

@kevinjqliu Fair enough: the reason for deprecation is that catalog settings are generally exposed as a Dict[str, str], and passing in a botocore.Session object breaks that convention.

I'd be fine with removing it if a ticket to track credential refresh were written up. I'd take a crack at implementing it, but honestly the workarounds I've had to do to support it, for both the Python boto clients and the underlying filesystem implementations, are pretty hacky. There are existing issues open on the same topic against the Arrow project, as the guidance from AWS on properly supporting refreshable credentials is very spotty.

kevinjqliu commented 2 months ago

@BTheunissen +1, opened #1129 to track this feature. It can stay hacky for now; this feature is generally a nice-to-have for the project.

cshenrik commented 1 month ago

@BTheunissen, I'm in the same situation as you, trying to use PyIceberg with automatically refreshing AWS credentials. Would you be able to share how you made this work with the current version of PyIceberg? The glue catalog picks up the session correctly, but it doesn't use it for accessing S3.

kevinjqliu commented 1 month ago

The glue catalog picks up the session correctly, but it doesn't use it for accessing S3.

you can either set glue and s3 credentials separately or use the unified AWS credential configs https://py.iceberg.apache.org/configuration/#unified-aws-credentials
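For reference, a minimal sketch of the unified approach (property names per the linked configuration page; the region and credential values below are placeholders):

```python
# Sketch: unified AWS credential configs. The same client.* properties are
# picked up by both the Glue catalog client and the S3 FileIO, so the
# credentials only need to be specified once. All values are placeholders.
catalog_properties = {
    "type": "glue",
    "client.region": "us-east-1",
    "client.access-key-id": "AKIA...",
    "client.secret-access-key": "...",
    "client.session-token": "...",
}

# These would then be passed straight to load_catalog:
# from pyiceberg.catalog import load_catalog
# catalog = load_catalog("glue", **catalog_properties)
```

Setting s3.* or glue.* properties instead would scope the credentials to one client, which is useful when the catalog and the data live in different accounts.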

cshenrik commented 1 month ago

The glue catalog picks up the session correctly, but it doesn't use it for accessing S3.

you can either set glue and s3 credentials separately or use the unified AWS credential configs https://py.iceberg.apache.org/configuration/#unified-aws-credentials

I'm setting botocore_session (which is now deprecated), but S3 doesn't use it. The OP mentions that he had to make some pretty hacky workarounds to make the filesystem implementations pick up the botocore_session. I am hoping those workarounds could be shared here.

BTheunissen commented 1 month ago

@cshenrik Sorry for the late reply, I actually did a small internal fork of the library and added the following logic to io/pyarrow:

def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem:
    if scheme in {"s3", "s3a", "s3n"}:
        from pyarrow.fs import S3FileSystem

        client_kwargs: Dict[str, Any] = {
            "endpoint_override": self.properties.get(S3_ENDPOINT),
            "access_key": get_first_property_value(self.properties, S3_ACCESS_KEY_ID, AWS_ACCESS_KEY_ID),
            "secret_key": get_first_property_value(self.properties, S3_SECRET_ACCESS_KEY, AWS_SECRET_ACCESS_KEY),
            "session_token": get_first_property_value(self.properties, S3_SESSION_TOKEN, AWS_SESSION_TOKEN),
            "region": get_first_property_value(self.properties, S3_REGION, AWS_REGION),
        }

        if proxy_uri := self.properties.get(S3_PROXY_URI):
            client_kwargs["proxy_options"] = proxy_uri

        if connect_timeout := self.properties.get(S3_CONNECT_TIMEOUT):
            client_kwargs["connect_timeout"] = float(connect_timeout)

        # Passing these two through enables auto-refreshing assume-role credentials.
        if role_arn := self.properties.get(AWS_ROLE_ARN):
            client_kwargs["role_arn"] = role_arn

        if session_name := self.properties.get(AWS_SESSION_NAME):
            client_kwargs["session_name"] = session_name

        return S3FileSystem(**client_kwargs)
Passing the role_arn and session_name will let the S3 File System automatically refresh the credentials of the AWS C++ client used by the PyArrow file system, pretty tedious but working so far!

cshenrik commented 1 month ago

Thanks for sharing that, @BTheunissen.

I have to call a bespoke webservice for retrieving AWS credentials, so I can't use that implementation directly, but it's still good to see what others did.

kevinjqliu commented 2 hours ago

#1296 added the option to pass role_arn and session_name to pyarrow.fs.S3FileSystem

Passing the role_arn and session_name will let the S3 File System automatically refresh the credentials of the AWS C++ client used by the PyArrow file system, pretty tedious but working so far!

@BTheunissen do you know if passing the role_arn will automatically refresh S3 credentials for long running jobs?

The pyarrow doc just mentions:

AWS Role ARN. If provided instead of access_key and secret_key, temporary credentials will be fetched by assuming this role.