Lightning-AI / litdata

Transform datasets at scale. Optimize datasets for fast AI model training.
Apache License 2.0

Change S3Client to use user-provided storage_options even in Studio #415

Closed: grez72 closed this pull request 1 week ago

grez72 commented 2 weeks ago

What does this PR do?

This PR fixes an issue in the S3Client class where user-provided storage_options are ignored in the Lightning AI Studio environment. The client now uses user-supplied credentials when provided in storage_options.

Fixes #414
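
For reference, here is a minimal sketch of the usage this change is meant to support (the bucket path and credential values are placeholders):

```python
from litdata import StreamingDataset

# Placeholder bucket and credentials; before this fix, user-supplied
# storage_options like these were ignored inside the Studio environment
# in favor of the instance metadata credentials.
dataset = StreamingDataset(
    "s3://my-bucket/my-optimized-dataset",
    storage_options=dict(
        aws_access_key_id="<ACCESS_KEY_ID>",
        aws_secret_access_key="<SECRET_ACCESS_KEY>",
        endpoint_url="https://s3.example.com",
    ),
)
```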

codecov[bot] commented 2 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 72%. Comparing base (cedc6a6) to head (2e7142b). Report is 1 commit behind head on main.

:exclamation: There is a different number of reports uploaded between BASE (cedc6a6) and HEAD (2e7142b).

HEAD has 15 fewer uploads than BASE:

| Flag | BASE (cedc6a6) | HEAD (2e7142b) |
|------|------|------|
| unittests | 6 | 1 |
| windows-2022 | 2 | 1 |
| 3.10 | 3 | 0 |
| ubuntu-22.04 | 2 | 0 |
| 3.9 | 3 | 1 |
| macos-13 | 2 | 0 |
Additional details and impacted files

```diff
@@           Coverage Diff            @@
##            main    #415     +/-   ##
========================================
- Coverage      78%     72%      -7%
========================================
  Files          34      34
  Lines        5056    5042      -14
========================================
- Hits         3964    3618     -346
- Misses       1092    1424     +332
```
bhimrazy commented 2 weeks ago

Hi @grez72,

Would you mind enabling "Allow edits by maintainers"? This will help us assist with any necessary updates directly. Please let us know if there’s anything further we can help with.

Thank you!

grez72 commented 2 weeks ago

Hi @bhimrazy, I don't see an "Allow edits by maintainers" option. Perhaps this is because I made the pull request from our organization (harvard-visionlab) rather than from my personal account (grez72)? Should I open a new pull request from a personal-account fork so I can allow edits by maintainers, or is there a way to enable "Allow edits by maintainers" from our harvard-visionlab/litdata fork?

bhimrazy commented 2 weeks ago

Thanks for clarifying, @grez72!

In that case, could you please assist with the final updates based on @tchaton's suggestion? Here’s a proposed change that should help integrate self._storage_options into the client setup:

-        if has_shared_credentials_file or not _IS_IN_STUDIO or self._storage_options:
+        if has_shared_credentials_file or not _IS_IN_STUDIO:
            self._client = boto3.client(
                "s3",
                **{
                    "config": botocore.config.Config(retries={"max_attempts": 1000, "mode": "adaptive"}),
                    **self._storage_options,
                },
            )
        else:
            provider = InstanceMetadataProvider(iam_role_fetcher=InstanceMetadataFetcher(timeout=3600, num_attempts=5))
            credentials = provider.load()
            self._client = boto3.client(
                "s3",
                aws_access_key_id=credentials.access_key,
                aws_secret_access_key=credentials.secret_key,
                aws_session_token=credentials.token,
                config=botocore.config.Config(retries={"max_attempts": 1000, "mode": "adaptive"}),
+              **self._storage_options,
            )
grez72 commented 1 week ago

I implemented the suggested change, but ran into a few issues when passing credentials as storage options like so:

dataset = StreamingDataset(
    "s3://path/to/streaming-dataset,
    storage_options=dict(aws_access_key_id=..., aws_secret_access_key=..., endpoint_url=...)
)

First, I encountered this error (which makes sense, since my storage options also contain aws_access_key_id):

boto3.client() got multiple values for keyword argument 'aws_access_key_id'
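
This failure mode is plain Python keyword handling rather than anything boto3-specific; a minimal stand-in function (make_client is hypothetical) reproduces it:

```python
# Hypothetical stand-in for boto3.client: passing the same keyword both
# explicitly and via ** unpacking raises TypeError at call time.
def make_client(service, aws_access_key_id=None, **kwargs):
    return service, aws_access_key_id, kwargs

try:
    make_client(
        "s3",
        aws_access_key_id="FROM_PROVIDER",  # explicit keyword argument
        **{"aws_access_key_id": "FROM_STORAGE_OPTIONS"},  # duplicate via unpacking
    )
except TypeError as err:
    print(err)  # got multiple values for keyword argument 'aws_access_key_id'
```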

I next tried to merge the provider credentials and self._storage_options, like so:

provider = InstanceMetadataProvider(iam_role_fetcher=InstanceMetadataFetcher(timeout=3600, num_attempts=5))
credentials = provider.load()
self._client = boto3.client(
    "s3",
    **{
        "config": botocore.config.Config(retries={"max_attempts": 1000, "mode": "adaptive"}),
        **dict(
            aws_access_key_id=credentials.access_key,
            aws_secret_access_key=credentials.secret_key,
            aws_session_token=credentials.token,
        ),
        **self._storage_options,
    },
)

But here I get a permissions error, because provider.load() returns an aws_session_token that doesn't get overwritten by my storage_options (which only contain the key id and secret key). This could be avoided by passing aws_session_token=None in storage_options, but that seems counterintuitive from the user's point of view.
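
For illustration, here is the dict-merge behavior at fault, with placeholder values: the later unpacking only overrides keys it actually contains, so the provider's session token survives next to the user's key pair:

```python
# Placeholder values illustrating the merge-order problem described above.
provider_credentials = dict(
    aws_access_key_id="PROVIDER_KEY",
    aws_secret_access_key="PROVIDER_SECRET",
    aws_session_token="PROVIDER_TOKEN",
)
storage_options = dict(
    aws_access_key_id="USER_KEY",
    aws_secret_access_key="USER_SECRET",
    # no aws_session_token key here, so the provider's token is kept below
)
merged = {**provider_credentials, **storage_options}
print(merged["aws_session_token"])  # PROVIDER_TOKEN: now paired with USER_KEY
```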

I think that if the user passes storage options that include AWS credentials (e.g., "aws_access_key_id" or "aws_session_token"), then none of the credentials returned by the provider are relevant, so maybe the following would be best? I'm not sure how the aws_session_token works (so checking for it may not be needed). If this seems reasonable, I can make the change.

has_shared_credentials_file = (
    os.getenv("AWS_SHARED_CREDENTIALS_FILE") == os.getenv("AWS_CONFIG_FILE") == "/.credentials/.aws_credentials"
)

storage_options_include_credentials = (
    "aws_access_key_id" in self._storage_options or "aws_session_token" in self._storage_options
)

if has_shared_credentials_file or not _IS_IN_STUDIO or (_IS_IN_STUDIO and storage_options_include_credentials):
    self._client = boto3.client(
        "s3",
        **{
            "config": botocore.config.Config(retries={"max_attempts": 1000, "mode": "adaptive"}),
            **self._storage_options,
        },
    )
else:
    provider = InstanceMetadataProvider(iam_role_fetcher=InstanceMetadataFetcher(timeout=3600, num_attempts=5))
    credentials = provider.load()
    self._client = boto3.client(
        "s3",
        aws_access_key_id=credentials.access_key,
        aws_secret_access_key=credentials.secret_key,
        aws_session_token=credentials.token,
        config=botocore.config.Config(retries={"max_attempts": 1000, "mode": "adaptive"}),
    )
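
To illustrate how the proposed condition routes between the two branches, here is a standalone sketch (uses_user_client is a hypothetical helper; it assumes we are in Studio with no shared credentials file):

```python
# Standalone sketch of the branch selection above; all names are local
# to this example, and we assume Studio with no shared credentials file.
_IS_IN_STUDIO = True
has_shared_credentials_file = False

def uses_user_client(storage_options: dict) -> bool:
    storage_options_include_credentials = (
        "aws_access_key_id" in storage_options or "aws_session_token" in storage_options
    )
    return (
        has_shared_credentials_file
        or not _IS_IN_STUDIO
        or (_IS_IN_STUDIO and storage_options_include_credentials)
    )

print(uses_user_client({"aws_access_key_id": "USER_KEY"}))  # True  -> user credentials
print(uses_user_client({"endpoint_url": "https://..."}))    # False -> provider credentials
```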
bhimrazy commented 1 week ago

Thank you, @grez72, for testing this and sharing such detailed feedback—it’s greatly appreciated! 🙌

I feel the initial approach might be more straightforward, avoiding any additional complexity. @tchaton, what are your thoughts on this?

tchaton commented 1 week ago

Yes, fine by me ;)