Closed: grez72 closed this pull request 1 week ago
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 72%. Comparing base (cedc6a6) to head (2e7142b). Report is 1 commit behind head on main.

:exclamation: There is a different number of reports uploaded between BASE (cedc6a6) and HEAD (2e7142b). Click for more details.

HEAD has 15 fewer uploads than BASE:
| Flag | BASE (cedc6a6) | HEAD (2e7142b) |
|------|------|------|
| unittests | 6 | 1 |
| windows-2022 | 2 | 1 |
| 3.10 | 3 | 0 |
| ubuntu-22.04 | 2 | 0 |
| 3.9 | 3 | 1 |
| macos-13 | 2 | 0 |
Hi @grez72,
Would you mind enabling "Allow edits by maintainers"? This will help us assist with any necessary updates directly. Please let us know if there’s anything further we can help with.
Thank you!
Hi @bhimrazy, I don't see an "Allow edits by maintainers" option. Perhaps this is because I made the pull request from our organization (harvard-visionlab) rather than from my personal account (grez72)? Should I open a new pull request from a personal-account fork so I can allow edits by maintainers, or is there a way to enable "Allow edits by maintainers" from our harvard-visionlab/litdata fork?
Thanks for clarifying, @grez72!
In that case, could you please assist with the final updates based on @tchaton's suggestion? Here's a proposed change that should help integrate `self._storage_options` into the client setup:
```diff
- if has_shared_credentials_file or not _IS_IN_STUDIO or self._storage_options:
+ if has_shared_credentials_file or not _IS_IN_STUDIO:
      self._client = boto3.client(
          "s3",
          **{
              "config": botocore.config.Config(retries={"max_attempts": 1000, "mode": "adaptive"}),
              **self._storage_options,
          },
      )
  else:
      provider = InstanceMetadataProvider(iam_role_fetcher=InstanceMetadataFetcher(timeout=3600, num_attempts=5))
      credentials = provider.load()
      self._client = boto3.client(
          "s3",
          aws_access_key_id=credentials.access_key,
          aws_secret_access_key=credentials.secret_key,
          aws_session_token=credentials.token,
          config=botocore.config.Config(retries={"max_attempts": 1000, "mode": "adaptive"}),
+         **self._storage_options,
      )
```
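For context on how the `**{...}` merge behaves (a minimal sketch with made-up values, plain Python semantics rather than litdata code): later entries in a dict literal override earlier ones, so keys in `self._storage_options` win over the defaults listed before them.

```python
# Minimal sketch with hypothetical values: later entries in a dict literal
# override earlier ones, so user-supplied options take precedence.
defaults = {"config": "retry-config"}
storage_options = {"config": "user-config", "endpoint_url": "https://example.com"}

merged = {**defaults, **storage_options}
print(merged)
# {'config': 'user-config', 'endpoint_url': 'https://example.com'}
```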
I implemented the suggested change, but ran into a few issues when passing credentials as storage options like so:
```python
dataset = StreamingDataset(
    "s3://path/to/streaming-dataset",
    storage_options=dict(aws_access_key_id=..., aws_secret_access_key=..., endpoint_url=...),
)
```
First, I encountered this error (which makes sense, since my storage options also contain `aws_access_key_id`):
```
boto3.client() got multiple values for keyword argument 'aws_access_key_id'
```
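This is standard Python behavior whenever the same keyword arrives both explicitly and via `**` unpacking; a minimal sketch with a toy function (not boto3):

```python
# Toy reproduction: passing the same keyword both explicitly and via **
# unpacking raises TypeError at the call site.
def make_client(**kwargs):
    return kwargs

make_client(aws_access_key_id="explicit", **{"aws_access_key_id": "from-options"})
# TypeError: make_client() got multiple values for keyword argument 'aws_access_key_id'
```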
I next tried to merge the provider credentials and self._storage_options, like so:
```python
provider = InstanceMetadataProvider(iam_role_fetcher=InstanceMetadataFetcher(timeout=3600, num_attempts=5))
credentials = provider.load()
self._client = boto3.client(
    "s3",
    **{
        "config": botocore.config.Config(retries={"max_attempts": 1000, "mode": "adaptive"}),
        # provider credentials first, then user options so they take precedence
        **dict(
            aws_access_key_id=credentials.access_key,
            aws_secret_access_key=credentials.secret_key,
            aws_session_token=credentials.token,
        ),
        **self._storage_options,
    },
)
```
But here I get a permissions error, because `provider.load()` supplies an `aws_session_token` that doesn't get overwritten by my `storage_options` (which only contain the access key ID and secret key). This could be avoided by passing `aws_session_token=None` in `storage_options`, but that seems counterintuitive from the user's point of view.
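To illustrate the failure mode (hypothetical values only): the merge keeps the provider's session token alongside the user's long-term keys, and S3 rejects that mismatched combination.

```python
# Hypothetical values illustrating the stale-token problem after the merge.
provider_creds = {
    "aws_access_key_id": "ASIA...FROM-PROVIDER",
    "aws_secret_access_key": "provider-secret",
    "aws_session_token": "provider-session-token",
}
user_opts = {
    "aws_access_key_id": "AKIA...FROM-USER",
    "aws_secret_access_key": "user-secret",
    # no aws_session_token here, so the provider's token survives the merge
}

merged = {**provider_creds, **user_opts}
print(merged["aws_session_token"])  # still 'provider-session-token'
```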
I think that if the user passes storage options that include AWS credentials (e.g., `aws_access_key_id` or `aws_session_token`), then none of the credentials returned by the provider are relevant, so perhaps the following would be best. I'm not sure exactly how `aws_session_token` works (so checking for it may not be needed). If this seems reasonable, I can make the change.
```python
has_shared_credentials_file = (
    os.getenv("AWS_SHARED_CREDENTIALS_FILE") == os.getenv("AWS_CONFIG_FILE") == "/.credentials/.aws_credentials"
)
storage_options_include_credentials = (
    "aws_access_key_id" in self._storage_options or "aws_session_token" in self._storage_options
)
if has_shared_credentials_file or not _IS_IN_STUDIO or (_IS_IN_STUDIO and storage_options_include_credentials):
    # build the client from user-supplied options (which may include credentials)
    self._client = boto3.client(
        "s3",
        **{
            "config": botocore.config.Config(retries={"max_attempts": 1000, "mode": "adaptive"}),
            **self._storage_options,
        },
    )
else:
    # no user credentials: fall back to the instance metadata provider
    provider = InstanceMetadataProvider(iam_role_fetcher=InstanceMetadataFetcher(timeout=3600, num_attempts=5))
    credentials = provider.load()
    self._client = boto3.client(
        "s3",
        aws_access_key_id=credentials.access_key,
        aws_secret_access_key=credentials.secret_key,
        aws_session_token=credentials.token,
        config=botocore.config.Config(retries={"max_attempts": 1000, "mode": "adaptive"}),
    )
```
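A quick sanity check of the `storage_options_include_credentials` predicate (assumed example values, not from the PR):

```python
# Options containing an access key select the user-credentials branch.
opts = {"aws_access_key_id": "AKIA...", "aws_secret_access_key": "...", "endpoint_url": "https://example.com"}
print("aws_access_key_id" in opts or "aws_session_token" in opts)  # True -> first branch

# Empty options inside the Studio still fall through to the metadata provider.
print("aws_access_key_id" in {} or "aws_session_token" in {})  # False -> else branch
```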
Thank you, @grez72, for testing this and sharing such detailed feedback—it’s greatly appreciated! 🙌
I feel the initial approach might be more straightforward, avoiding any additional complexity. @tchaton, what are your thoughts on this?
Yes, fine by me ;)
**What does this PR do?**
This PR fixes an issue in the `S3Client` class where user-provided `storage_options` are ignored in the Lightning AI Studio environment. The client now uses user-supplied credentials when provided in `storage_options`.
Fixes #414