Closed chitralverma closed 9 months ago
cc: @tustvold @alamb
We moved away from SDKs to keep the dependency footprint down and provide better consistency across the implementations
https://github.com/apache/arrow-rs/issues/2176
What would be the advantage of such a change?
Regarding dependency footprint: https://crates.io/crates/aws-sdk-s3/1.4.0/dependencies
aws-sdk-s3
comes with 13 AWS-specific crates (direct dependencies only, haven't looked at indirect ones), 9 of which are used for the API generator generator smithy
alone.
Also AWS is not the only implementation of the S3 API out there and I doubt they support features (like conditional operations) that other S3 implementations support.
@chitralverma what would be the benefit of using the official SDK? Are there features it supports that are not supported by the custom one?
I can imagine potentially some of the more advanced access controls or something
Since the ObjectStore is a trait, perhaps we could create a new crate (could start outside of this repo) that uses the official SDK that could be used by anyone who needs the extra features
I guess one benefit is getting stuff like https://github.com/apache/arrow-rs/issues/5140 for close to free
Closing as not planned, thank you all for the discussion
One of the limitations of the current object-store implementation is that its authentication support is more restrictive than the one provided by AWS, for example. https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-sourcing-external.html. How can this feature be added to the current implementation?
How can this feature be added to the current implementation?
Users can provide their own credential providers - https://docs.rs/object_store/latest/object_store/aws/struct.AmazonS3Builder.html#method.with_credentials
This could even be using the AWS SDK if they so wish
for users that are using those libraries via a Python wrapper, how should this work? I.e. object_store imported by polars used in Python
That is a question for the maintainers of the python wrapper
So it seems to me the summary of this ticket is:
Thus I propose (and I think this is what @tustvold is also proposing) for the first step is that someone implements the ObjectStore
interface using the official AWS S3 SDK outside of the object_store crate. This would mean something like
struct OfficalAmazonS3 {
...
}
impl ObjectStore for OfficialAmazonS3 {
// implementation based on aws sdk goes here
}
I won't have time to devote to this project (likely ever) but I think it would be straightforward for anyone with the need (the earlier versions of object_store were actually implemented this way).
Once that code is complete, I think we could then have a separate discussion about "is there willingness to maintain that implementation in the object_store
crate" (we would have to find some maintainer bandwidth to do so) to ease distribution
But the first step in all possible futures is for someone to implement OfficialAmazonS3
.
If it is just credentials, it would be sufficient to just implement a CredentialProvider, I believe datafusion-cli might already be doing this and is just a couple of lines of code.
If the desire it to use this from python, it will need a conversation with the maintainers of the python bindings on whether to take this approach or to expose CredentialProvider in the python interface so it can be used with boto.
A full reimplementation using the AWS SDK is possible, but is a non-trivial amount of work, and would need people to drive it as Andrew describes
- There is a legitamate need for certain authentication / other features that are not implemented in the object_store's re-implementation of the S3
Just FYI, I'm working on a project reqsign focused on implementing all the request signing features provided by AWS (and all other cloud providers). I believe it would be beneficial to have a separate signing crate instead of the entire AWS SDK.
It's being refactored to simplify its layout and significantly reduce its dependency tree. We can discuss the details later.
One specific problem we have is how we inject this logic when the object store is wrapped by a crate which exposes python bindings, for example delta-rs, polars or DataFusion-python
Let's say this crate would already exists, how can we integrate it with those frameworks above ?
Let's say this crate would already exists, how can we integrate it with those frameworks above ?
Reqsign will provide a Python binding, allowing the Python ecosystem to integrate with it directly. Like what we do for object_store
and arrow
.
Let's say this crate would already exists, how can we integrate it with those frameworks above ?
Either the same way datafusion-cli does by overriding the credential provider at construction time, or by the maintainers of said python wrapper adding first class support for credential providers. We already provide sufficient extension points for them to do this
It's being refactored to simplify its layout and significantly reduce its dependency tree
So long as we preserve existing functionality, including around things like the client configuration used to make credential requests, and someone is willing to drive this, no objections from me
@edmondop and I spoke about this at the NYC DataFUsion meetup yesterday
I think the usecase is that
polars
polars
does not provide any way to modify / configure s3 connections at runtimeIf they could control the source, they could use the existing object_store CredentialProvider
trait but they control neither the source or distribution
@edmondop mentioned something about some AWS SDK mechanism that calls out to an external program / process to get credentials. While less efficient this would allow someone to plug in whatever authentication mechanism they wanted without having to change the source code
Perhaps we can follow this model somehow and build a "get credentials from an external process / souce" into the object store to provide a way that did not require the aws sdk, nor a bug dependency tree, but could support arbitrary authentication 🤔
Thank you @alamb . My proposal is that we extend object_store to support an additional way of providing credentials, called credential_provider
.
This process is also implemented as a part of the AWS SDK as described here and have the strong benefit of keeping the object_store crate simple, but support users in providing their custom authentication mechanism.
ObjectStore could be then extended to use a custom environment variable and use the credential provider instead. The documentation from AWS provides the type of schema required by the SDK, which we can follow
{
"Version": 1,
"AccessKeyId": "an AWS access key",
"SecretAccessKey": "your AWS secret access key",
"SessionToken": "the AWS session token for temporary credentials",
"Expiration": "RFC3339 timestamp for when the credentials expire"
}
I am sure many have already implemented a custom credential provider, and the object store implementation would be minimal. It would allow people to use objecstore, and thefore datafusion and polars, with complex authentication mechanisms without polluting the objecstore crate
what do you tihnk @tustvold ? Is this solution acceptable? Can I open a new issue for it?
My only worry would be that users might provide "untrusted" user input to the config system, and this would then allow for a limited form of RCE.
Perhaps we could just provide a CredentialProvider that calls a process, and leave it to polars to hook it up based on a config? This would avoid surprises for people using object_store in a server context?
Edit: it might also be worthwhile filing a polars issue about making CredentialProvider overridable, as this might be generally useful for them
Edit edit: IIRC we provide a mechanism to use the Azure and GCP CLIs to source tokens, perhaps one option would be to add something similar for S3?
Perhaps we could just provide a CredentialProvider that calls a process, and leave it to polars to hook it up based on a config? This would avoid surprises for people using object_store in a server context?
Yes, I agree it should be something explicitly enabled
Edit: it might also be worthwhile filing a polars issue about making CredentialProvider overridable, as this might be generally useful for them
I think most users of Polars use the python API (aka are not compiling rust programs) which means Rust level APIs (if I understand what you are proposing) is not likely to satisfy the usecase for their users
Edit edit: IIRC we provide a mechanism to use the Azure and GCP CLIs to source tokens, perhaps one option would be to add something similar for S3?
That certainly sounds like a good idea to me (follow the same model as the others)
@edmondop -- can you pull the relevant discussion into a new (non closed) ticket ?
I think most users of Polars use the python API
I meant some way to expose it in the python API
Opened this https://github.com/apache/arrow-rs/issues/6422
FYI I have filed an upstream polars issue as this seems to be a fairly frequent source of these reports - https://github.com/pola-rs/polars/issues/18979
Is your feature request related to a problem or challenge? Please describe what you are trying to do. Recently the official AWS SDK for Rust was made GA. https://aws.amazon.com/blogs/developer/announcing-general-availability-of-the-aws-sdk-for-rust/
This ticket is created to discuss the potential to refactor the existing s3 implementation in favour of using the new SDK.
Describe the solution you'd like remove the current HTTP(s) based implementation in favour of the official SDK https://docs.rs/aws-sdk-s3/latest/aws_sdk_s3/
Describe alternatives you've considered
Additional context