apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

object_store: Use the official AWS SDK for s3 interactions #5143

Closed chitralverma closed 9 months ago

chitralverma commented 10 months ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do. Recently the official AWS SDK for Rust was made GA. https://aws.amazon.com/blogs/developer/announcing-general-availability-of-the-aws-sdk-for-rust/

This ticket is created to discuss the potential to refactor the existing s3 implementation in favour of using the new SDK.

Describe the solution you'd like remove the current HTTP(s) based implementation in favour of the official SDK https://docs.rs/aws-sdk-s3/latest/aws_sdk_s3/

Describe alternatives you've considered

Additional context

chitralverma commented 10 months ago

cc: @tustvold @alamb

tustvold commented 10 months ago

We moved away from SDKs to keep the dependency footprint down and provide better consistency across the implementations

https://github.com/apache/arrow-rs/issues/2176

What would be the advantage of such a change?

crepererum commented 10 months ago

Regarding dependency footprint: https://crates.io/crates/aws-sdk-s3/1.4.0/dependencies

aws-sdk-s3 comes with 13 AWS-specific crates (direct dependencies only, haven't looked at indirect ones), 9 of which are used for the smithy API generator alone.


Also AWS is not the only implementation of the S3 API out there and I doubt they support features (like conditional operations) that other S3 implementations support.

alamb commented 10 months ago

@chitralverma what would be the benefit of using the official SDK? Are there features it supports that are not supported by the custom one?

I can imagine potentially some of the more advanced access controls or something

Since the ObjectStore is a trait, perhaps we could create a new crate (could start outside of this repo) that uses the official SDK that could be used by anyone who needs the extra features

kszlim commented 10 months ago

I guess one benefit is getting stuff like https://github.com/apache/arrow-rs/issues/5140 for close to free

tustvold commented 9 months ago

Closing as not planned, thank you all for the discussion

edmondop commented 1 month ago

One of the limitations of the current object-store implementation is that its authentication support is more restrictive than what AWS itself provides; see, for example, https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-sourcing-external.html. How can this feature be added to the current implementation?

tustvold commented 1 month ago

How can this feature be added to the current implementation?

Users can provide their own credential providers - https://docs.rs/object_store/latest/object_store/aws/struct.AmazonS3Builder.html#method.with_credentials

This could even be using the AWS SDK if they so wish
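For anyone wanting a concrete picture of that extension point, a provider along these lines could be passed to `AmazonS3Builder::with_credentials`. This is only a sketch: the trait below is a simplified, synchronous stand-in for object_store's actual async `CredentialProvider` trait, and `StaticCredentialProvider` is a hypothetical name.

```rust
use std::sync::Arc;

/// AWS-style credentials, mirroring the shape of object_store's AwsCredential.
#[derive(Debug, Clone, PartialEq)]
pub struct AwsCredential {
    pub key_id: String,
    pub secret_key: String,
    pub token: Option<String>,
}

/// Simplified, synchronous stand-in for object_store's CredentialProvider
/// trait (the real trait is async and returns object_store::Result).
pub trait CredentialProvider {
    type Credential;
    fn get_credential(&self) -> Result<Arc<Self::Credential>, String>;
}

/// A provider that always hands back the same credentials, e.g. ones
/// sourced from an external system at construction time.
#[derive(Debug)]
pub struct StaticCredentialProvider {
    credential: Arc<AwsCredential>,
}

impl StaticCredentialProvider {
    pub fn new(key_id: &str, secret_key: &str, token: Option<&str>) -> Self {
        Self {
            credential: Arc::new(AwsCredential {
                key_id: key_id.to_string(),
                secret_key: secret_key.to_string(),
                token: token.map(str::to_string),
            }),
        }
    }
}

impl CredentialProvider for StaticCredentialProvider {
    type Credential = AwsCredential;

    fn get_credential(&self) -> Result<Arc<AwsCredential>, String> {
        Ok(Arc::clone(&self.credential))
    }
}
```

A real implementation would fetch or refresh credentials inside `get_credential` (possibly via the AWS SDK) rather than holding a fixed value.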

edmondop commented 1 month ago

For users that are using those libraries via a Python wrapper, how should this work? I.e., object_store imported by Polars and used from Python.

tustvold commented 1 month ago

That is a question for the maintainers of the python wrapper

alamb commented 1 month ago

So it seems to me the summary of this ticket is:

  1. There is a legitimate need for certain authentication / other features that are not implemented in the object_store crate's re-implementation of the S3 API
  2. The "official" AWS s3 sdk provides these other features but at the expense of a significantly larger dependency set

Thus I propose (and I think this is what @tustvold is also proposing) that the first step is for someone to implement the ObjectStore interface using the official AWS S3 SDK outside of the object_store crate. This would mean something like:

struct OfficialAmazonS3 {
    // fields (SDK client, bucket name, etc.) go here
}

impl ObjectStore for OfficialAmazonS3 {
    // implementation based on the AWS SDK goes here
}

I won't have time to devote to this project (likely ever) but I think it would be straightforward for anyone with the need (the earlier versions of object_store were actually implemented this way).

Once that code is complete, I think we could then have a separate discussion about "is there willingness to maintain that implementation in the object_store crate" (we would have to find some maintainer bandwidth to do so) to ease distribution

But the first step in all possible futures is for someone to implement OfficialAmazonS3.

tustvold commented 1 month ago

If it is just credentials, implementing a CredentialProvider would be sufficient; I believe datafusion-cli might already be doing this, and it is just a couple of lines of code.

If the desire is to use this from Python, it will need a conversation with the maintainers of the Python bindings on whether to take this approach or to expose CredentialProvider in the Python interface so it can be used with boto.

A full reimplementation using the AWS SDK is possible, but is a non-trivial amount of work, and would need people to drive it as Andrew describes

Xuanwo commented 1 month ago
  1. There is a legitimate need for certain authentication / other features that are not implemented in the object_store's re-implementation of the S3

Just FYI, I'm working on a project reqsign focused on implementing all the request signing features provided by AWS (and all other cloud providers). I believe it would be beneficial to have a separate signing crate instead of the entire AWS SDK.

It's being refactored to simplify its layout and significantly reduce its dependency tree. We can discuss the details later.

edmondop commented 1 month ago

One specific problem we have is how we inject this logic when the object store is wrapped by a crate which exposes python bindings, for example delta-rs, polars or DataFusion-python

Let's say this crate already exists, how can we integrate it with those frameworks above?

Xuanwo commented 1 month ago

Let's say this crate already exists, how can we integrate it with those frameworks above?

Reqsign will provide a Python binding, allowing the Python ecosystem to integrate with it directly. Like what we do for object_store and arrow.

tustvold commented 1 month ago

Let's say this crate already exists, how can we integrate it with those frameworks above?

Either the same way datafusion-cli does by overriding the credential provider at construction time, or by the maintainers of said python wrapper adding first class support for credential providers. We already provide sufficient extension points for them to do this

It's being refactored to simplify its layout and significantly reduce its dependency tree

So long as we preserve existing functionality, including around things like the client configuration used to make credential requests, and someone is willing to drive this, no objections from me

alamb commented 3 weeks ago

@edmondop and I spoke about this at the NYC DataFusion meetup yesterday.

I think the use case is that

  1. User uses object_store indirectly via polars
  2. polars does not provide any way to modify / configure s3 connections at runtime

If they could control the source, they could use the existing object_store CredentialProvider trait, but they control neither the source nor the distribution.

@edmondop mentioned something about some AWS SDK mechanism that calls out to an external program / process to get credentials. While less efficient, this would allow someone to plug in whatever authentication mechanism they wanted without having to change the source code.

Perhaps we can follow this model somehow and build a "get credentials from an external process / source" mechanism into the object store, to provide a way that did not require the AWS SDK, nor a big dependency tree, but could support arbitrary authentication 🤔

edmondop commented 3 weeks ago

Thank you @alamb. My proposal is that we extend object_store to support an additional way of providing credentials, called a credential_provider.

This process is also implemented as part of the AWS SDK, as described here, and has the strong benefit of keeping the object_store crate simple while supporting users in providing their custom authentication mechanism.

ObjectStore could then be extended to read a custom environment variable and use the credential provider instead. The documentation from AWS provides the type of schema required by the SDK, which we can follow:

{
    "Version": 1,
    "AccessKeyId": "an AWS access key",
    "SecretAccessKey": "your AWS secret access key",
    "SessionToken": "the AWS session token for temporary credentials", 
    "Expiration": "RFC3339 timestamp for when the credentials expire"
}  
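As a rough sketch of what consuming that schema could look like: the snippet below spawns an external command and pulls the credential fields out of the JSON it prints. Both function names are hypothetical, and `extract_json_string` is a deliberately naive, dependency-free helper for illustration; a real implementation would use serde_json.

```rust
use std::process::Command;

/// Naively extract the string value of `field` from a flat JSON object.
/// Illustration only: does not handle escapes, nesting, or non-string values.
fn extract_json_string(json: &str, field: &str) -> Option<String> {
    let needle = format!("\"{}\"", field);
    let start = json.find(&needle)? + needle.len();
    let rest = &json[start..];
    let open = rest.find('"')? + 1; // opening quote of the value
    let rest = &rest[open..];
    let close = rest.find('"')?; // closing quote of the value
    Some(rest[..close].to_string())
}

/// Run an external credential process and parse (AccessKeyId,
/// SecretAccessKey) from the JSON it prints on stdout, per the schema above.
fn credentials_from_process(program: &str, args: &[&str]) -> Option<(String, String)> {
    let output = Command::new(program).args(args).output().ok()?;
    if !output.status.success() {
        return None;
    }
    let json = String::from_utf8(output.stdout).ok()?;
    let key_id = extract_json_string(&json, "AccessKeyId")?;
    let secret = extract_json_string(&json, "SecretAccessKey")?;
    Some((key_id, secret))
}
```

A production version would also honor the `Expiration` field and re-invoke the process when cached credentials expire.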

I am sure many have already implemented a custom credential provider, and the object_store implementation would be minimal. It would allow people to use object_store, and therefore DataFusion and Polars, with complex authentication mechanisms without polluting the object_store crate.

What do you think @tustvold? Is this solution acceptable? Can I open a new issue for it?

tustvold commented 3 weeks ago

My only worry would be that users might provide "untrusted" user input to the config system, and this would then allow for a limited form of RCE.

Perhaps we could just provide a CredentialProvider that calls a process, and leave it to polars to hook it up based on a config? This would avoid surprises for people using object_store in a server context?

Edit: it might also be worthwhile filing a polars issue about making CredentialProvider overridable, as this might be generally useful for them

Edit edit: IIRC we provide a mechanism to use the Azure and GCP CLIs to source tokens, perhaps one option would be to add something similar for S3?

alamb commented 3 weeks ago

Perhaps we could just provide a CredentialProvider that calls a process, and leave it to polars to hook it up based on a config? This would avoid surprises for people using object_store in a server context?

Yes, I agree it should be something explicitly enabled

Edit: it might also be worthwhile filing a polars issue about making CredentialProvider overridable, as this might be generally useful for them

I think most users of Polars use the Python API (aka are not compiling Rust programs), which means Rust-level APIs (if I understand what you are proposing) are not likely to satisfy the use case for their users

Edit edit: IIRC we provide a mechanism to use the Azure and GCP CLIs to source tokens, perhaps one option would be to add something similar for S3?

That certainly sounds like a good idea to me (follow the same model as the others)

alamb commented 3 weeks ago

@edmondop -- can you pull the relevant discussion into a new (non closed) ticket ?

tustvold commented 3 weeks ago

I think most users of Polars use the python API

I meant some way to expose it in the python API

edmondop commented 3 weeks ago

Opened this https://github.com/apache/arrow-rs/issues/6422

tustvold commented 2 weeks ago

FYI I have filed an upstream polars issue as this seems to be a fairly frequent source of these reports - https://github.com/pola-rs/polars/issues/18979