datafusion-contrib / datafusion-objectstore-s3

S3 as an ObjectStore for DataFusion
Apache License 2.0

Support creating client specific configs for different buckets #25

Open matthewmturner opened 2 years ago

matthewmturner commented 2 years ago

...it's very common to set up different access control for different buckets, so we will need to support creating different clients with specific configs for different buckets in the future. For example, in our production environment, we have Spark jobs that access different buckets hosted in different AWS accounts.

Originally posted by @houqp in https://github.com/datafusion-contrib/datafusion-objectstore-s3/issues/20#issuecomment-1019425889

With context provided by @houqp:

An IAM policy attached to IAM users (via access/secret key) is easier to get started with. For a more secure and production-ready setup, you would want to use an IAM role instead of IAM users so there are no long-lived secrets. The place where things get complicated is cross-account S3 write access. In order to do this, you need to assume an IAM role in the S3 bucket owner's account to perform the write; otherwise the bucket owner account won't truly own the newly written objects, and as a result won't be able to further share those objects with other accounts. In short, in some cases, the object store needs to assume and switch between different IAM roles depending on which bucket it is writing to. For cross-account S3 reads we don't have this problem, so you can usually get by with a single IAM role.
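The read/write asymmetry described above can be captured as a small role-selection rule. The sketch below is purely illustrative: the `RoleConfig` and `role_for` names are hypothetical and not part of this crate or aws-sdk-rust; it only models the decision "one shared role for reads, a bucket-owner role for writes."

```rust
use std::collections::HashMap;

/// Which S3 operation we are about to perform.
enum Op {
    Read,
    Write,
}

/// Hypothetical per-bucket role configuration (not a real crate type).
struct RoleConfig {
    /// A single role that has cross-account read access everywhere.
    default_role: String,
    /// Per-bucket roles owned by each bucket's account, needed for writes
    /// so the bucket owner truly owns newly written objects.
    write_roles: HashMap<String, String>,
}

impl RoleConfig {
    /// Pick the IAM role ARN to assume for an operation on `bucket`.
    fn role_for(&self, bucket: &str, op: Op) -> &str {
        match op {
            // Cross-account reads work fine with one shared role.
            Op::Read => &self.default_role,
            // Writes should assume a role in the bucket owner's account;
            // fall back to the default role for same-account buckets.
            Op::Write => self
                .write_roles
                .get(bucket)
                .map(String::as_str)
                .unwrap_or(&self.default_role),
        }
    }
}
```

A real implementation would then hand the selected role ARN to an STS assume-role credentials provider when building the client for that bucket.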

And potential designs also provided by @houqp:

  1. Maintain a set of protocol-specific clients internally within the S3 object store implementation, one per bucket.

  2. Extend the ObjectStore abstraction in DataFusion to support a hierarchy-based object store lookup, i.e. first look up an object-store-specific URI key generator by scheme, then compute a unique object store key for the given URI for the actual object store lookup.

I am leaning towards option 1 because it doesn't force this complexity onto all object stores. For example, a local file object store will never need to dispatch to different clients based on file path. @yjshen curious what your thoughts are on this.
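Option 1 above amounts to a per-bucket client registry inside the S3 object store, with a default client as the fallback. A minimal sketch, assuming a stand-in `S3Client` struct and a hypothetical `client_for` method (neither is the crate's real API, where the client would come from aws-sdk-rust):

```rust
use std::collections::HashMap;

/// Stand-in for a configured S3 client (the real crate would hold an
/// aws-sdk-rust client built with bucket-specific credentials/region).
#[derive(Clone, Debug, PartialEq)]
struct S3Client {
    region: String,
    role_arn: Option<String>,
}

/// Sketch of option 1: the S3 object store owns a map of bucket-specific
/// clients and dispatches internally, so the ObjectStore trait is unchanged.
struct S3FileSystem {
    default_client: S3Client,
    bucket_clients: HashMap<String, S3Client>,
}

impl S3FileSystem {
    /// Resolve the client for a URI like "s3://bucket/key",
    /// falling back to the default client for unknown buckets.
    fn client_for(&self, uri: &str) -> &S3Client {
        let bucket = uri
            .strip_prefix("s3://")
            .and_then(|rest| rest.split('/').next())
            .unwrap_or("");
        self.bucket_clients
            .get(bucket)
            .unwrap_or(&self.default_client)
    }
}
```

Because the dispatch is internal, DataFusion's scheme-based object store registry stays as-is, which is exactly why this option avoids pushing the complexity into every other object store.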

matthewmturner commented 2 years ago

@seddonm1 FYI created this to continue conversations on the topic.

Do you think that this should be a requirement before publishing the crate?

seddonm1 commented 2 years ago

@matthewmturner I think this is an edge case, but up to @houqp to answer.

Most users will never use this functionality, so I think we can easily publish a 0.1 pending the DataFusion release, and this can be added after.

matthewmturner commented 2 years ago

@seddonm1 i saw you raised https://github.com/awslabs/aws-sdk-rust/issues/425.
Would you like something like what was proposed there to be added as a type of credentials provider?

houqp commented 2 years ago

@seddonm1 definitely not a blocker for crates.io release :) Just a feature we can work on later.

seddonm1 commented 2 years ago

@matthewmturner that request was around being able to access public buckets, which is independent of this request.

matthewmturner commented 2 years ago

@seddonm1 yes, understood that it's separate from this - just wasn't sure if you wanted to add a new issue for that functionality.