datafusion-contrib / datafusion-objectstore-s3

S3 as an ObjectStore for DataFusion
Apache License 2.0

remove the explicit bucket and derive from url like normal s3 #20

Closed · seddonm1 closed this 2 years ago

seddonm1 commented 2 years ago

As mentioned in https://github.com/datafusion-contrib/datafusion-objectstore-s3/issues/19, the current implementation is incorrect because it requires a specific bucket to be named up front for the data to be read from.

This is wrong because the DataFusion ObjectStoreRegistry registers object stores by URI scheme (like s3:// or file://), so it would be impossible to register s3:// sources from two different buckets.

By encoding the bucket name into the file path as the first segment before the /, we can satisfy the ObjectStoreRegistry requirements and support multiple buckets.
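For illustration, a minimal sketch of that split, assuming the object store receives paths whose first segment is the bucket name (the function name and example path are hypothetical, not the crate's actual API):

    // Split "bucket_one/data/alltypes_plain.parquet" into the bucket and the
    // object key: the bucket is simply the first path segment.
    fn split_bucket_and_key(path: &str) -> (&str, &str) {
        let trimmed = path.trim_start_matches('/');
        trimmed.split_once('/').unwrap_or((trimmed, ""))
    }

    fn main() {
        assert_eq!(
            split_bucket_and_key("bucket_one/data/alltypes_plain.parquet"),
            ("bucket_one", "data/alltypes_plain.parquet")
        );
    }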

seddonm1 commented 2 years ago

FYI this means you do this:

    // Build the execution context and register the S3 object store under the
    // "s3" URI scheme. No bucket is passed here any more; the bucket is
    // derived from the first segment of each object path.
    let execution_config = ExecutionConfig::new().with_batch_size(32768);
    let mut execution_ctx = ExecutionContext::with_config(execution_config);
    execution_ctx.register_object_store(
        "s3",
        // Passing None for each option falls back to the defaults.
        Arc::new(AmazonS3FileSystem::new(None, None, None, None, None, None).await),
    );
matthewmturner commented 2 years ago

Do you think we could test this by mapping the testing data twice as two different volumes, so we could test two different "buckets"?

Like:

    --volume "$(pwd)/parquet-testing:/bucket_one" \
    --volume "$(pwd)/parquet-testing:/bucket_two" \
seddonm1 commented 2 years ago

Yes, I can add some more tests like you suggest.

seddonm1 commented 2 years ago

@matthewmturner I added a test. Unfortunately, given how MinIO works, we can only mount one location:

ERROR Invalid command line arguments: Cross-device mounts detected on path (/data) at following locations [/data/data1 /data/data0]. Export path should not have any sub-mounts, refusing to start.

I have added a test that reads from the bad_data path and ensures the panic message relates to a corrupt Parquet file rather than a file-not-found style error.
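For reference, a sketch of the shape of that check; the helper name, object path, and expected panic substring are illustrative assumptions rather than the actual test code in this PR:

    // Reading a deliberately corrupt file should panic while decoding the
    // Parquet data, which proves the object was located and fetched from the
    // bucket rather than failing with a "not found" style error.
    #[tokio::test]
    #[should_panic(expected = "Parquet")]
    async fn read_corrupt_parquet_reports_parquet_error() {
        // `read_parquet_from_s3` is a hypothetical helper standing in for the
        // register-and-collect steps used by the real test.
        read_parquet_from_s3("s3://data/bad_data/PARQUET-1481.parquet").await;
    }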

matthewmturner commented 2 years ago

@seddonm1 ah, that's unfortunate, but that's definitely a good workaround!

seddonm1 commented 2 years ago

@matthewmturner I have rebased so this is ready to merge once tests pass 👍

houqp commented 2 years ago

Good work :+1: Just a note for future improvements: it's very common to set up different access control for different buckets, so we will need to support creating different clients with specific configs for different buckets in the future. For example, in our production environment, we have Spark jobs that access different buckets hosted in different AWS accounts.

seddonm1 commented 2 years ago

@houqp agreed and understood. Perhaps the ObjectStoreRegistry needs something more than the URI scheme to allow for this use case?

houqp commented 2 years ago

Yeah, I can think of two different approaches to address this problem:

  1. Maintain a set of per-bucket clients internally within the S3 object store implementation.
  2. Extend the ObjectStore abstraction in DataFusion to support a hierarchy-based object store lookup, i.e. first look up an object-store-specific URI key generator by scheme, then calculate a unique object store key for the given URI for the actual object store lookup.

I am leaning towards option 1 because it doesn't force this complexity onto all object stores. For example, the local file object store will never need to dispatch to different clients based on file path. @yjshen curious what your thoughts are on this. A rough sketch of option 1 is below.
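This is only a sketch of the idea; the types and names are placeholders, not the crate's actual implementation:

    use std::collections::HashMap;
    use std::sync::{Arc, RwLock};

    // Stand-in for a fully configured S3 client (credentials, region, endpoint, ...).
    struct BucketClient;

    // The S3 object store keeps one client per bucket and falls back to a default.
    struct MultiBucketS3Store {
        default_client: Arc<BucketClient>,
        per_bucket: RwLock<HashMap<String, Arc<BucketClient>>>,
    }

    impl MultiBucketS3Store {
        // Resolve the client for the bucket encoded as the first path segment.
        fn client_for(&self, path: &str) -> Arc<BucketClient> {
            let bucket = path.trim_start_matches('/').split('/').next().unwrap_or("");
            self.per_bucket
                .read()
                .unwrap()
                .get(bucket)
                .cloned()
                .unwrap_or_else(|| Arc::clone(&self.default_client))
        }
    }

    fn main() {
        let store = MultiBucketS3Store {
            default_client: Arc::new(BucketClient),
            per_bucket: RwLock::new(HashMap::new()),
        };
        // Falls back to the default client because "bucket_one" has no override.
        let _client = store.client_for("bucket_one/data/alltypes_plain.parquet");
    }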

matthewmturner commented 2 years ago

@houqp @seddonm1 for my own understanding, can you provide more detail on how access control is handled with configs? From my experience, I've controlled access to buckets via AWS IAM policies, and the attendant access_key / secret_key are linked to that. Are there other credential providers where that isn't the case? Or cases where it's not that straightforward with IAM and access / secret keys?

houqp commented 2 years ago

IAM policies attached to IAM users (via access/secret key) are easier to get started with. For a more secure and production-ready setup, you would want to use an IAM role instead of IAM users so there are no long-lived secrets. The place where things get complicated is cross-account S3 write access. In order to do this, you need to assume an IAM role in the S3 bucket owner's account to perform the write, otherwise the bucket owner account won't truly own the newly written objects. The result of that is the bucket owner won't be able to further share the objects with other accounts. In short, in some cases the object store needs to assume and switch to different IAM roles depending on which bucket it is writing to. For cross-account S3 reads we don't have this problem, so you can usually get by with a single IAM role.
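For illustration only, one way such per-bucket auth configuration could be modelled (the field names and the ARN below are made up, not an existing API):

    use std::collections::HashMap;

    // Auth settings that can differ per bucket.
    struct BucketAuth {
        // Role to assume before writing to this bucket, e.g. a role owned by
        // the bucket owner's AWS account (the ARN below is a placeholder).
        assume_role_arn: Option<String>,
        region: Option<String>,
    }

    struct S3AuthConfig {
        default_auth: BucketAuth,
        per_bucket: HashMap<String, BucketAuth>,
    }

    fn main() {
        let mut per_bucket = HashMap::new();
        per_bucket.insert(
            "cross-account-bucket".to_string(),
            BucketAuth {
                assume_role_arn: Some("arn:aws:iam::111122223333:role/writer".to_string()),
                region: Some("us-east-1".to_string()),
            },
        );
        let _config = S3AuthConfig {
            default_auth: BucketAuth { assume_role_arn: None, region: None },
            per_bucket,
        };
    }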

matthewmturner commented 2 years ago

@houqp makes sense, thanks for the explanation!