datafusion-contrib / datafusion-objectstore-s3

S3 as an ObjectStore for DataFusion
Apache License 2.0
59 stars 13 forks source link

Add load_from_env #14

Closed matthewmturner closed 2 years ago

houqp commented 2 years ago

So the official SDK's default credential chain look up is not putting env provided credentials at a higher priority? I remember this used to be the case for other languages :(

seddonm1 commented 2 years ago

Yes, I have played with the java SDK a bit and it's all a bit painful.

@matthewmturner maybe you create an enum (instead of these two current options) that can be passed to new_client so other auth could be added and is more explicit than my original approach?

matthewmturner commented 2 years ago

@houqp agree - i was a bit surprised as well.

@seddonm1 sure will do. i was actually just thinking of how to refactor that because of the clippy error.

for my info, what other types of auth have you seen besides that?

matthewmturner commented 2 years ago

@seddonm1 separately, minio seems to be taking a very long time in CI. I havent had a chance to look into it yet, but does anything seem off to you?

seddonm1 commented 2 years ago

I don't have access to my PC but the web identity token is very useful. See: https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/credentials.html

matthewmturner commented 2 years ago

I don't have access to my PC but the web identity token is very useful. See: https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/credentials.html

thx :) ive been living in my environment variable bubble.

seddonm1 commented 2 years ago

@matthewmturner those tokens are really easy to mount with AWS managed kubernetes so extremely useful

matthewmturner commented 2 years ago

@seddonm1 ok, i will look into that once i get the desired structure in place

matthewmturner commented 2 years ago

@seddonm1 @houqp FYI i added a new Authorizer enum and refactored a bit. would be great to get your thoughts.

seddonm1 commented 2 years ago

@matthewmturner I think this is on the right track. I think the enums should align with the AWS terms so I think Http may be the incorrect name. So I think the S3Context needs the enum embedded for credentials provider. I'm away this week so can help in about a week if that's ok?

matthewmturner commented 2 years ago

@matthewmturner I think this is on the right track. I think the enums should align with the AWS terms so I think Http may be the incorrect name. So I think the S3Context needs the enum embedded for credentials provider. I'm away this week so can help in about a week if that's ok?

aha i had a feeling it would change. i thought the S3Context might need the Authorizer as well, but it didnt seem necessary yet. what did you have in mind?

and for sure - no rush. any help is appreciated.

id like to figure out whats going on with minio in CI so that will probably be my next focus.

seddonm1 commented 2 years ago

Hi, sorry have had a chance to look at this again.

I think for now we could remove the minio tests and the current credentials builder and just use load_from_env() which eventually uses the default credentials provider list (not just env (environment variables): https://github.com/awslabs/aws-sdk-rust/blob/main/sdk/aws-config/src/default_provider.rs#L570

I will have a look at seeing how to best pass in a Credentials Provider (https://github.com/awslabs/aws-sdk-rust/blob/main/sdk/aws-config/src/lib.rs#L224) so we can reinstate the minio source.

I think this will make this meet the use case for most people and then I can work on this on the side.

seddonm1 commented 2 years ago

actually I think it should be like this. this allows taking the default values but overriding if we want to (for tests/non s3 source)

async fn new_client(
    credentials_provider: Option<SharedCredentialsProvider>,
    region: Option<Region>,
    endpoint: Option<Endpoint>,
    retry_config: Option<RetryConfig>,
    sleep: Option<Arc<dyn AsyncSleep>>,
    timeout_config: Option<TimeoutConfig>,
) -> aws_sdk_s3::Client {
    let config = aws_config::load_from_env().await;
    let mut config_builder = aws_sdk_s3::config::Builder::from(&config);

    if let Some(credentials_provider) = credentials_provider {
        config_builder = config_builder.credentials_provider(credentials_provider);
    }

    if let Some(region) = region {
        config_builder = config_builder.region(region);
    }

    if let Some(endpoint) = endpoint {
        config_builder = config_builder.endpoint_resolver(endpoint);
    }

    if let Some(retry_config) = retry_config {
        config_builder = config_builder.retry_config(retry_config);
    }

    if let Some(sleep) = sleep {
        config_builder = config_builder.sleep_impl(sleep);
    }

    if let Some(timeout_config) = timeout_config {
        config_builder = config_builder.timeout_config(timeout_config);
    };

    let config = config_builder.build();
    aws_sdk_s3::Client::from_conf(config)
}
matthewmturner commented 2 years ago

@seddonm1 Ok will check it out - thanks. Do you think CI isn't working because of credentials? I thought we were using dummy credentials and they would work locally and on GHA.

seddonm1 commented 2 years ago

@matthewmturner I can work on implementing this if its ok and ill fix CI at the same time.

matthewmturner commented 2 years ago

@seddonm1 Sounds good. I just got external errors enabled on datafusion side so I'll work on that.

seddonm1 commented 2 years ago

i have created https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/17 which will use the load_from_env() method by default but allow users to override individual builder arguments if desired. I think this PR is redundant if that one is merged.