duckdb / duckdb_aws


Issue #16 is not solved by pull request #23; support AWS_WEB_IDENTITY_TOKEN_FILE as credential provider on EKS #31

Open joegoggins opened 4 months ago

joegoggins commented 4 months ago

https://github.com/duckdb/duckdb_aws/issues/16 was marked closed on Jan 2 as part of https://github.com/duckdb/duckdb_aws/pull/23 being merged. Unfortunately, the solution described in the PR body did not work, i.e. #16 does not seem fixed to me. I didn't see any of the authors/commenters of that issue confirm the fix either, so I figured I'd file a bug report here.

Repro

Re-read #16. The problem/repro described here is the same, but with more technical detail that I hope helps to solve the underlying problem.

Open a pod shell on EKS that defines the following vars:

AWS_DEFAULT_REGION=us-east-1
AWS_REGION=us-east-1
AWS_ROLE_ARN=arn:aws:iam::123456789:role/example-role
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
AWS_STS_REGIONAL_ENDPOINTS=regional

Run aws s3 ls s3://example-bucket/example.parquet, observe successful output like this:

2024-02-16 01:02:32    4941078 example.parquet

This proves AWS CLI / pod perms are good.
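For what it's worth, boto3 inside the same pod resolves these env vars too, so a quick sketch like the following (bucket/key are the same placeholders as above, not real names) should also print the assumed example-role and the object metadata:

import boto3

# boto3's default credential chain picks up AWS_ROLE_ARN / AWS_WEB_IDENTITY_TOKEN_FILE
session = boto3.Session()
print(boto3.client("sts").get_caller_identity()["Arn"])  # should show the assumed example-role
print(session.client("s3").head_object(Bucket="example-bucket", Key="example.parquet")["ContentLength"])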

Run ./duckdb -unredacted -json to open the duckdb console:

Create a secret:

CREATE SECRET (
    TYPE S3,
    PROVIDER credential_chain
);

It takes a few seconds here and comes back successful.

Run select secret_string from duckdb_secrets(redact=false); Observe that the expected data is there -- key_id, secret, and session_token are set with values that look legit.

Run a query against the s3 bucket: SELECT * FROM read_parquet('s3://example-bucket/example.parquet');

Observe an error like this:

Error: IO Error: Connection error for HTTP HEAD to 'https://example-bucket./example.parquet'

Note: there is no 403 error here. It looks like the endpoint is somehow not getting set correctly -- that https endpoint is clearly not valid. I tried setting ENDPOINT to 's3.amazonaws.com' in the CREATE SECRET command, but then I get an error like this:

Error: HTTP Error: HTTP GET error on 'https://example-bucket.s3.amazonaws.com/example.parquet' (HTTP 403)
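To illustrate what I mean (a purely hypothetical sketch, not the extension's actual code): the malformed host above is exactly what you would get if an empty endpoint string were interpolated into a virtual-hosted-style URL:

def vhost_url(bucket, endpoint, key):
    # virtual-hosted-style S3 URL: https://<bucket>.<endpoint>/<key>
    return f"https://{bucket}.{endpoint}/{key}"

print(vhost_url("example-bucket", "", "example.parquet"))
# https://example-bucket./example.parquet  <- matches the error above
print(vhost_url("example-bucket", "s3.us-east-1.amazonaws.com", "example.parquet"))
# https://example-bucket.s3.us-east-1.amazonaws.com/example.parquet  <- what I'd expect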

I've tried lots of different permutations to get around this problem, to no avail. Here is a list of other things I've tried/learned:

The problem happens using duckdb@0.10.0 in python as well. Similar repro, same error:

import duckdb
conn = duckdb.connect()
conn.execute("INSTALL aws;")
conn.execute("LOAD aws;")
conn.execute("INSTALL httpfs;")
conn.execute("LOAD httpfs;")

s3_access_query = """
CREATE SECRET (
    TYPE S3,
    PROVIDER credential_chain
);
"""
conn.execute(s3_access_query)
query = "SELECT * FROM read_parquet('s3://example-bucket/example.parquet');"
print(conn.execute(query).fetchall())

Using CHAIN 'config;env', CHAIN 'sts;instance', or other variants doesn't fix the problem (full statements sketched just below).
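For completeness, these are the kinds of statements I mean (a sketch; chain names as documented for the credential_chain provider, secret names are placeholders):

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL aws; LOAD aws; INSTALL httpfs; LOAD httpfs;")
# each variant creates its own secret; none of them changed the outcome for me
for i, chain in enumerate(("config;env", "sts;instance", "env", "sts")):
    conn.execute(f"""
        CREATE SECRET chain_test_{i} (
            TYPE S3,
            PROVIDER credential_chain,
            CHAIN '{chain}'
        );
    """)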

joegoggins commented 4 months ago

@cpaika @j-hartshorn @osalloum I was wondering if you had a chance to validate the #16 fix from PR #23?

Do you experience the same issues I'm seeing here or did your problems go away? If you were able to get things working, could you share the code you used to make it work?

duckdb authors: Let me know if you need any additional info to help troubleshoot.

samansmink commented 4 months ago

Hey @joegoggins, thanks for opening the issue. I haven't really had time to investigate this properly, so this is very helpful.

Setting up some test infrastructure with AWS that can be used by the aws extension CI to properly test the credential chain provider is definitely on my TODO list.

joegoggins commented 4 months ago

Thanks for the quick reply @samansmink. I'll provide a few more ideas that might allow you to repro/fix this issue quickly. On our side, it is high priority that we solve this, so let me know if you want additional details/want to jump on a call to help/etc:

  1. The same program that uses duckdb also uses awswrangler, which handles AWS permissions correctly; I'm able to access S3 that way but not with duckdb. Perhaps you could analyze how that module integrates with the AWS SDK and identify the root problem/solution.

  2. My guess at the cause of the bug within duckdb is that the downstream code dealing with the outcome of AWS_STS_REGIONAL_ENDPOINTS=regional isn't working correctly. It seems like there could be an interpolation error where the regional endpoint is empty, so when it gets interpolated into the S3 endpoint the URL comes out wrong, i.e. https://example-bucket./example.parquet, when I'd expect something like https://example-bucket.s3.us-east-1.amazonaws.com/example.parquet

  3. Regarding "Setting up some test infrastructure with AWS that can be used by the aws extension CI to properly test the credential chain provider is definitely on my TODO list": before tinkering with CI, I'd recommend trying a scrappy approach to get the basic repro down and to get your head around how the eks.amazonaws.com/role-arn annotation works. A path could be:

piavka commented 4 months ago

I wanted to add that I'm getting the same error, and the S3 https endpoint is not set correctly either (e.g. 'https://example-bucket./example.parquet'), but in my case I'm using aws sso login based credentials set up with:

LOAD httpfs;
LOAD aws;

CREATE SECRET (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN,
    CHAIN 'sso'
);

From FROM duckdb_secrets(); I see the credentials are set correctly for sso:

name=__default_s3;type=s3;provider=credential_chain;serializable=true;scope=s3://,s3n://,s3a://;key_id=ASIAQO6NX2IJCX2TWFB7;region=us-east-1;secret=redacted;session_token=redacted

grounded042 commented 4 months ago

I am also facing this issue. As suggested, I tried using the regional endpoint, but that also does not work. Apologies for the table output being wide.

PRAGMA version;
┌─────────────────┬────────────┐
│ library_version │ source_id  │
│     varchar     │  varchar   │
├─────────────────┼────────────┤
│ v0.10.0         │ 20b1486d11 │
└─────────────────┴────────────┘
CREATE SECRET test1 (TYPE S3, PROVIDER CREDENTIAL_CHAIN, ENDPOINT 's3.us-east-1.amazonaws.com');
100% ▕████████████████████████████████████████████████████████████▏
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ true    │
└─────────┘

FROM duckdb_secrets();
┌─────────┬─────────┬──────────────────┬────────────┬─────────┬─────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│  name   │  type   │     provider     │ persistent │ storage │          scope          │                                                                 secret_string                                                                 │
│ varchar │ varchar │     varchar      │  boolean   │ varchar │        varchar[]        │                                                                    varchar                                                                    │
├─────────┼─────────┼──────────────────┼────────────┼─────────┼─────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ test1   │ s3      │ credential_chain │ false      │ memory  │ [s3://, s3n://, s3a://] │ name=test1;type=s3;provider=credential_chain;serializable=true;scope=s3://,s3n://,s3a://;endpoint=s3.us-east-1.amazonaws.com;region=us-east-1 │
└─────────┴─────────┴──────────────────┴────────────┴─────────┴─────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

I know this endpoint is correct because aws s3api get-object works with it:

aws s3api get-object --bucket REDACTED --key lu18fao0mekpszrx.json.gz testing.json.gz --debug
...
2024-03-04 16:12:55,197 - MainThread - urllib3.connectionpool - DEBUG - https://REDACTED.s3.us-east-1.amazonaws.com:443 "GET /lu18fao0mekpszrx.json.gz HTTP/1.1" 200 327616
...

While duckdb fails:

SELECT * FROM read_json('s3://REDACTED/lu18fao0mekpszrx.json.gz', format = 'newline_delimited', columns = {Testing: 'STRING'});
Error: HTTP Error: HTTP GET error on 'https://REDACTED.s3.us-east-1.amazonaws.com/lu18fao0mekpszrx.json.gz' (HTTP 403)

duarteocarmo commented 2 months ago

Also something I'm seeing!

Instead of pulling directly from AWS like:

    cursor.execute(
        f"""
        INSTALL aws;
        INSTALL httpfs;
        LOAD aws;
        LOAD httpfs;
        CREATE SECRET secret3 (
            TYPE S3,
            PROVIDER CREDENTIAL_CHAIN,
           -- tried all different options here
        );

        CREATE OR REPLACE TABLE country_data AS
        SELECT *
        FROM parquet_scan('{folder}/something=*/*.parquet', filename=true);
        """
    )
    logger.info("Data loaded into DuckDB.")

We were seeing a lot of 403 errors (we run our pods on EKS).

My hacked alternative is the following (but it does not perform as well as the first approach in my tests):

# Imports/context added for completeness; SESSION, S3_BUCKET, cursor, and the S3 prefix
# come from our application setup, so the values below are placeholders.
import logging
import os

import awswrangler as wr
import boto3
import duckdb

logger = logging.getLogger(__name__)
SESSION = boto3.Session()      # EKS web-identity credentials resolve fine through boto3
S3_BUCKET = "example-bucket"   # placeholder
something = "example-prefix"   # placeholder
cursor = duckdb.connect().cursor()


def load_data() -> None:
    """Load data from S3 into DuckDB."""
    folder = f"s3://{S3_BUCKET}/{something}"
    data_file = "data.parquet"
    if os.path.exists(data_file):
        os.remove(data_file)
        logger.info("Removed existing data file.")

    logger.info(f"Loading parquet data from {folder}.")
    df = wr.s3.read_parquet(path=folder, boto3_session=SESSION, dataset=True)
    logger.info("Data loaded.")

    df.to_parquet(data_file, index=False)
    del df  # noqa: F821 Remove df to free up memory
    logger.info("Data saved to parquet file.")

    cursor.execute(
        f"""
        CREATE OR REPLACE TABLE country_data AS
        SELECT *
        FROM read_parquet('{data_file}', filename=true);
        """
    )
    logger.info("Data loaded into DuckDB.")

    os.remove(data_file)
    logger.info("Data file removed.")

    logger.info("Dataset loaded successfully.")

So it would be very cool to fix this!

DanCardin commented 1 month ago

I don't know if it's relevant, but I used FROM load_aws_credentials(redact_secret=false) to steal the keys from the loaded secret and ran aws sts get-caller-identity against them.

This yielded "Arn": "arn:aws:sts::<numbers>:assumed-role/...-k8s-worker-node-role/...". Whereas awscli with the vanilla AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE env vars yielded "arn:aws:sts::<numbers>:assumed-role/<the role name i actually want>-role/...".

This makes it seem like it's ignoring AWS_ROLE_ARN or otherwise assuming the wrong role.
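A sketch of that diagnostic, in case it helps someone reproduce it (boto3 is only used here to ask STS whose credentials these are; column order assumed to be access key, secret, session token, region, matching the load_aws_credentials output shown later in this thread):

import boto3
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL aws; LOAD aws; INSTALL httpfs; LOAD httpfs;")
key_id, secret, token, region = conn.execute(
    "FROM load_aws_credentials(redact_secret=false);"
).fetchone()

sts = boto3.client(
    "sts",
    aws_access_key_id=key_id,
    aws_secret_access_key=secret,
    aws_session_token=token,
)
# on EKS this printed the worker-node role instead of the AWS_ROLE_ARN role
print(sts.get_caller_identity()["Arn"])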


What I find weird is that it all works fine locally, if I steal the AWS_WEB_IDENTITY_TOKEN_FILE and AWS_ROLE_ARN from the pod and attempt the same duckdb call with CHAIN 'sts'.


EDIT: Also, my current workaround is as such (since boto3 is transparently assuming the right role already):

import boto3
import duckdb

aws_session = boto3.Session()
creds = aws_session.get_credentials().get_frozen_credentials()

db = duckdb.connect()
db.execute(
    f"""
    CREATE SECRET aws_secret (
        TYPE S3,
        REGION '{aws_session.region_name}',
        KEY_ID '{creds.access_key}',
        SECRET '{creds.secret_key}',
        SESSION_TOKEN '{creds.token}'
    )
    """
)
osalloum commented 1 month ago

Also, this is still not working for me; the credential chain is seemingly ignoring the "sts" authentication method.

Maybe if we get a debug build with verbose logging enabled (AWS_LOGSTREAM_DEBUG), that could help us better understand which methods it is trying and why it is failing.

Furthermore, I don't understand this part: we can create many secrets of type S3, but which one would the extension use? I haven't found a parameter I can pass to load_aws_credentials to reference a secret :( (see the sketch after the statements below)

CREATE SECRET secret2 (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN
);
CREATE SECRET secret3 (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN,
    CHAIN 'env'
);
CREATE SECRET secret4 (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN,
    CHAIN 'sts'
);
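My current understanding from the secrets documentation (which I haven't fully verified): each secret carries a SCOPE, and the secret whose scope is the longest prefix of the queried path is the one used, so scoping per bucket is one way to control which secret applies. A sketch with a hypothetical bucket name:

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL aws; LOAD aws; INSTALL httpfs; LOAD httpfs;")
conn.execute("""
    CREATE SECRET scoped_sts_secret (
        TYPE S3,
        PROVIDER CREDENTIAL_CHAIN,
        CHAIN 'sts',
        SCOPE 's3://example-bucket'
    );
""")
# queries against s3://example-bucket/... should now resolve through scoped_sts_secret
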
nikcollins commented 4 weeks ago

This is a big problem for us. We have a pretty strict no-static-credentials policy, and static credentials seem to be the only thing that works in EKS.

lukeman commented 4 weeks ago

As of the DuckDB 1.0 release I'm still not seeing support for temporary credentials retrieved via STS AssumeRole. While default auth will produce a key/token, trying to auth using a config profile that sets a role_arn or setting AWS_ROLE_ARN in your environment always fails to retrieve assumed/temporary credentials. This is the case both when using the legacy load_aws_credentials and when creating SECRETs with default and custom chains.

The only way I've been able to get things working in a pinch is to run aws configure export-credentials --profile my-profile --format env, which returns a key+secret+token as you'd expect. Once those temporary credentials are set either in your env or inside of DuckDB, things work as expected (until the credentials expire, of course).
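A sketch of how that stopgap can be wired into Python (the profile name is a placeholder, and the export-credentials output has to be loaded into the environment first, e.g. via eval in the shell):

# shell: eval "$(aws configure export-credentials --profile my-profile --format env)"
import os
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL httpfs; LOAD httpfs;")
conn.execute(f"""
    CREATE SECRET exported_creds (
        TYPE S3,
        KEY_ID '{os.environ['AWS_ACCESS_KEY_ID']}',
        SECRET '{os.environ['AWS_SECRET_ACCESS_KEY']}',
        SESSION_TOKEN '{os.environ['AWS_SESSION_TOKEN']}',
        REGION '{os.environ.get('AWS_REGION', 'us-east-1')}'
    );
""")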

osalloum commented 4 weeks ago

I might have a clue about why this is not working yet.

I think the STS module of the AWS SDK needs to be statically linked via CMake/vcpkg.

The current build does not specify STS as a dependency, hence the web identity token file logic never gets checked: https://github.com/duckdb/duckdb_aws/blob/42c78d3f99e1a188a2b178ea59e3c17907af4fb2/CMakeLists.txt#L19

I am guessing this is a bit similar to the AWS SDK in Java, where if it does not find STS on the CLASSPATH, it won't attempt that method.

-- Update 1 --

I just validated that the current build does not link STS. If you check the GitHub Actions logs here: https://github.com/duckdb/duckdb_aws/actions/runs/9126003286/job/25093373083#step:11:42 and search for aws-sdk-cpp, you will find: Installing 19/19 aws-sdk-cpp[core,dynamodb,kinesis,s3]:x64-linux@1.11.225...

-- Update 2 --

I was able to add STS and build the AWS STS library, but now I need to build duckdb with the same build tag to validate the built extension :(

https://github.com/osalloum/duckdb_aws/pull/1 https://github.com/osalloum/duckdb_aws/actions/runs/9356957119?pr=1#artifacts

D LOAD './aws.duckdb_extension';
Invalid Input Error: Failed to load './aws.duckdb_extension', The file was built for DuckDB version 'aaa2b5a18b', but we can only load extensions built for DuckDB version 'v0.10.3'.

I will try to pick this up later this week.

lukeman commented 4 weeks ago

@osalloum I set up a local duckdb build yesterday, hoping to be able to debug things (my experience mirrors yours of enabling profile authentication with the Java SDK). I didn't get far with that plan, but I was just able to toss your branch artifact into ~/.duckdb/extensions and load it.

At least in my testing, this build doesn't appear to fix things right away (both with the below approach and with creating a SECRET). YMMV, of course, as I haven't touched C++ development in 20 years.

D load aws;
100% ▕████████████████████████████████████████████████████████████▏ 
D SELECT extension_name, extension_version, install_mode FROM duckdb_extensions() where extension_version != '';
┌────────────────┬───────────────────┬───────────────────┐
│ extension_name │ extension_version │   install_mode    │
│    varchar     │      varchar      │      varchar      │
├────────────────┼───────────────────┼───────────────────┤
│ aws            │ c5beeb9           │ UNKNOWN           │
│ httpfs         │ aaa2b5a18b        │ STATICALLY_LINKED │
│ parquet        │ aaa2b5a18b        │ STATICALLY_LINKED │
└────────────────┴───────────────────┴───────────────────┘
D CALL load_aws_credentials('my-profile');
100% ▕████████████████████████████████████████████████████████████▏ 
┌──────────────────────┬──────────────────────────┬──────────────────────┬───────────────┐
│ loaded_access_key_id │ loaded_secret_access_key │ loaded_session_token │ loaded_region │
│       varchar        │         varchar          │       varchar        │    varchar    │
├──────────────────────┼──────────────────────────┼──────────────────────┼───────────────┤
│                      │                          │                      │ us-east-1     │
└──────────────────────┴──────────────────────────┴──────────────────────┴───────────────┘

Hoping to poke around more once I wrap my head around how to set up a decent out of tree extension dev workflow.

osalloum commented 4 weeks ago

@lukeman Good news

I was not able to set up a local dev env on Mac, at least not as quickly as I would like (I would love to hear some tips on that).

However, I forked the duckdb repo and used it to build a version without extension version checks: https://github.com/osalloum/duckdb/actions/runs/9359759025?pr=2#artifacts

Then I uploaded my artifacts to a Kubernetes pod on our account and tried it, and it works. I just used the default call, which should follow the default credential chain priority: env variables, then web identity, then container credentials, then instance profile (there might be more options, like program options first, for Java).

I then validated the credentials on my local machine and they indeed map to the correct federated role (screenshots: loading_extensions, credentials_masked).

samansmink commented 2 weeks ago

The PR attempting to resolve this is now merged and available via:

force install aws from core_nightly

Please let me know if this still fails.

osalloum commented 2 weeks ago

FYI

This still somehow does not work with all Docker images. If you use Alpine or Slim images, it doesn't work, but if you use an Amazon Linux based image it does. As examples:

amazoncorretto:21 --> works
eclipse-temurin:21 --> does not work
public.ecr.aws/lambda/python:3.10 --> works

etc.

yinzhs commented 1 week ago

Thanks. It worked for me using duckdb v1.0.0 on Linux, and the pre-built/released linux_amd64_gcc4 httpfs and aws extensions.

In the AWS k8s pod environment I had, I saw these environment variables: AWS_REGION, AWS_WEB_IDENTITY_TOKEN_FILE, AWS_ROLE_ARN, AWS_STS_REGIONAL_ENDPOINTS, AWS_DEFAULT_REGION; but there was no AWS_ACCESS_KEY_ID, AWS_SESSION_TOKEN, or AWS_SECRET_ACCESS_KEY set, nor ~/.aws/credentials and ~/.aws/config, which I think is by design/convention per policy.

Here is the sample code snippet that worked for me against AWS S3 and localstack, which fixed HTTP 403 (and, along the way, HTTP 400, "IO Error: Connection error for HTTP GET to ...", and "Unknown error for HTTP GET to ...").

    db.query(f"SET s3_region = '{s3_args['region_name']}';")  # always present
    if s3_args.get('aws_access_key_id') and s3_args.get('aws_secret_access_key'):
        # static config, e.g. for localstack. but may be unavailable in aws k8s
        db.query(f"SET s3_access_key_id = '{s3_args['aws_access_key_id']}'; "
                 f"SET s3_secret_access_key = '{s3_args['aws_secret_access_key']}'; ")
    if s3_args.get('endpoint_url'):
        db.query(f"SET s3_endpoint = '{s3_args['endpoint_url'].split('://')[1]}'; "
                 f"SET s3_use_ssl = {str(s3_args['endpoint_url'].startswith('https:')).lower()};")
    else:
        db.query(f"SET s3_endpoint = 's3.{s3_args['region_name']}.amazonaws.com'; "
                 f"SET s3_use_ssl = true; ")
    db.query("SET s3_url_style = 'path';")  # path vs virtual_hosted, re: endpoint url  # HTTP 400

    if not s3_args.get('aws_access_key_id'):
        # this helped to fix the httpfs HTTP 403 error without access key/secret in aws,
        # which seemed to have something to do with sts assumed role in aws eks.
        db.query("LOAD aws; CREATE SECRET secret3 (TYPE S3, PROVIDER CREDENTIAL_CHAIN);")
        # log.debug('*** redacted secret: %s', db.execute('select secret_string from duckdb_secrets(redact=true);').fetchall())
        #
        # alternative way to get key and secret, not needed as the above worked
        # creds = boto3.Session().get_credentials().get_frozen_credentials()

Hope this feedback is helpful to someone. Let me know if you have comments or suggestions. Thanks.

enote-kane commented 1 week ago

We are running the latest DuckDB 1.0.0 inside Windmill scripts (usually Python), and the only workaround that made it work for us (after verifying that DuckDB sees all necessary environment variables) is the one found by @DanCardin.

We are also running Windmill inside EKS with AWS IAM service accounts providing the necessary auth, so the following environment variables are set properly and are working as intended with other SDKs on the exact same pod inside the same container: