joegoggins opened 4 months ago
@cpaika @j-hartshorn @osalloum I was wondering if you had a chance to validate the #16 fix from PR #23?
Do you experience the same issues I'm seeing here or did your problems go away? If you were able to get things working, could you share the code you used to make it work?
duckdb authors: Let me know if you need any additional info to help troubleshoot.
Hey @joegoggins, thanks for opening the issue. I haven't really had time to investigate this properly, so this is very helpful.
Setting up some test infrastructure with AWS that can be used by the aws extension CI to properly test the credential chain provider is definitely on my TODO list.
Thanks for the quick reply @samansmink. I'll provide a few more ideas that might let you repro/fix this issue quickly. On our side, solving this is high priority, so let me know if you want additional details, want to jump on a call to help, etc.:
The same program that uses duckdb also uses awswrangler, which handles AWS permissions correctly; I'm able to access S3 that way but not with duckdb. Perhaps you could analyze how that module integrates with the AWS SDK and identify the root problem/solution.
My guess at the cause of the bug within duckdb is that the downstream code that deals with the outcome of AWS_STS_REGIONAL_ENDPOINTS=regional isn't working correctly. It seems like there could be an interpolation error where the regional endpoint is empty, so when it gets interpolated into the endpoint used to hit S3, the result is wrong, i.e. https://example-bucket./example.parquet, when I'd expect something like https://example-bucket.s3.us-east-1.amazonaws.com/example.parquet
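That suspected failure mode can be sketched in a few lines. To be clear, build_s3_url below is a hypothetical stand-in for whatever duckdb does internally when it assembles the virtual-hosted URL, not its actual code:

```python
# Hypothetical sketch of the suspected bug: if endpoint/region resolution
# silently yields an empty string, the virtual-hosted URL template collapses
# into the invalid host name seen in the error message.
def build_s3_url(bucket: str, key: str, endpoint: str) -> str:
    # Virtual-hosted style: https://<bucket>.<endpoint>/<key>
    return f"https://{bucket}.{endpoint}/{key}"

# With a resolved regional endpoint the URL is well-formed:
ok = build_s3_url("example-bucket", "example.parquet", "s3.us-east-1.amazonaws.com")
# -> https://example-bucket.s3.us-east-1.amazonaws.com/example.parquet

# With an empty endpoint we get exactly the malformed URL reported above:
bad = build_s3_url("example-bucket", "example.parquet", "")
# -> https://example-bucket./example.parquet
```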
Setting up some test infrastructure with AWS that can be used by the aws extension CI to properly test the credential chain provider is definitely on my TODO list
Before tinkering with CI, I'd recommend trying a scrappy approach to get a basic repro down and get your head around how the eks.amazonaws.com/role-arn annotation works. A path could be:
I wanted to update that I'm getting the same error, and the S3 https endpoint is not set correctly as well, like 'https://example-bucket./example.parquet', but in my case I'm using aws sso login based credentials, set up with:
LOAD httpfs;
LOAD aws;
CREATE SECRET (
TYPE S3,
PROVIDER CREDENTIAL_CHAIN,
CHAIN 'sso'
);
Running FROM duckdb_secrets(); I see the credentials are set correctly for sso:
name=__default_s3;type=s3;provider=credential_chain;serializable=true;scope=s3://,s3n://,s3a://;key_id=ASIAQO6NX2IJCX2TWFB7;region=us-east-1;secret=redacted;session_token=redacted
I am also facing this issue. As suggested, I tried using the regional endpoint, but that also does not work. Apologies for the table output being wide.
PRAGMA version;
┌─────────────────┬────────────┐
│ library_version │ source_id │
│ varchar │ varchar │
├─────────────────┼────────────┤
│ v0.10.0 │ 20b1486d11 │
└─────────────────┴────────────┘
CREATE SECRET test1 (TYPE S3, PROVIDER CREDENTIAL_CHAIN, ENDPOINT 's3.us-east-1.amazonaws.com');
100% ▕████████████████████████████████████████████████████████████▏
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ true │
└─────────┘
FROM duckdb_secrets();
┌─────────┬─────────┬──────────────────┬────────────┬─────────┬─────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ name │ type │ provider │ persistent │ storage │ scope │ secret_string │
│ varchar │ varchar │ varchar │ boolean │ varchar │ varchar[] │ varchar │
├─────────┼─────────┼──────────────────┼────────────┼─────────┼─────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ test1 │ s3 │ credential_chain │ false │ memory │ [s3://, s3n://, s3a://] │ name=test1;type=s3;provider=credential_chain;serializable=true;scope=s3://,s3n://,s3a://;endpoint=s3.us-east-1.amazonaws.com;region=us-east-1 │
└─────────┴─────────┴──────────────────┴────────────┴─────────┴─────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
I know this is correct by using aws s3api get-object:
aws s3api get-object --bucket REDACTED --key lu18fao0mekpszrx.json.gz testing.json.gz --debug
...
2024-03-04 16:12:55,197 - MainThread - urllib3.connectionpool - DEBUG - https://REDACTED.s3.us-east-1.amazonaws.com:443 "GET /lu18fao0mekpszrx.json.gz HTTP/1.1" 200 327616
...
While duckdb fails:
SELECT * FROM read_json('s3://REDACTED/lu18fao0mekpszrx.json.gz', format = 'newline_delimited',
columns = {Testing: 'STRING'});
Error: HTTP Error: HTTP GET error on 'https://REDACTED.s3.us-east-1.amazonaws.com/lu18fao0mekpsz
rx.json.gz' (HTTP 403)
Also something I'm seeing!
Instead of pulling directly from AWS like:
cursor.execute(
    f"""
    INSTALL aws;
    INSTALL httpfs;
    LOAD aws;
    LOAD httpfs;
    CREATE SECRET secret3 (
        TYPE S3,
        PROVIDER CREDENTIAL_CHAIN,
        -- tried all different options here
    );
    CREATE OR REPLACE TABLE country_data AS
    SELECT *
    FROM parquet_scan('{folder}/something=*/*.parquet', filename=true);
    """
)
logger.info("Data loaded into DuckDB.")
We were seeing a lot of 403 errors (we run our pods on EKS).
My hacked alternative is the following (but it does not perform as well as the first, from my tests):
def load_data() -> None:
    """Load data from S3 into DuckDB."""
    folder = f"s3://{S3_BUCKET}/{something}"
    data_file = "data.parquet"
    if os.path.exists(data_file):
        os.remove(data_file)
        logger.info("Removed existing data file.")
    logger.info(f"Loading parquet data from {folder}.")
    df = wr.s3.read_parquet(path=folder, boto3_session=SESSION, dataset=True)
    logger.info("Data loaded.")
    df.to_parquet(data_file, index=False)
    del df  # noqa: F821 Remove df to free up memory
    logger.info("Data saved to parquet file.")
    cursor.execute(
        f"""
        CREATE OR REPLACE TABLE country_data AS
        SELECT *
        FROM read_parquet('{data_file}', filename=true);
        """
    )
    logger.info("Data loaded into DuckDB.")
    os.remove(data_file)
    logger.info("Data file removed.")
    logger.info("Dataset loaded successfully.")
So it would be very cool to fix this!
I don't know if it's relevant, but I used FROM load_aws_credentials(redact_secret=false) to steal the keys from the loaded secret and ran aws sts get-caller-identity against them.
This yielded "Arn": "arn:aws:sts::<numbers>:assumed-role/...-k8s-worker-node-role/...", whereas awscli with the vanilla AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE env vars yielded "arn:aws:sts::<numbers>:assumed-role/<the role name i actually want>-role/...".
That makes it seem like it's ignoring AWS_ROLE_ARN or otherwise assuming the wrong role.
What I find weird is that it all works fine locally if I steal the AWS_WEB_IDENTITY_TOKEN_FILE and AWS_ROLE_ARN from the pod and attempt the same duckdb call with CHAIN 'sts'.
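The check described above (comparing which role each credential set actually maps to) boils down to plain ARN parsing. The ARNs below are illustrative placeholders, not real account values:

```python
# Extract the role name from an STS assumed-role ARN of the form
# arn:aws:sts::<account>:assumed-role/<role-name>/<session-name>
def assumed_role_name(sts_arn: str) -> str:
    return sts_arn.split(":assumed-role/")[1].split("/")[0]

# duckdb's chain appears to land on the EKS node role...
duckdb_arn = "arn:aws:sts::123456789012:assumed-role/my-k8s-worker-node-role/i-0abc123"
# ...while the web-identity env vars map to the intended app role.
wanted_arn = "arn:aws:sts::123456789012:assumed-role/my-app-role/botocore-session"

assumed_role_name(duckdb_arn)  # -> 'my-k8s-worker-node-role'
assumed_role_name(wanted_arn)  # -> 'my-app-role'
```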
EDIT: Also, my current workaround is as such (since boto3 is transparently assuming the right role already):
import boto3
import duckdb

aws_session = boto3.Session()
creds = aws_session.get_credentials().get_frozen_credentials()
db = duckdb.connect()
db.execute(
    f"""
    CREATE SECRET aws_secret (
        TYPE S3,
        REGION '{aws_session.region_name}',
        KEY_ID '{creds.access_key}',
        SECRET '{creds.secret_key}',
        SESSION_TOKEN '{creds.token}'
    )
    """
)
Also still not working for me; the credential chain is seemingly ignoring the "sts" authentication method.
Maybe a debug build with verbose logging enabled (AWS_LOGSTREAM_DEBUG) could help us better understand which methods it is trying and why it is failing.
Furthermore, I don't understand this part: we can create many secrets of type S3, but which one would the extension use? I haven't found a parameter I can pass to load_aws_credentials to reference a secret :(
CREATE SECRET secret2 (
TYPE S3,
PROVIDER CREDENTIAL_CHAIN
);
CREATE SECRET secret3 (
TYPE S3,
PROVIDER CREDENTIAL_CHAIN,
CHAIN 'env'
);
CREATE SECRET secret4 (
TYPE S3,
PROVIDER CREDENTIAL_CHAIN,
CHAIN 'sts'
);
This is a big problem for us. We have a pretty strict no static credentials policy and that seems to be the only thing that works in EKS.
As of the DuckDB 1.0 release I'm still not seeing support for temporary credentials retrieved via STS AssumeRole. While default auth will produce a key/token, trying to auth using a config profile that sets a role_arn, or setting AWS_ROLE_ARN in your environment, always fails to retrieve assumed/temporary credentials. This is the case both when using the legacy load_aws_credentials and when creating SECRETs with default and custom chains.
The only way I've been able to get things working in a pinch is to run aws configure export-credentials --profile my-profile --format env, which returns a key+secret+token as you'd expect. Once those temporary credentials are set either in your env or inside of DuckDB, it works as expected (until the credentials expire, of course).
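A small helper can turn that workaround into something scriptable by parsing the --format env output before feeding it into a secret. The parse_env_exports name and the sample values below are illustrative:

```python
# Parse `aws configure export-credentials --profile my-profile --format env`
# output, which prints lines like `export AWS_ACCESS_KEY_ID=...`.
def parse_env_exports(text: str) -> dict:
    creds = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("export ") and "=" in line:
            key, _, value = line[len("export "):].partition("=")
            creds[key] = value
    return creds

# Illustrative sample of the command's output:
sample = """\
export AWS_ACCESS_KEY_ID=ASIAEXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrEXAMPLEKEY
export AWS_SESSION_TOKEN=FwoGEXAMPLETOKEN
"""
creds = parse_env_exports(sample)
# creds["AWS_ACCESS_KEY_ID"] -> 'ASIAEXAMPLE'
```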
I might have a clue about why this is not working yet.
I think the sts module of the AWS SDK should be statically linked by CMake/vcpkg. The current build does not specify STS as a dependency, hence the web identity token file logic never gets checked: https://github.com/duckdb/duckdb_aws/blob/42c78d3f99e1a188a2b178ea59e3c17907af4fb2/CMakeLists.txt#L19
I am guessing this is a bit similar to the AWS SDK in Java, where if it does not find sts on the CLASSPATH, it won't attempt it.
-- Update 1 --
I just validated that the current build does not link sts. If you check the logs from GitHub Actions here:
https://github.com/duckdb/duckdb_aws/actions/runs/9126003286/job/25093373083#step:11:42
and search for aws-sdk-cpp, you will find Installing 19/19 aws-sdk-cpp[core,dynamodb,kinesis,s3]:x64-linux@1.11.225...
-- Update 2 --
I was able to add sts and build the aws sts library, but now I need to build duckdb with the same build tag to validate the built extension :(
https://github.com/osalloum/duckdb_aws/pull/1 https://github.com/osalloum/duckdb_aws/actions/runs/9356957119?pr=1#artifacts
D LOAD './aws.duckdb_extension';
Invalid Input Error: Failed to load './aws.duckdb_extension', The file was built for DuckDB version 'aaa2b5a18b', but we can only load extensions built for DuckDB version 'v0.10.3'.
I will try to pick this up later this week.
@osalloum I set up a duckdb local build yesterday as I was hoping to be able to debug things (my experience mirrors yours in enabling Profile authentication with the Java SDK). I didn't get far with that plan, but I was able to toss your branch artifact into ~/.duckdb/extensions and load it.
At least in my testing this build doesn't appear to fix things right away (both with the approach below and with creating a SECRET). YMMV of course, as I haven't touched C++ dev in 20 years.
D load aws;
100% ▕████████████████████████████████████████████████████████████▏
D SELECT extension_name, extension_version, install_mode FROM duckdb_extensions() where extension_version != '';
┌────────────────┬───────────────────┬───────────────────┐
│ extension_name │ extension_version │ install_mode │
│ varchar │ varchar │ varchar │
├────────────────┼───────────────────┼───────────────────┤
│ aws │ c5beeb9 │ UNKNOWN │
│ httpfs │ aaa2b5a18b │ STATICALLY_LINKED │
│ parquet │ aaa2b5a18b │ STATICALLY_LINKED │
└────────────────┴───────────────────┴───────────────────┘
D CALL load_aws_credentials('my-profile');
100% ▕████████████████████████████████████████████████████████████▏
┌──────────────────────┬──────────────────────────┬──────────────────────┬───────────────┐
│ loaded_access_key_id │ loaded_secret_access_key │ loaded_session_token │ loaded_region │
│ varchar │ varchar │ varchar │ varchar │
├──────────────────────┼──────────────────────────┼──────────────────────┼───────────────┤
│ │ │ │ us-east-1 │
└──────────────────────┴──────────────────────────┴──────────────────────┴───────────────┘
Hoping to poke around more once I wrap my head around how to set up a decent out-of-tree extension dev workflow.
@lukeman Good news.
I was not able to set up a local dev env on Mac, at least not as quickly as I would like (I would love to hear some tips on that).
However, I forked the duckdb repo and used it to build a version without extension version checks: https://github.com/osalloum/duckdb/actions/runs/9359759025?pr=2#artifacts
Then I uploaded my artifacts to a kubernetes pod on our account and tried it: it works. I just used the default call, which should follow the default credentials chain priority: env variables, then web identity, then container credentials, then instance profile (there might be more options, like program options first, for Java).
Then I validated the credentials on my local machine and they indeed map to the correct federated role.
The PR attempting to resolve this is now merged and available with:
force install aws from core_nightly
Please let me know if this still fails.
FYI
This still somehow does not work with all Docker images. If you try to use Alpine or Slim images, it won't work, but if you use an Amazon Linux based Docker image it works. As examples:
amazoncorretto:21 --> works
eclipse-temurin:21 --> does not work
public.ecr.aws/lambda/python:3.10 --> works
etc.
Thanks. It worked for me using duckdb v1.0.0 on Linux, and the pre-built/released linux_amd64_gcc4 httpfs and aws extensions.
In the AWS k8s pod environment I had, I saw these environment variables: AWS_REGION, AWS_WEB_IDENTITY_TOKEN_FILE, AWS_ROLE_ARN, AWS_STS_REGIONAL_ENDPOINTS, AWS_DEFAULT_REGION; but no AWS_ACCESS_KEY_ID, AWS_SESSION_TOKEN, or AWS_SECRET_ACCESS_KEY set, nor ~/.aws/credentials and ~/.aws/config, which I think is by design/convention per policy.
Here is the sample code snippet that worked for me with AWS S3 and localstack, which fixed HTTP 403 (and HTTP 400, "IO Error: Connection error for HTTP GET to", "Unknown error for HTTP GET to") along the way:
db.query(f"SET s3_region = '{s3_args['region_name']}';")  # always present
if s3_args.get('aws_access_key_id') and s3_args.get('aws_secret_access_key'):
    # static config, e.g. for localstack. but may be unavailable in aws k8s
    db.query(f"SET s3_access_key_id = '{s3_args['aws_access_key_id']}'; "
             f"SET s3_secret_access_key = '{s3_args['aws_secret_access_key']}'; ")
if s3_args.get('endpoint_url'):
    db.query(f"SET s3_endpoint = '{s3_args['endpoint_url'].split('://')[1]}'; "
             f"SET s3_use_ssl = {str(s3_args['endpoint_url'].startswith('https:')).lower()};")
else:
    db.query(f"SET s3_endpoint = 's3.{s3_args['region_name']}.amazonaws.com'; "
             f"SET s3_use_ssl = true; ")
db.query("SET s3_url_style = 'path';")  # path vs virtual_hosted, re: endpoint url # HTTP 400
if not s3_args.get('aws_access_key_id'):
    # this helped to fix the httpfs HTTP 403 error without access key/secret in aws,
    # which seemed to have something to do with sts assumed role in aws eks.
    db.query("LOAD aws; CREATE SECRET secret3 (TYPE S3, PROVIDER CREDENTIAL_CHAIN);")
    # log.debug('*** redacted secret: %s', db.execute('select secret_string from duckdb_secrets(redact=true);').fetchall())
#
# alternative way to get key and secret, not needed as the above worked
# creds = boto3.Session().get_credentials().get_frozen_credentials()
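For context on the s3_url_style setting used in the snippet: 'path' puts the bucket in the URL path, while virtual-hosted puts it in the host name, which is why the choice interacts with the endpoint value (and the HTTP 400s). A sketch of the two forms, with an illustrative function name:

```python
# The two S3 request URL styles that s3_url_style toggles between.
def s3_http_url(bucket: str, key: str, region: str, style: str = "vhost") -> str:
    host = f"s3.{region}.amazonaws.com"
    if style == "path":
        # Path style: the bucket is the first path segment.
        return f"https://{host}/{bucket}/{key}"
    # Virtual-hosted style: the bucket becomes part of the host name.
    return f"https://{bucket}.{host}/{key}"

path_url = s3_http_url("example-bucket", "example.parquet", "us-east-1", "path")
# -> https://s3.us-east-1.amazonaws.com/example-bucket/example.parquet
vhost_url = s3_http_url("example-bucket", "example.parquet", "us-east-1")
# -> https://example-bucket.s3.us-east-1.amazonaws.com/example.parquet
```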
Hope this feedback is helpful to someone. Let me know if you have comments or suggestions. Thanks.
We are running latest DuckDB 1.0.0 inside Windmill scripts (usually Python) and the only workaround that made it work for us here (after verifying that DuckDB will see all necessary environment variables) is the one found by @DanCardin .
We are also running Windmill inside EKS with AWS IAM service accounts providing the necessary auth, so the following environment variables are set properly and are working as intended with other SDKs on the exact same pod inside the same container:
https://github.com/duckdb/duckdb_aws/issues/16 was marked closed on Jan 2 as part of https://github.com/duckdb/duckdb_aws/pull/23 being merged. Unfortunately, the solution described in the PR body did not work, i.e. #16 does not seem fixed to me. I didn't see any of the authors/commenters of that issue confirm the fix either, so I figured I'd file a bug report here.
Repro
Re-read #16. The problem/repro described here is the same, but with more technical detail that I hope helps to solve the underlying problem.
Open a pod shell on EKS that defines the following vars:
Run aws s3 ls s3://example-bucket/example.parquet and observe successful output. This proves AWS CLI / pod perms are good.
Run ./duckdb -unredacted -json to open the duckdb console, then create a secret; it takes a few seconds here and comes back successful.
Run select secret_string from duckdb_secrets(redact=false); and observe that the expected data is there -- key_id, secret, and session_token are set with values that look legit. Then run a query against the s3 bucket:
SELECT * FROM read_parquet('s3://example-bucket/example.parquet');
Observe an error like this:
Note: there is no 403 error here. It looks like somehow the endpoint is not getting set correctly -- that https endpoint is clearly not valid. I tried specifying ENDPOINT as 's3.amazonaws.com' in the create secret command, but then I get an error like this:
I've tried lots of different permutations to get around this problem, to no avail. Here is a list of other things I've tried/learned:
The problem happens using duckdb@0.10.0 in Python as well; similar repro, same error.
Using CHAIN 'config;env', CHAIN 'sts;instance', or other variants doesn't fix the problem.