duckdb / duckdb_aws


httpfs s3 configuration values are not configured #5

Closed skmgoldin closed 10 months ago

skmgoldin commented 10 months ago

DuckDB version:

D PRAGMA version;
┌─────────────────┬────────────┐
│ library_version │ source_id  │
│     varchar     │  varchar   │
├─────────────────┼────────────┤
│ v0.8.2-dev3244  │ fb10d8eede │
└─────────────────┴────────────┘

The extension fails to set the httpfs S3 configuration values:

D INSTALL httpfs; LOAD httpfs; INSTALL aws; LOAD aws;
D CALL load_aws_credentials();
┌──────────────────────┐
│      loaded_key      │
│       varchar        │
├──────────────────────┤
│ AKIAXXXXXXXXXXXXXXXX │ # redacted
└──────────────────────┘
D SELECT * FROM duckdb_settings() WHERE name LIKE 's3%';
┌──────────────────────┬──────────────────┬──────────────────────────────────────────────────────────────┬────────────┐
│         name         │      value       │                         description                          │ input_type │
│       varchar        │     varchar      │                           varchar                            │  varchar   │
├──────────────────────┼──────────────────┼──────────────────────────────────────────────────────────────┼────────────┤
│ s3_uploader_thread…  │ 50               │ S3 Uploader global thread limit (default 50)                 │ UBIGINT    │
│ s3_uploader_max_fi…  │ 800GB            │ S3 Uploader max filesize (between 50GB and 5TB, default 80…  │ VARCHAR    │
│ s3_url_compatibili…  │ 0                │ Disable Globs and Query Parameters on S3 urls                │ BOOLEAN    │
│ s3_use_ssl           │ 1                │ S3 use SSL (default true)                                    │ BOOLEAN    │
│ s3_uploader_max_pa…  │ 10000            │ S3 Uploader max parts per file (between 1 and 10000, defau…  │ UBIGINT    │
│ s3_secret_access_key │                  │ S3 Access Key                                                │ VARCHAR    │
│ s3_session_token     │                  │ S3 Session Token                                             │ VARCHAR    │
│ s3_url_style         │ vhost            │ S3 url style ('vhost' (default) or 'path')                   │ VARCHAR    │
│ s3_endpoint          │ s3.amazonaws.com │ S3 Endpoint (default 's3.amazonaws.com')                     │ VARCHAR    │
│ s3_access_key_id     │                  │ S3 Access Key ID                                             │ VARCHAR    │
│ s3_region            │ us-east-1        │ S3 Region                                                    │ VARCHAR    │
├──────────────────────┴──────────────────┴──────────────────────────────────────────────────────────────┴────────────┤
│ 11 rows                                                                                                   4 columns │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

I am a big fan of what this extension is trying to do, and I hope you succeed in achieving it!

stephaniewang526 commented 10 months ago

I was unable to repro this. I ran aws configure and then:

stephaniewang@Stephanies-MacBook-Pro duckdb_aws % ./build/debug/duckdb
v0.8.2-dev2700 acbbfe0e79
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D load aws;
D load httpfs;
D CALL load_aws_credentials();
┌──────────────────────┐
│      loaded_key      │
│       varchar        │
├──────────────────────┤
│ AKIA[REDACTED]       │
└──────────────────────┘
D select * from duckdb_settings() WHERE name LIKE 's3%';
┌────────────────────────────────┬──────────────────────────────────────────┬─────────────────────────────────────────────────────────────────────┬────────────┐
│              name              │                  value                   │                             description                             │ input_type │
│            varchar             │                 varchar                  │                               varchar                               │  varchar   │
├────────────────────────────────┼──────────────────────────────────────────┼─────────────────────────────────────────────────────────────────────┼────────────┤
│ s3_uploader_thread_limit       │ 50                                       │ S3 Uploader global thread limit (default 50)                        │ UBIGINT    │
│ s3_uploader_max_filesize       │ 800GB                                    │ S3 Uploader max filesize (between 50GB and 5TB, default 800GB)      │ VARCHAR    │
│ s3_url_compatibility_mode      │ 0                                        │ Disable Globs and Query Parameters on S3 urls                       │ BOOLEAN    │
│ s3_use_ssl                     │ 1                                        │ S3 use SSL (default true)                                           │ BOOLEAN    │
│ s3_region                      │ us-east-1                                │ S3 Region                                                           │ VARCHAR    │
│ s3_uploader_max_parts_per_file │ 10000                                    │ S3 Uploader max parts per file (between 1 and 10000, default 10000) │ UBIGINT    │
│ s3_secret_access_key           │ XoH8[REDACTED] │ S3 Access Key                                                       │ VARCHAR    │
│ s3_access_key_id               │ AKIA[REDACTED]                     │ S3 Access Key ID                                                    │ VARCHAR    │
│ s3_url_style                   │ vhost                                    │ S3 url style ('vhost' (default) or 'path')                          │ VARCHAR    │
│ s3_endpoint                    │ s3.amazonaws.com                         │ S3 Endpoint (default 's3.amazonaws.com')                            │ VARCHAR    │
│ s3_session_token               │                                          │ S3 Session Token                                                    │ VARCHAR    │
├────────────────────────────────┴──────────────────────────────────────────┴─────────────────────────────────────────────────────────────────────┴────────────┤
│ 11 rows                                                                                                                                            4 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
D 

(I don't have an s3_session_token.)

Could this have something to do with the DuckDB version used? Mine is built from the submodule in this repo:

D PRAGMA version;
┌─────────────────┬────────────┐
│ library_version │ source_id  │
│     varchar     │  varchar   │
├─────────────────┼────────────┤
│ v0.8.2-dev2700  │ acbbfe0e79 │
└─────────────────┴────────────┘
skmgoldin commented 10 months ago

I will try again with the pinned submodule version and report back!

skmgoldin commented 10 months ago

Weird, still doesn't work for me with acbbfe0e79. I even ran a fresh aws configure, as you did. I have no AWS_* variables in my environment, and aws sts get-caller-identity does work, returning my identity as that of the default profile in my config/credentials files.

The value it prints for loaded_key after running load_aws_credentials() is correct; it just doesn't seem to persist anything to the DuckDB settings.

I'm on macOS 13.5.1, for what that's worth.

stephaniewang526 commented 10 months ago

I also cleared out my AWS env vars and it still works for me. To double-check: do you see your creds in ~/.aws/credentials? And did you also build the AWS extension from source?
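For reference, a quick way to audit which credential sources the AWS SDK will see (commands are illustrative and assume the default profile and default file locations):

```
$ env | grep '^AWS_'                     # should print nothing if env vars are cleared
$ grep '\[default\]' ~/.aws/credentials  # the default profile should be present
$ aws sts get-caller-identity            # the AWS CLI should resolve the same profile
```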

I even ran the duckdb_settings() query before and after calling load_aws_credentials(), and it behaved as expected:

stephaniewang@Stephanies-MacBook-Pro duckdb_aws % ./build/debug/duckdb
v0.8.2-dev2700 acbbfe0e79
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D install aws; load aws; load httpfs;
D select * from duckdb_settings() WHERE name LIKE 's3%';
┌────────────────────────────────┬──────────────────┬─────────────────────────────────────────────────────────────────────┬────────────┐
│              name              │      value       │                             description                             │ input_type │
│            varchar             │     varchar      │                               varchar                               │  varchar   │
├────────────────────────────────┼──────────────────┼─────────────────────────────────────────────────────────────────────┼────────────┤
│ s3_uploader_thread_limit       │ 50               │ S3 Uploader global thread limit (default 50)                        │ UBIGINT    │
│ s3_uploader_max_filesize       │ 800GB            │ S3 Uploader max filesize (between 50GB and 5TB, default 800GB)      │ VARCHAR    │
│ s3_url_compatibility_mode      │ 0                │ Disable Globs and Query Parameters on S3 urls                       │ BOOLEAN    │
│ s3_use_ssl                     │ 1                │ S3 use SSL (default true)                                           │ BOOLEAN    │
│ s3_region                      │                  │ S3 Region                                                           │ VARCHAR    │
│ s3_uploader_max_parts_per_file │ 10000            │ S3 Uploader max parts per file (between 1 and 10000, default 10000) │ UBIGINT    │
│ s3_secret_access_key           │                  │ S3 Access Key                                                       │ VARCHAR    │
│ s3_access_key_id               │                  │ S3 Access Key ID                                                    │ VARCHAR    │
│ s3_url_style                   │ vhost            │ S3 url style ('vhost' (default) or 'path')                          │ VARCHAR    │
│ s3_endpoint                    │ s3.amazonaws.com │ S3 Endpoint (default 's3.amazonaws.com')                            │ VARCHAR    │
│ s3_session_token               │                  │ S3 Session Token                                                    │ VARCHAR    │
├────────────────────────────────┴──────────────────┴─────────────────────────────────────────────────────────────────────┴────────────┤
│ 11 rows                                                                                                                    4 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
D CALL load_aws_credentials();
┌──────────────────────┐
│      loaded_key      │
│       varchar        │
├──────────────────────┤
│ AKIA................... │
└──────────────────────┘
D select * from duckdb_settings() WHERE name LIKE 's3%';
┌────────────────────────────────┬──────────────────────────────────────────┬─────────────────────────────────────────────────────────────────────┬────────────┐
│              name              │                  value                   │                             description                             │ input_type │
│            varchar             │                 varchar                  │                               varchar                               │  varchar   │
├────────────────────────────────┼──────────────────────────────────────────┼─────────────────────────────────────────────────────────────────────┼────────────┤
│ s3_uploader_thread_limit       │ 50                                       │ S3 Uploader global thread limit (default 50)                        │ UBIGINT    │
│ s3_uploader_max_filesize       │ 800GB                                    │ S3 Uploader max filesize (between 50GB and 5TB, default 800GB)      │ VARCHAR    │
│ s3_url_compatibility_mode      │ 0                                        │ Disable Globs and Query Parameters on S3 urls                       │ BOOLEAN    │
│ s3_use_ssl                     │ 1                                        │ S3 use SSL (default true)                                           │ BOOLEAN    │
│ s3_region                      │                                          │ S3 Region                                                           │ VARCHAR    │
│ s3_uploader_max_parts_per_file │ 10000                                    │ S3 Uploader max parts per file (between 1 and 10000, default 10000) │ UBIGINT    │
│ s3_secret_access_key           │ XoH................... │ S3 Access Key                                                       │ VARCHAR    │
│ s3_access_key_id               │ AKIA...................                     │ S3 Access Key ID                                                    │ VARCHAR    │
│ s3_url_style                   │ vhost                                    │ S3 url style ('vhost' (default) or 'path')                          │ VARCHAR    │
│ s3_endpoint                    │ s3.amazonaws.com                         │ S3 Endpoint (default 's3.amazonaws.com')                            │ VARCHAR    │
│ s3_session_token               │                                          │ S3 Session Token                                                    │ VARCHAR    │
├────────────────────────────────┴──────────────────────────────────────────┴─────────────────────────────────────────────────────────────────────┴────────────┤
│ 11 rows                                                                                                                                            4 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Since you do see the loaded_key returned, it's very odd that duckdb_settings() (which pulls from DBConfig) isn't populated, since DBConfig is set as part of that table function call.

The only other thing I can think of: are you using the same DuckDB connection to load the credentials and to query duckdb_settings()? For instance, if you load the creds in one DuckDB instance, quit out, and start DuckDB again, the creds won't persist in the DuckDB settings.
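To illustrate that point: the settings set by load_aws_credentials() live in the DBConfig of the running database instance, so with the transient in-memory CLI database they are discarded on exit. A sketch of the two-session scenario (output elided):

```
$ duckdb                # session 1: transient in-memory database
D LOAD aws; LOAD httpfs;
D CALL load_aws_credentials();
D .quit                 -- DBConfig is discarded along with the database
$ duckdb                # session 2: the setting is empty again
D SELECT value FROM duckdb_settings() WHERE name = 's3_access_key_id';
```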

skmgoldin commented 10 months ago

I got this working reliably, but I don't exactly understand why it works. If I build DuckDB from the duckdb repo at commit acbbfe0e79, then run this command sequence:

D INSTALL httpfs; LOAD httpfs; INSTALL aws; LOAD aws;
D CALL load_aws_credentials();
<...>
D SELECT * FROM duckdb_settings() WHERE name LIKE 's3%';

It fails as described above. HOWEVER!

If I instead build DuckDB from this repo, duckdb_aws, using the make command at the root of the repository, and then run the built binary with the same sequence of commands, it works.

The MD5 hash of the aws extension file in my ~/.duckdb directory for this commit differs from that in the build directory of my local copy of this repository, so perhaps that is a hint.
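For anyone reproducing that comparison, it looks something like the following. The paths are illustrative: the versioned directory under ~/.duckdb and the platform directory depend on your DuckDB build and OS/arch, and the local build output path assumes the repo's default make layout.

```
$ md5 ~/.duckdb/extensions/v0.8.2-dev2700/osx_arm64/aws.duckdb_extension
$ md5 ./build/release/extension/aws/aws.duckdb_extension
```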

stephaniewang526 commented 10 months ago

Glad it's working for you! Odd observation for sure... We shouldn't need to build from this repo's pinned hash specifically. I think we will need to wait until DuckDB releases 0.9 (in September) to see whether the extension works with the released binary (I hope it does!). What a mystery...

skmgoldin commented 10 months ago

Thanks for your assistance debugging!

I will leave this open for now in case the maintainer has any comments, but my problem is solved, so please feel free to close this if you have nothing to remark. I am still curious why the extension does not work when installed against a "supported" commit hash of duckdb, whereas building duckdb together with the extension and referencing the locally built extension does work.

samansmink commented 10 months ago

With PR https://github.com/duckdblabs/duckdb_aws/pull/8, this extension should be buildable from this repo with the duckdb commit from the submodule. Before the 0.9 release we will update the aws commit in the main duckdb CI, which will ensure the distributed binaries are built from the most recent version of this extension.
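For reference, a minimal build along those lines (the bare make target and the release output path are assumptions based on the usual DuckDB extension-template layout):

```
$ git clone --recurse-submodules https://github.com/duckdblabs/duckdb_aws
$ cd duckdb_aws
$ make                      # builds DuckDB from the pinned submodule plus this extension
$ ./build/release/duckdb
D LOAD aws; LOAD httpfs;
D CALL load_aws_credentials();
```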