duckdb / dbt-duckdb

dbt (http://getdbt.com) adapter for DuckDB (http://duckdb.org)
Apache License 2.0
924 stars 89 forks source link

Glue Catalog integration throws error that 'secrets' variable is unset in call to _get_glue_client() during Plugin initialization #432

Closed bitadmiral closed 2 months ago

bitadmiral commented 2 months ago

I'm attempting to write parquet files to S3 while also registering the output as a table in the Glue Catalog. My profile target looks like below where I define the secrets key.

    dev_by_secrets:
      type: duckdb
      path: "{{ env_var('DUCKDB_DATABASE') }}"
      threads: 1
      extensions:
        - parquet
        - httpfs
      secrets:
        - type: s3
          region: "{{ env_var('AWS_REGION') }}"
          key_id: "{{ env_var('AWS_ACCESS_KEY_ID') }}"
          secret: "{{ env_var('AWS_SECRET_ACCESS_KEY') }}"
          session_token: "{{ env_var('AWS_SESSION_TOKEN') }}"

 My model configuration is configured as follows to create the table in Glue:

models:
  - name: test_condition_names_exttbl
    columns: 
    - name: condition_id
      data_type: VARCHAR
    - name: acg_condition_name
      data_type: VARCHAR
    - name: qm_condition_name
      data_type: VARCHAR
    config: 
      materialized: external
      format: parquet
      location: 's3://sandbox/(...hidden...)/test/test_condition_names_exttbl.parquet'
      glue_register: true
      glue_database: test

The SQL model is quite simple:

SELECT
    *
FROM
    read_csv(...hidden...)

If I make one change to the model configuration, glue_register: false, then the model runs successfully. It reads the CSV file from S3 and writes the parquet file to S3.

dbt run -t dev_by_secrets --select testing.test_condition_names_exttbl

22:59:32 1 of 1 START sql external model main.test_condition_names_exttbl ............ [RUN] 22:59:33 1 of 1 OK created sql external model main.test_condition_names_exttbl ....... [OK in 0.78s] 22:59:33
22:59:33 Finished running 1 external model in 0 hours 0 minutes and 1.51 seconds (1.51s). 22:59:33
22:59:33 Completed successfully 22:59:33
22:59:33 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

But with glue_register: true, I get this error:

- Unhandled error while executing .../test_condition_names_exttbl.sql
- cannot access local variable 'secrets' where it is not associated with a value

Running dbt run --debug shows this error is happening at line 348 in the initialize() function of the Plugin class in glue.py, in the call to _get_glue_client(config, secrets):

class Plugin(BasePlugin):
    def initialize(self, config: Dict[str, Any]):
        if self.creds is not None:
            secrets = self.creds.secrets
        self.client = _get_glue_client(config, secrets)
        self.database = config.get("glue_database", "default")
        self.delimiter = config.get("delimiter", ",")

I guess self.creds is not being set in the __init__ function of BasePlugin class, line 92 because the argument credentials: Optional[DuckDBCredentials] = None isn't being passed, but I don't know why not.

I have tried all of the following alternative target configurations and the result is the same error. However all of these will run the model successfully with glue_register: false.

    dev_by_env:
      type: duckdb
      path: "{{ env_var('DUCKDB_DATABASE') }}"
      threads: 1
      extensions:
        - parquet
        - httpfs
      settings:
        s3_region: "{{ env_var('AWS_REGION') }}"
        s3_access_key_id: "{{ env_var('AWS_ACCESS_KEY_ID') }}"
        s3_secret_access_key: "{{ env_var('AWS_SECRET_ACCESS_KEY') }}"
        s3_session_token: "{{ env_var('AWS_SESSION_TOKEN') }}"
        s3_use_ssl: true
        s3_url_style: vhost

    dev_by_chain_deprecated:
      type: duckdb
      use_credential_provider: aws        
      path: "{{ env_var('DUCKDB_DATABASE') }}"
      threads: 1

    dev_by_chain:
      type: duckdb
      provider: credential_chain        
      path: "{{ env_var('DUCKDB_DATABASE') }}"
      threads: 1

In each test case I verified that my credentials stored in the AWS... environment variables were still valid by running commands like aws s3 ls ..., aws s3api get-object --bucket <bucket_name> --key <object> <outfile>. I'm just not sure what needs to be done in order to get the plugin to take the secret and pass it to the glue client.

FYI, here is my current pip list:

Package                   Version
------------------------- ---------------
agate                     1.9.1
annotated-types           0.7.0
attrs                     24.2.0
azure-core                1.30.2
azure-identity            1.17.1
azure-storage-blob        12.22.0
babel                     2.16.0
boto3                     1.35.9
botocore                  1.35.9
cachetools                5.4.0
certifi                   2024.7.4
cffi                      1.17.0
charset-normalizer        3.3.2
click                     8.1.7
colorama                  0.4.6
cryptography              43.0.0
daff                      1.3.46
dbt-adapters              1.4.1
dbt-common                1.7.0
dbt-core                  1.8.6
dbt-duckdb                1.8.3
dbt-extractor             0.5.1
dbt-loom                  0.5.6
dbt-semantic-interfaces   0.5.1
deepdiff                  7.0.1
duckdb                    1.0.0
google-api-core           2.19.1
google-auth               2.33.0
google-cloud-core         2.4.1
google-cloud-storage      2.18.2
google-crc32c             1.5.0
google-resumable-media    2.7.2
googleapis-common-protos  1.63.2
idna                      3.7
importlib-metadata        6.11.0
isodate                   0.6.1
Jinja2                    3.1.4
jmespath                  1.0.1
jsonschema                4.23.0
jsonschema-specifications 2023.12.1
leather                   0.4.0
Logbook                   1.5.3
MarkupSafe                2.1.5
mashumaro                 3.13.1
minimal-snowplow-tracker  0.0.2
more-itertools            10.4.0
msal                      1.30.0
msal-extensions           1.2.0
msgpack                   1.0.8
mypy-boto3-glue           1.35.3
networkx                  3.3
numpy                     2.0.1
ordered-set               4.1.0
packaging                 24.1
parsedatetime             2.6
pathspec                  0.12.1
pip                       24.2
portalocker               2.10.1
proto-plus                1.24.0
protobuf                  4.25.4
pyasn1                    0.6.0
pyasn1_modules            0.4.0
pycparser                 2.22
pydantic                  2.8.2
pydantic_core             2.20.1
PyJWT                     2.9.0
python-dateutil           2.9.0.post0
python-slugify            8.0.4
pytimeparse               1.1.8
pytz                      2024.1
PyYAML                    6.0.2
referencing               0.35.1
requests                  2.32.3
rpds-py                   0.20.0
rsa                       4.9
s3transfer                0.10.2
six                       1.16.0
sqlparse                  0.5.1
text-unidecode            1.3
types-networkx            3.2.1.20240811
types-PyYAML              6.0.12.20240808
typing_extensions         4.12.2
urllib3                   2.2.2
zipp                      3.20.0

(I should also mention my issue is similar to issue #83 which was closed with a mystery solution.)

The environment in which I am running dbt is Ubuntu 22.04.4 LTS (Jammy Jellyfish) running in AWS Workspaces (essentially an EC2 instance).

jwills commented 2 months ago

@guenp to take a look at this-- looks like it's related to the secrets changes in https://github.com/duckdb/dbt-duckdb/pull/403

FWIW @bitadmiral if you can run on dbt-duckdb 1.8.1 for this, it should work using the legacy Glue registration code

guenp commented 2 months ago

Thank you @bitadmiral for the super thorough bug report and clear repro steps. I think I've found the bug, I narrowed it down to this line, where we're not passing the credentials to the plugin constructor: https://github.com/duckdb/dbt-duckdb/blob/master/dbt/adapters/duckdb/environments/local.py#L151