dbt-labs / dbt-external-tables

dbt macros to stage external sources
https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/
Apache License 2.0
309 stars · 124 forks

Redshift ALTER statement missing partitions from sources.yml #58

Closed jasonicarter closed 3 years ago

jasonicarter commented 3 years ago

Describe the bug

The ALTER TABLE statement (Redshift) is missing partition clauses. It appears to use only the first and last partitions defined in the sources.yml file. I've tried this a number of times with different numbers of partitions; dbt.log always shows only the 1st and last.

Steps to reproduce

(This was my original setup; the test yml below is a simplified version that causes the same error.) My setup uses AWS Glue, so I'll try to include everything I did...

  1. Put data into an S3 bucket with 4 partitions (folders) under a top-level split, e.g. `s3://my_bucket/type=subject_1/year=2020/month=11/day=29/hour=01` (after some testing, it appears this doesn't matter).
  2. Using AWS Glue, set up a crawler with the "top level split" as the table name. This gives you a table named "subject_1" whose data has partitions year, month, and day (the data is already partitioned by Glue; using external-tables is probably redundant, but that's beside the point for this issue).
  3. Create a Redshift external schema (as this is expected by the package). At this point, if you run `select values, location from svv_external_partitions where tablename = 'type_subject_1'`, you'll get something like `values = ["2020","11","29","01"]` and `location = s3://my_bucket/type=subject_1/year=2020/month=11/day=29/hour=01/`.
  4. Follow the package instructions: set up the yml and execute `dbt run-operation stage_external_sources --vars 'ext_full_refresh: true'` (yml snippet below).
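For reference, a rough sketch of what a sources.yml along these lines might look like (this is not the author's actual file, which was attached as a screenshot; all names here follow the identifiers used in the logs below, and the serde settings mirror the generated CREATE statement):

```yml
sources:
  - name: spectrum_schema
    tables:
      - name: spectrum_table_name
        external:
          location: "s3://my_bucket/"
          row_format: >
            serde 'org.openx.data.jsonserde.JsonSerDe'
            with serdeproperties ('strip.outer.array'='false')
          partitions:
            - name: num_one
              data_type: varchar(255)
            - name: num_two
              data_type: varchar(255)
            - name: num_three
              data_type: varchar(255)
```

With three partitions defined like this, the bug below is that only `num_one` and `num_three` make it into the generated ALTER statement.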

Expected results

A new external table should be created with the correct partitions.

Actual results

The operation failed. The table was created with the correct columns, but the run ended with an error message: `Encountered an error while running operation: Database Error  The number of partition keys do not match the number of partition values`

Screenshots and log output

From dbt.log, the create external table statement (created correctly in the database):

    create external table "prod"."spectrum_schema"."spectrum_table_name" (
        "id" varchar(255),
        "version" int
    )
    partitioned by ("num_one" varchar(255), "num_two" varchar(255), "num_three" varchar(255))
    row format serde 'org.openx.data.jsonserde.JsonSerDe'
    with serdeproperties ('strip.outer.array'='false')
    location 's3://my_bucket/'

The alter table statement (where the error occurs — note that `num_two` is missing):

    alter table "prod"."spectrum_schema"."spectrum_table_name" add if not exists
    partition (num_one='2020', num_three='10')
    location 's3://my_bucket/num_one=2020/num_three=10/'
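To illustrate the failure mode: the table is partitioned by three keys, but the generated partition spec supplies only two values (`num_two` is dropped), which is exactly the mismatch Redshift rejects. A minimal sketch of what correct clause generation looks like — this is hypothetical illustration code, not the package's actual Jinja macro — pairs every partition key with its value so the counts can never diverge:

```python
def partition_clause(table: str, partitions: dict, s3_prefix: str) -> str:
    """Build a Redshift ALTER ... ADD PARTITION statement.

    `partitions` is an ordered mapping of partition key -> value,
    e.g. {"num_one": "2020", "num_two": "11", "num_three": "10"}.
    Because the spec and the S3 path are derived from the same mapping,
    the number of keys always matches the number of values.
    """
    spec = ", ".join(f"{k}='{v}'" for k, v in partitions.items())
    path = "/".join(f"{k}={v}" for k, v in partitions.items())
    return (
        f"alter table {table} add if not exists "
        f"partition ({spec}) location '{s3_prefix}/{path}/'"
    )

sql = partition_clause(
    '"prod"."spectrum_schema"."spectrum_table_name"',
    {"num_one": "2020", "num_two": "11", "num_three": "10"},
    "s3://my_bucket",
)
print(sql)
```

All three keys appear in both the partition spec and the location path, matching the corrected output shown at the bottom of this thread.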

    2020-11-30 00:00:25.633851 (MainThread): Postgres error: The number of partition keys do not match the number of partition values
    2020-11-30 00:00:25.634332 (MainThread): Error running SQL: macro stage_external_sources
    2020-11-30 00:00:25.634566 (MainThread): Rolling back transaction.
    2020-11-30 00:00:25.634787 (MainThread): On macro_stage_external_sources: Close
    2020-11-30 00:00:25.635441 (MainThread): Encountered an error while running operation: Database Error The number of partition keys do not match the number of partition values

System information

The contents of your packages.yml file: packages:

Which database are you using dbt with?

The output of dbt --version:

installed version: 0.18.1
   latest version: 0.18.1

Up to date!

Plugins:
  - bigquery: 0.18.1
  - snowflake: 0.18.1
  - redshift: 0.18.1
  - postgres: 0.18.1

The operating system you're using: macOS Catalina v10.15.7

Additional context

test yml file

(screenshot of the test yml file attached)
jtcohen6 commented 3 years ago

Hey @jasonicarter, could you try out the changes in #59? I'm not in love with my approach, but I'm hopeful that it resolves this bug in the short term:

packages:
  - git: https://github.com/fishtown-analytics/dbt-external-tables
    revision: fix/redshift-multipartitioned
jasonicarter commented 3 years ago

> Hey @jasonicarter, could you try out the changes in #59? I'm not in love with my approach, but I'm hopeful that it resolves this bug in the short term:
>
>     packages:
>       - git: https://github.com/fishtown-analytics/dbt-external-tables
>         revision: fix/redshift-multipartitioned

Hi @jtcohen6, I can confirm that the package change above successfully created and altered the external table. Using the same sources.yml as in the original issue, please see the dbt.log snippet below...

        begin; commit; 
        alter table "prod"."spectrum_schema"."spectrum_table" add if not exists 
        partition (num_one='2020', num_two='11', num_three='10')
        location 's3://my_bucket/num_one=2020/num_two=11/num_three=10/'