dbt-labs / dbt-external-tables

dbt macros to stage external sources
https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/
Apache License 2.0
297 stars 119 forks source link

When name of column is "UNION", external table creation failes #66

Closed codigo-ergo-sum closed 3 years ago

codigo-ergo-sum commented 3 years ago

Describe the bug

We have incoming data with various column names that we want to stage using dbt-external-tables. One of the column names in the incoming data is "UNION". It is union because the business meaning is "the name of the labor union to which the employee belongs." When we create a .yml file referencing this column and run dbt run-operation stage_external_sources --vars 'ext_full_refresh: true' then we get this error message:

SQL compilation error: syntax error line 35 at position 12 unexpected 'UNION'. syntax error line 35 at position 47 unexpected ''''. syntax error line 35 at position 50 unexpected '::'.

Example .yml file (truncated to remove other sensitive names):

version: 2

sources:
  - name: our_source
    database: RAW_DEV
    loader: s3

    tables:
      - name: employee

        external:
          location: "@raw_dev.our_source.employee_stg"
          file_format: "raw_dev.our_source.pipe_delimited_csv_fmt"
          auto_refresh: true

        columns:
          - name: FID_ID
            data_type: VARCHAR

          - name: CLIENT_ID
            data_type: VARCHAR

          - name: UNION
            data_type: VARCHAR

External table creation statement generated from this (that errors out) - edited to again remove sensitive columns:

create or replace external table RAW_DEV.our_source.employee_start_card(
            FID_ID VARCHAR as (nullif(value:c1,'')::VARCHAR),
            CLIENT_ID VARCHAR as (nullif(value:c2,'')::VARCHAR),
            UNION VARCHAR as (nullif(value:c3,'')::VARCHAR)
    )
    location = @raw_dev.our_source.employee_stg 
    auto_refresh = True

    file_format = raw_dev.our_source.pipe_delimited_csv_fmt

We know that it is an issue with the keyword UNION because if we change UNION to UNION_ in the .yml file, everything runs fine.

We've also tried enclosing the column name in single quotes, double quotes, and backticks without luck.

Steps to reproduce

See as above

Expected results

Keywords like "UNION" would not interfere with external table configuration.

Actual results

Errors as above

Screenshots and log output

As above

System information

The contents of your packages.yml file:

packages:
 - package: fishtown-analytics/dbt_utils
   version: 0.6.3
 - package: fishtown-analytics/dbt_external_tables
   version: 0.6.0

Which database are you using dbt with?

The output of dbt --version:

installed version: 0.18.1
   latest version: 0.18.1

Up to date!

Plugins:
  - bigquery: 0.18.1
  - snowflake: 0.18.1
  - redshift: 0.18.1
  - postgres: 0.18.1

The operating system you're using: Mac OS X

The output of python --version: Python 2.7.16

jtcohen6 commented 3 years ago

@codigo-ergo-sum Thanks for the detailed write-up. I've thought about this a bit more, and I've come to feel that we're really at the mercy of Snowflake keywords here. Since you are loading a CSV file, and you're matching columns by ordinal position rather than name (as in JSON/parquet/etc), your first and best choice is to pick a different column name :)

But we can still find a way to make this work. The create external table statement succeeds if we quote the keyword:

create or replace external table RAW_DEV.our_source.employee_start_card(
            FID_ID VARCHAR as (nullif(value:c1,'')::VARCHAR),
            CLIENT_ID VARCHAR as (nullif(value:c2,'')::VARCHAR),
            "UNION" VARCHAR as (nullif(value:c3,'')::VARCHAR)
    )
    location = @raw_dev.our_source.employee_stg 
    auto_refresh = True

    file_format = raw_dev.our_source.pipe_delimited_csv_fmt

That puts us at the mercy of Snowflake's casing + quoting behavior. You'd always need to query it as:

select "UNION" from raw_dev.our_source.employee_start_card

This, by contrast, would raise an error:

select "union" from raw_dev.our_source.employee_start_card

That's something we'll just have to live with, I think. This is why we don't quote identifiers in Snowflake unless we absolutely must. (Vs. Redshift, where we can just quote everything.)

Possible today

As far as defining that quoting behavior in the dbt .yml file, I think you could manage this with whichever of the following you mind the least:

          - name: '"UNION"'
            data_type: VARCHAR

          - name: "\"UNION"\"
            data_type: VARCHAR

Should be possible

          - name: UNION
            data_type: VARCHAR
            quote: true

Setting this property is what we recommend to get schema tests working with a quoted column (docs). I can open a PR to make the external tables package take account of quote: true, too.