dbt-labs / dbt-external-tables

dbt macros to stage external sources
https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/
Apache License 2.0

Spark partitions are not correctly defined #160

Closed: pgoslatara closed this issue 1 year ago

pgoslatara commented 1 year ago

Describe the bug

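On Spark, partition columns are declared twice in the generated create table statement: once in the column list and again, with data types, in the partitioned by clause. Spark rejects the statement with a duplicate-column error.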

Steps to reproduce

Sample csv file:

employee_id,employee_name
1,Mary
2,John

Source:

version: 2

sources:
  - name: adls
    tables:
      - name: dummy_csv
        external:
          location: '/mnt/test/dummy'
          using: csv
          options:
            sep: ','
            header: 'true'
          partitions:
            - name: year
              data_type: int
            - name: month
              data_type: int
            - name: day
              data_type: int

        columns:
          - name: employee_id
            data_type: int
          - name: employee_name
            data_type: string
          - name: year
            data_type: int
          - name: month
            data_type: int
          - name: day
            data_type: int

Expected results

A partitioned table on Spark.

Actual results

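The generated create table statement fails with a "Found duplicate column(s)" runtime error (see the log output below).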

Screenshots and log output

dbt -d run-operation stage_external_sources --vars "ext_full_refresh: true"
...
    create table adls.dummy_csv (

            employee_id int,
            employee_name string,
            year int,
            month int,
            day int
    )  using csv
    options ('sep' = ',',
'header' = 'true')
    partitioned by (year int, month int, day int)

    location '/mnt/test/dummy'
...
08:34:59.514278 [error] [MainThread]: Encountered an error while running operation: Runtime Error
  Runtime Error
    Found duplicate column(s) in the table definition of `spark_catalog`.`adls`.`dummy_csv`: `day`, `month`, `year`
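
For comparison, a statement of the following shape should avoid the duplicate-column error. This is only a sketch based on Spark SQL's CREATE TABLE ... USING syntax, in which partitioned by lists the names of columns already declared in the column list. It reuses the table, options, and location from the log above, and it is an assumption about what a working statement looks like, not necessarily what the package emits after the fix.

```sql
-- Sketch: partition columns are declared once, with types, in the column list.
create table adls.dummy_csv (
    employee_id int,
    employee_name string,
    year int,
    month int,
    day int
)
using csv
options ('sep' = ',', 'header' = 'true')
-- Names only here: repeating the types would define year, month and day a second time.
partitioned by (year, month, day)
location '/mnt/test/dummy'
```

The only difference from the failing statement is the partitioned by clause, which no longer repeats the data types.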

System information

The contents of your packages.yml file:

packages:
  - package: dbt-labs/dbt_external_tables
    version: 0.8.0

Which database are you using dbt with? Spark

The output of dbt --version:

Core:
  - installed: 1.2.1
  - latest:    1.2.1 - Up to date!

Plugins:
  - spark:      1.2.0 - Up to date!
  - databricks: 1.2.2 - Up to date!

The operating system you're using: Ubuntu 20.04 on WSL2

The output of python --version: Python 3.9.10

Additional context

jeremyyeo commented 1 year ago

Resolved via #161