dbt-labs / dbt-external-tables

dbt macros to stage external sources
https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/
Apache License 2.0

Correcting handling of partitions in Spark and adding a sample source #161

Closed pgoslatara closed 1 year ago

pgoslatara commented 1 year ago

Setup

CSV file:

employee_id,employee_name
1,Mary
2,John
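
For context, Spark discovers partition values from Hive-style directory names beneath the table location, so a file like the one above is assumed to sit in a path such as (illustrative only):

    /mnt/test/dummy/year=2023/month=1/day=15/employees.csv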

Source:

version: 2

sources:
  - name: adls
    tables:
      - name: dummy_csv
        external:
          location: '/mnt/test/dummy'
          using: csv
          options:
            sep: ','
            header: 'true'
          partitions:
            - name: year
              data_type: int
            - name: month
              data_type: int
            - name: day
              data_type: int

        columns:
          - name: employee_id
            data_type: int
          - name: employee_name
            data_type: string
          - name: year
            data_type: int
          - name: month
            data_type: int
          - name: day
            data_type: int

Description & motivation

Addressing #160.

Previously this package handled Spark partitions like this:

    create table adls.dummy_csv (
        employee_id int,
        employee_name string,
        year int,
        month int,
        day int
    ) using csv
    options ('sep' = ',', 'header' = 'true')
    partitioned by (year int, month int, day int)
    location '/mnt/test/dummy'

Note that the data type is included in the partitioned by clause, which causes the query to fail.

This PR corrects this query to:

    create table adls.dummy_csv (
        employee_id int,
        employee_name string,
        year int,
        month int,
        day int
    ) using csv
    options ('sep' = ',', 'header' = 'true')
    partitioned by (year, month, day)
    location '/mnt/test/dummy'

The sample source has also been updated to correctly demonstrate how to specify partitions.


pgoslatara commented 1 year ago

@jtcohen6 Looks like a connection issue in the Databricks failed job. Can you help with this?

jeremyyeo commented 1 year ago

@pgoslatara I'm supposed to be helping @jtcohen6 out with this package :) There was an issue with the pyodbc dependency of the dbt-spark adapter, which should be fixed in #159.

guillesd commented 1 year ago

@jeremyyeo hey, we'd been wanting to fix this macro for a while, so it'd be nice if we can merge this one soon :) @pgoslatara maybe change the example so it doesn't add data_type in partitions, since this is not needed for Spark anyway and is already defined at the column level, e.g.

partitions:
    - name: year
    - name: month
    - name: day 

Or you could even argue for making this a plain list, so the macro only refers to partition instead of partition.name.
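
For illustration, that plain-list form would look something like this (a sketch of the suggestion, not necessarily the final schema):

partitions:
  - year
  - month
  - day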

pgoslatara commented 1 year ago

Nice idea @guillesd! I've updated this branch to take a list as input to partitions while also ensuring backward compatibility.

This is dependent on this PR in dbt-core to update the acceptable schemas for partitions.

@jeremyyeo What do you think of this updated approach?

jeremyyeo commented 1 year ago

Hey @pgoslatara and @guillesd, traditionally this package's behaviour was that a column specified under the partitions key should not also appear under the columns key.

sources:
  - name: foo
    external:
      ..
      partitions:
        - name: year      # <<< Column used for partitioning.
          data_type: int
    columns:
      - name: year        # <<< But also in the column spec.
        data_type: int
      - name: month
        data_type: int

That is, if you've specified the year column for the table to be partitioned by, you cannot also specify it under the columns key. Doing so caused the query to be generated as:

create table foo (
    year int,
    month int
)
partitioned by (
    year int
)
...

Which is invalid for sure - https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-partition.html#parameters. If we follow along with the docs, there are two scenarios:

A) The column we're partitioning by is also in the column specification

create table foo (
    year int,
    month int
)
partitioned by (
    year
)
...

This is what you've proposed and I think it will work fine.

B) The column we're partitioning by is not in the column specification

create table foo (
    month int
)
partitioned by (
    year int
)
...

With the current PR, instead of a query like the above, we get something like the following, which would be incorrect since there is no data type on year:

create table foo (
    month int
)
partitioned by (
    year
)
...

Assuming a YAML with our "traditional format" (which I believe some number of users of this package would have in their projects):

sources:
  - name: foo
    external:
      ..
      partitions:
        - name: year
          data_type: int
    columns:
      - name: month
        data_type: int

I think we should support both (A) and (B) scenarios here, what do you think? The Jinja does get a bit more complicated because we need to check whether a partition is also specified in the column spec: if it is, don't include the data type; otherwise, include it.
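
A minimal sketch of what that check could look like, with assumed variable names (columns, partitions), not the package's actual macro:

{%- set column_names = columns | map(attribute='name') | list %}
partitioned by (
    {%- for partition in partitions %}
    {#- include the data type only when the partition is not already declared as a column #}
    {{ partition.name }}{{ '' if partition.name in column_names else ' ' ~ partition.data_type }}
    {{- ', ' if not loop.last }}
    {%- endfor %}
)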

pgoslatara commented 1 year ago

@jeremyyeo Thanks for taking the time to take a detailed look into this.

As you point out, there are two acceptable approaches here per the docs. Specifying a column type in the PARTITIONED BY clause is only necessary when the partition is not included in the columns. If the partition is included in the columns then no column type is necessary. For example, both of these are valid queries and result in identical tables being created:

create table adls.dummy_csv (
  employee_id int,
  employee_name string
) using csv
options ('sep' = ',', 'header' = 'true')
partitioned by (year int, month int, day int)
location '/mnt/test/dummy'

create table adls.dummy_csv (
  employee_id int,
  employee_name string,
  year int,
  month int,
  day int
) using csv
options ('sep' = ',', 'header' = 'true')
partitioned by (year, month, day)
location '/mnt/test/dummy'

Given that the outcome of each approach is the same, which one is implemented does not matter hugely.

Your request is to handle cases where:

a) The partition is specified only in partitions
b) The partition is specified in both columns and partitions

In both cases the partition is specified with both a name and a data_type. (The question of what happens if inconsistent data_type values are specified is not addressed.)
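
For illustration, an inconsistent specification would be something like this (hypothetical; the behaviour here is undefined):

partitions:
  - name: year
    data_type: string  # conflicts with the int declared below
columns:
  - name: year
    data_type: int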

The latest commits use the second approach from above to handle cases where partitions are specified according to case a) or case b):

version: 2

sources:
  - name: adls
    tables:
      - name: dummy_csv_partitions_as_columns
        external:
          location: '/mnt/test/dummy'
          using: csv
          options:
            sep: ','
            header: 'true'
          partitions:
            - name: year
              data_type: int
            - name: month
              data_type: int
            - name: day
              data_type: int
        columns:
          - name: employee_id
            data_type: int
          - name: employee_name
            data_type: string
          - name: year
            data_type: int
          - name: month
            data_type: int
          - name: day
            data_type: int

      - name: dummy_csv_partitions_only
        external:
          location: '/mnt/test/dummy'
          using: csv
          options:
            sep: ','
            header: 'true'
          partitions:
            - name: year
              data_type: int
            - name: month
              data_type: int
            - name: day
              data_type: int
        columns:
          - name: employee_id
            data_type: int
          - name: employee_name
            data_type: string

The generated SQL is identical for both sources:

    create table adls.dummy_csv_partitions_as_columns (
        employee_id int,
        employee_name string,
        year int,
        month int,
        day int
    ) using csv
    options ('sep' = ',', 'header' = 'true')
    partitioned by (year, month, day)
    location '/mnt/test/dummy'

...

    create table adls.dummy_csv_partitions_only (
        employee_id int,
        employee_name string,
        year int,
        month int,
        day int
    ) using csv
    options ('sep' = ',', 'header' = 'true')
    partitioned by (year, month, day)
    location '/mnt/test/dummy'

The sample source has been updated to prefer the second approach as this is cleaner.

@jeremyyeo Long comment, I know. What do you think?

pgoslatara commented 1 year ago

@jeremyyeo What do you think of this updated approach?


jeremyyeo commented 1 year ago

Hey @pgoslatara, apologies for the slow response. This looks good to me :) I added an {% if partitions %} check for the existence of the partitions key, because we were iterating even when there was no partitions key in the source YAML, causing the macro to error since we can't iterate over None.
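
For illustration, a guard along those lines (a sketch, not the exact macro code):

{%- if partitions %}
partitioned by (
    {%- for partition in partitions %}
    {{ partition.name }}{{ ', ' if not loop.last }}
    {%- endfor %}
)
{%- endif %}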

The misleading test pass strikes again (#171) but that's beside the point :P

Thanks for the help @pgoslatara.