Closed: pgoslatara closed this 1 year ago
@jtcohen6 Looks like a connection issue in the Databricks failed job. Can you help with this?
@pgoslatara I'm supposed to be helping @jtcohen6 out with this package :) There was an issue with the pyodbc dependency of the dbt-spark adapter, which should be fixed in #159.
@jeremyyeo hey, we'd been wanting to fix this macro for a while, so it'd be nice if we could merge this one soon :) @pgoslatara maybe change the example so it doesn't add data_type in partitions, since this is not needed for Spark anyway and it's defined at the column level, e.g.
partitions:
- name: year
- name: month
- name: day
Or you could even argue for making this a plain list and then only referring to partition instead of partition.name.
Nice idea @guillesd! I've updated this branch to take a list as input to partitions while also ensuring backward compatibility. This is dependent on this PR in dbt-core to update the acceptable schemas for partitions.
@jeremyyeo What do you think of this updated approach?
Hey @pgoslatara and @guillesd, traditionally this package's behaviour was that the column you specified in the partitions key should not also be one that appears in the columns key.
sources:
- name: foo
external:
..
partitions:
- name: year # <<< Column used for partitioning.
data_type: int
columns:
- name: year # <<< But also in the column spec.
data_type: int
- name: month
data_type: int
As in, since you've specified column year for the table to be partitioned by, you cannot also specify it in the columns key. Such a spec caused the query to be generated as:
create table foo (
year int,
month int
)
partitioned by (
year int
)
...
Which is definitely invalid: https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-partition.html#parameters. If we follow along with the docs, there are two valid scenarios. Scenario (A):
create table foo (
year int,
month int
)
partitioned by (
year
)
...
This is scenario (A), which is what you've proposed, and I think it will work fine. Scenario (B):
create table foo (
month int
)
partitioned by (
year int
)
...
With the current PR, instead of a query like the above we get something like the following, which would then be incorrect since there is no data type on the year column anywhere:
create table foo (
month int
)
partitioned by (
year
)
...
Assuming a YAML in our "traditional format" (which I believe many users of this package have in their projects):
sources:
- name: foo
external:
..
partitions:
- name: year
data_type: int
columns:
- name: month
data_type: int
I think we should support both scenario (A), where the partition also appears in the column spec, and scenario (B), where it appears only in partitions. What do you think? The Jinja does get a bit more complicated because we need to check whether a partition is also specified in the column spec: if it is, omit the data type in the partitioned by clause; otherwise, include it.
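The branching described above can be sketched in Python (a hypothetical helper for illustration only; the package itself implements this as Jinja in the macro): include the data type in the partitioned by clause only when the partition is not already declared in the column spec.

```python
def partitioned_by_clause(partitions, columns):
    """Build a Spark PARTITIONED BY clause from lists of {name, data_type} dicts."""
    column_names = {c["name"] for c in columns}
    parts = []
    for p in partitions:
        if p["name"] in column_names:
            # Scenario (A): the column is already typed in the column spec,
            # so the clause must use the bare name.
            parts.append(p["name"])
        else:
            # Scenario (B): the partition is not in the column spec,
            # so its data type belongs in the clause.
            parts.append(f"{p['name']} {p['data_type']}")
    return "partitioned by (" + ", ".join(parts) + ")"
```

For example, a year partition that also appears in columns renders as `partitioned by (year)`, while one that does not renders as `partitioned by (year int)`.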
@jeremyyeo Thanks for taking the time to take a detailed look into this.
As you point out, there are two acceptable approaches here per the docs. Specifying a column type in the PARTITIONED BY clause is only necessary when the partition is not included in the columns; if the partition is included in the columns, then no column type is necessary. For example, both of these are valid queries and result in identical tables being created:
create table adls.dummy_csv (
employee_id int,
employee_name string
) using csv
options ('sep' = ',', 'header' = 'true')
partitioned by (year int, month int, day int)
location '/mnt/test/dummy'
create table adls.dummy_csv (
employee_id int,
employee_name string,
year int,
month int,
day int
) using csv
options ('sep' = ',', 'header' = 'true')
partitioned by (year, month, day)
location '/mnt/test/dummy'
Given the outcome of each approach is the same, the implemented approach does not matter hugely.
Your request is to handle cases where:
a) the partition is specified only in partitions;
b) the partition is specified in both columns and partitions.
In both cases the partition is specified with both a name and a data_type. (The question of what happens if inconsistent data_type values are specified is not addressed.)
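The unaddressed consistency question could be caught with a simple validation step. A minimal Python sketch (a hypothetical helper, not part of this package or PR):

```python
def check_partition_types(partitions, columns):
    """Raise if a partition appears in both lists with conflicting data types."""
    column_types = {c["name"]: c["data_type"] for c in columns}
    for p in partitions:
        declared = column_types.get(p["name"])
        if declared is not None and declared != p["data_type"]:
            raise ValueError(
                f"partition {p['name']!r} is {p['data_type']!r} in partitions "
                f"but {declared!r} in columns"
            )
```

In the macro itself this would translate to a compiler error raised from Jinja when the two declared types disagree.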
The latest commits use the second approach from above to handle cases where partitions are specified according to case a) or case b):
version: 2
sources:
- name: adls
tables:
- name: dummy_csv_partitions_as_columns
external:
location: '/mnt/test/dummy'
using: csv
options:
sep: ','
header: 'true'
partitions:
- name: year
data_type: int
- name: month
data_type: int
- name: day
data_type: int
columns:
- name: employee_id
data_type: int
- name: employee_name
data_type: string
- name: year
data_type: int
- name: month
data_type: int
- name: day
data_type: int
- name: dummy_csv_partitions_only
external:
location: '/mnt/test/dummy'
using: csv
options:
sep: ','
header: 'true'
partitions:
- name: year
data_type: int
- name: month
data_type: int
- name: day
data_type: int
columns:
- name: employee_id
data_type: int
- name: employee_name
data_type: string
The generated SQL is identical for both sources:
create table adls.dummy_csv_partitions_as_columns (
employee_id int,
employee_name string,
year int,
month int,
day int
) using csv
options ('sep' = ',',
'header' = 'true')
partitioned by (year, month, day)
location '/mnt/test/dummy'
...
create table adls.dummy_csv_partitions_only (
employee_id int,
employee_name string,
year int,
month int,
day int
) using csv
options ('sep' = ',',
'header' = 'true')
partitioned by (year, month, day)
location '/mnt/test/dummy'
The sample source has been updated to prefer the second approach as this is cleaner.
@jeremyyeo Long comment I know, what do you think?
@jeremyyeo What do you think of this updated approach?
Hey @pgoslatara, apologies for the slow response. This looks good to me :) I added an {% if partitions %} to check for the existence of the partitions key, because we were iterating even when there was no partitions key in the source yml, causing the macro to error since we can't iterate over None.
The misleading test pass strikes again (#171), but that's beside the point :P
Thanks for the help @pgoslatara.
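The None-iteration problem the guard fixes can be shown with a short Python sketch (illustrative only; in the macro the equivalent guard is the {% if partitions %} block):

```python
def render_partitions(partitions):
    """Return partition names, tolerating a missing (None) partitions key."""
    rendered = []
    if partitions:  # guard: without this, `for p in None` raises TypeError
        for p in partitions:
            rendered.append(p["name"])
    return rendered
```

Without the truthiness check, a source yml with no partitions key passes None into the loop and the render fails.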
Setup
CSV file:
Source:
Description & motivation
Addressing #160.
Previously this package handled Spark partitions like this:
Note that the data type is included in the partitioned by clause, which causes the query to fail. This PR corrects this query to:
The sample source has also been updated to correctly demonstrate how to specify partitions.
Checklist