dbt-labs / dbt-external-tables

dbt macros to stage external sources
https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/
Apache License 2.0
297 stars 119 forks source link

[QUESTION] Best way to create partitions for data partitioned by year/month/day #123

Closed nickc1 closed 1 year ago

nickc1 commented 2 years ago

Hello!

Thank you for this awesome package. This could really help simplify our architecture.

I'm trying to find the best way to generate the partitions. Our data is currently partitioned in the following format:

year=2020/month=06/day=09

I tried the following:

version: 2

sources:
  - name: mysourcename
    database: mydatabase
    schema: myschema
    loader: S3
    loaded_at_field: collector_tstamp

    tables:
      - name: mytablename
        external:
          location: "s3://data-lake/raw/datasource/models/mymodel"        # Amazon S3 path prefix
          row_format: >                       # Hive specification
            serde 'org.openx.data.jsonserde.JsonSerDe'
            with serdeproperties (
                'strip.outer.array'='false'
            )
          partitions:
            - name: year
              data_type: char(4)
              vals:         # macro w/ keyword args to generate list of values
                macro: dbt.dates_in_range
                args:
                  start_date_str: '2021-12-01'
                  end_date_str: '{{modules.datetime.date.today().strftime("%Y-%m-%d")}}'
                  in_fmt: "%Y-%m-%d"
                  out_fmt: "%Y"
              path_macro: dbt_external_tables.key_value

              # expression: SUBSTR(metadata$filename,16,4)  #ingestion_year=2010
            - name: month
              data_type: char(2)
              vals:         # macro w/ keyword args to generate list of values
                macro: dbt.dates_in_range
                args:
                  start_date_str: '2021-12-01'
                  end_date_str: '{{modules.datetime.date.today().strftime("%Y-%m-%d")}}'
                  in_fmt: "%Y-%m-%d"
                  out_fmt: "%m"
              path_macro: dbt_external_tables.key_value
            - name: day
              data_type: char(2)
              vals:         # macro w/ keyword args to generate list of values
                macro: dbt.dates_in_range
                args:
                  start_date_str: '2021-12-01'
                  end_date_str: '{{modules.datetime.date.today().strftime("%Y-%m-%d")}}'
                  in_fmt: "%Y-%m-%d"
                  out_fmt: "%d"
              path_macro: dbt_external_tables.key_value

        columns:
          - name: col1
            data_type: varchar(255)
            description: "col1 description"

          - name: col2
            data_type: int
            description: "col2 description"

          - name: col3
            data_type: varchar(32)
            description: "col3 description"

          - name: col4
            data_type: varchar(65000)     
            description: "col4 description"

The above works, but since a cartesion product is performed under the hood, I end up recreating partitions multiple times. Is there a more efficient way to do this?

Thanks for the help! And thanks for putting this package together!

jtcohen6 commented 2 years ago

Hey @nickc1, glad you've gotten a lot out of the package! Sorry for the (super) delayed response from me.

Question: Do you need year+ month + day to exist as separate columns? (They're not really independent of each other, right?)

The way I've done this before is by treating these as three components of the same composite "date" column, and then using the year_month_day path macro instead of key_value:

            - name: my_date_column
              data_type: date
              vals:
                macro: dbt.dates_in_range
                args:
                  start_date_str: '2021-12-01'
                  end_date_str: '{{modules.datetime.date.today().strftime("%Y-%m-%d")}}'
                  in_fmt: "%Y-%m-%d"
                  out_fmt: "%Y-%m-%d"
              path_macro: dbt_external_tables.year_month_day
github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 1 year ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.