dbt-labs / dbt-presto

[ARCHIVED] The Presto adapter plugin for dbt Core
http://getdbt.com/
Apache License 2.0
33 stars 30 forks

Support to add with properties when creating Trino/Presto tables #58

Closed mdesmet closed 3 years ago

mdesmet commented 3 years ago

Fixes #53

By adding `with_props` in the model config, users can set table properties when creating tables. Because Presto/Trino supports many different connectors, a Dict[String, String] type provides the needed flexibility for now.

{{
  config(
    materialized='table',
    with_props={
      "format": "'PARQUET'",
      "partitioning": "ARRAY['bucket(id, 2)']",
    }
  )
}}
select 
  * 
from {{ source('inventory', 'products') }}
WHERE weight < 5

The following query is executed on Trino:

create table "iceberg"."datalake"."low_weight_products__dbt_tmp"
    WITH (format = 'PARQUET',partitioning = ARRAY['bucket(id, 2)'])
  as (

select 
  * 
from "postgres"."inventory"."products"
WHERE weight < 5
  )
cla-bot[bot] commented 3 years ago

Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.

In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above. If you have questions about the CLA, or if you believe you've received this message in error, don't hesitate to ping @drewbanin.

CLA has not been signed by users: @mdesmet

bachng2017 commented 3 years ago

Hello @mdesmet, do you think it is necessary to apply the same implementation to __create_csv_table as well? When seeding, this macro is called instead of create_table_as.

bachng2017 commented 3 years ago

I needed to add a comma here, before the '\n', to make it run with our Trino/Hive configuration. Not sure about other setups. https://github.com/dbt-labs/dbt-presto/pull/58/commits/650fb2781c268ea8399b91fd05ddf01a055a582b#diff-0b5044213c0ce72456e38a752114fedb09c0623a7256be2a2fa3c6b4a7291f2bR65

mdesmet commented 3 years ago

@bachng2017: the comma is indeed necessary. Serves as a good reminder to myself to actually test before committing last-minute changes.

Following two points:

mdesmet commented 3 years ago

Hi @bachng2017 ,

I have an Iceberg Docker setup running, and there it works for both the seeds and the models.

dbt_project.yml:

seeds:
  trino_project:
      +catalog: iceberg
      +schema: datalake
      +with_props:
          format: "'PARQUET'"
          partitioning: ARRAY['bucket(series_reference, 2)']

This generates the following create table statement on Trino:

create table "iceberg"."datalake"."trx" (
  Series_reference VARCHAR,
  Period DOUBLE,
  Data_value DOUBLE,
  Suppressed VARCHAR,
  STATUS VARCHAR,
  UNITS VARCHAR,
  Magnitude INTEGER,
  Subject VARCHAR,
  Group_name VARCHAR,
  Series_title_1 VARCHAR,
  Series_title_2 VARCHAR,
  Series_title_3 VARCHAR,
  Series_title_4 INTEGER,
  Series_title_5 INTEGER
) WITH (
  format = 'PARQUET',
  partitioning = ARRAY['bucket(series_reference, 2)']
)
bachng2017 commented 3 years ago

Cool, and thanks :) It seems the maintainers are waiting for your CLA to be signed. Hope this will be merged soon.


mdesmet commented 3 years ago

The CLA has been signed after the first commit.

@jtcohen6: any other remarks before this PR can be merged?

mdesmet commented 3 years ago

Hi @jtcohen6,

The concept is called table properties, so maybe properties is a better name.

I considered the approach mentioned in #53. However, Presto/Trino differs from Athena in that it supports many more connectors.

For example, take the Kudu connector:

CREATE TABLE user_events (
  user_id int WITH (primary_key = true),
  event_name varchar WITH (primary_key = true),
  message varchar,
  details varchar WITH (nullable = true, encoding = 'plain')
) WITH (
  partition_by_hash_columns = ARRAY['user_id'],
  partition_by_hash_buckets = 5,
  number_of_replicas = 3
);

Another example is the Hive connector:

CREATE TABLE hive.avro.avro_data (
   id bigint
 )
WITH (
   format = 'AVRO',
   avro_schema_url = '/usr/local/avro_data.avsc'
)

Another example is the ClickHouse connector:

CREATE TABLE default.trino_ck (
  id int NOT NULL,
  birthday DATE NOT NULL,
  name VARCHAR,
  age BIGINT,
  logdate DATE NOT NULL
)
WITH (
  engine = 'MergeTree',
  order_by = ARRAY['id', 'birthday'],
  partition_by = ARRAY['toYYYYMM(logdate)'],
  primary_key = ARRAY['id'],
  sample_by = 'id'
);

It would be difficult to maintain and keep up with every connector's properties, hence the flexibility of the dict[str, str] approach.
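To illustrate the idea, here is a minimal sketch of how such a dict could be rendered into a `WITH (...)` clause inside the adapter macros. The macro name `properties_clause` is hypothetical, for illustration only — it is not the actual PR diff:

```jinja
{% macro properties_clause(properties) %}
  {%- if properties -%}
    WITH (
      {%- for key, value in properties.items() %}
        {{ key }} = {{ value }}{{ "," if not loop.last }}
      {%- endfor %}
    )
  {%- endif -%}
{% endmacro %}
```

Because both keys and values are plain strings, the macro stays connector-agnostic: values such as `ARRAY['bucket(id, 2)']` or `'PARQUET'` are passed through verbatim into the generated DDL.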

jtcohen6 commented 3 years ago

Thanks for the thorough response @mdesmet! I'm happy with properties as the config name, and I totally buy your rationale about there being too many connector-specific properties to state them all explicitly. That flexibility is worth the cost of losing finer-grained config inheritance + validation.

We don't currently have a suite of custom tests for Presto-specific functionality in this plugin. The code change looks straightforward, and most of the logic lives inside Jinja macros, so users have the ability to override/reimplement if they need to. If this is something you've tested locally, and verified that it's working for the cases you're interested in, I'd be ok with merging this for inclusion in v0.21.
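For context, overriding an adapter macro from a user project works through dbt's macro dispatch: a macro of the same name defined in the root project takes precedence over the plugin's version. A hedged sketch — the exact signature and body of the dbt-presto macro may differ:

```jinja
{# macros/create_table_as.sql in the user's own project #}
{% macro presto__create_table_as(temporary, relation, sql) -%}
  create table {{ relation }}
    {# hypothetical: render config-provided table properties here #}
  as (
    {{ sql }}
  )
{%- endmacro %}
```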

mdesmet commented 3 years ago

Great, it's tested locally and working. Automated testing is definitely possible by including a few Docker images; that's how I've tested it, see https://github.com/innoverio/modern-data-stack

jtcohen6 commented 3 years ago

Very cool repo! I hadn't seen it before, thanks for linking to it.

I'm thinking more along the lines of custom dbt integration tests automated in CI, such as these for dbt-spark. That leverages a testing framework we've played around with but haven't fully locked down. If we find we need automated testing for more complex dbt-presto features, we should set up similar scaffolding.

jtcohen6 commented 3 years ago

I'll backport onto 0.21.latest now, so that this feature will become available in the next v0.21 prerelease.

bachng2017 commented 3 years ago

Hello @mdesmet. Talking about this, do you think we should have two kinds of properties: the current one and something like table_properties?

In the macro implementation, we would then merge those two property dicts before creating the final statement. This would help solve the issue of common vs. individual settings while making sure users can still pass connector-specific formats.

For example, `transactional: true` could be set at the project level while `partitioned_by` is set at the table level in the config file or in each model. On the other hand, increasing the number of properties keywords might confuse users.
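A hedged sketch of the proposed merge — the config keys `project_properties` and `properties` are assumptions for illustration, not names from this PR — with model-level keys overriding project-level ones:

```jinja
{# Hypothetical: merge project-wide and model-level properties,
   letting the model-level dict win on conflicting keys #}
{%- set merged = {} -%}
{%- do merged.update(config.get('project_properties', {})) -%}
{%- do merged.update(config.get('properties', {})) -%}
{# merged now holds e.g. transactional = true from the project
   plus partitioned_by from the individual model #}
```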

mdesmet commented 3 years ago

@bachng2017 : Can you create a new issue for further discussion?

I have the following thoughts: