Closed mdesmet closed 3 years ago
Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.
In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above. If you have questions about the CLA, or if you believe you've received this message in error, don't hesitate to ping @drewbanin.
CLA has not been signed by users: @mdesmet
hello @mdesmet
Do you think it is necessary to apply the same implementation for `__create_csv_table` as well? When seeding, this macro is called instead of `create_table_as`.
I need a comma here, before '\n', to run with our Trino/Hive configuration. Not sure about others. https://github.com/dbt-labs/dbt-presto/pull/58/commits/650fb2781c268ea8399b91fd05ddf01a055a582b#diff-0b5044213c0ce72456e38a752114fedb09c0623a7256be2a2fa3c6b4a7291f2bR65
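The effect of that missing comma can be sketched in plain Python (the column names below are made up for illustration; the real logic lives in the Jinja macro):

```python
# Sketch (not the actual dbt-presto macro) of why the join separator
# matters when building the column list of a CREATE TABLE statement.
# Column names and types here are hypothetical.
columns = [("id", "bigint"), ("name", "varchar"), ("created_at", "timestamp")]

# Joining with just '\n' puts each definition on its own line but leaves
# no delimiter between them, which Trino/Hive rejects as invalid DDL:
broken = "\n".join(f"{col} {typ}" for col, typ in columns)

# Joining with ',\n' keeps one definition per line AND delimits them:
fixed = ",\n".join(f"{col} {typ}" for col, typ in columns)

ddl = f"create table my_table (\n{fixed}\n)"
print(ddl)
```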
@bachng2017: the comma is indeed necessary. Serves as a good reminder to myself to actually test before committing last minute changes.
Following two points:
Hi @bachng2017 ,
I have an iceberg docker setup running and there it works for both the seeds and the models.
dbt_project.yaml:

```yaml
seeds:
  trino_project:
    +catalog: iceberg
    +schema: datalake
    +with_props:
      format: "'PARQUET'"
      partitioning: ARRAY['bucket(series_reference, 2)']
```
This generates the following CREATE TABLE statement on Trino:

```sql
create table "iceberg"."datalake"."trx" (
  Series_reference VARCHAR,
  Period DOUBLE,
  Data_value DOUBLE,
  Suppressed VARCHAR,
  STATUS VARCHAR,
  UNITS VARCHAR,
  Magnitude INTEGER,
  Subject VARCHAR,
  Group_name VARCHAR,
  Series_title_1 VARCHAR,
  Series_title_2 VARCHAR,
  Series_title_3 VARCHAR,
  Series_title_4 INTEGER,
  Series_title_5 INTEGER
) WITH (
  format = 'PARQUET',
  partitioning = ARRAY['bucket(series_reference, 2)']
)
```
Cool, and thanks :) It seems the maintainers are waiting for your CLA to be signed. Hope this will be merged soon.
CLA has not been signed by users: @mdesmet
The CLA has been signed after the first commit.
@jtcohen6: any other remarks before this PR can be merged?
Hi @jtcohen6,
The concept is called table properties, so maybe `properties` is a better name.
I have considered the approach mentioned in #53. However, Presto/Trino differs from Athena in that it supports many more connectors.
For example, take the Kudu connector:

```sql
CREATE TABLE user_events (
  user_id int WITH (primary_key = true),
  event_name varchar WITH (primary_key = true),
  message varchar,
  details varchar WITH (nullable = true, encoding = 'plain')
) WITH (
  partition_by_hash_columns = ARRAY['user_id'],
  partition_by_hash_buckets = 5,
  number_of_replicas = 3
);
```
Another example is the Hive connector:

```sql
CREATE TABLE hive.avro.avro_data (
  id bigint
)
WITH (
  format = 'AVRO',
  avro_schema_url = '/usr/local/avro_data.avsc'
)
```
Another one for ClickHouse:

```sql
CREATE TABLE default.trino_ck (
  id int NOT NULL,
  birthday DATE NOT NULL,
  name VARCHAR,
  age BIGINT,
  logdate DATE NOT NULL
)
WITH (
  engine = 'MergeTree',
  order_by = ARRAY['id', 'birthday'],
  partition_by = ARRAY['toYYYYMM(logdate)'],
  primary_key = ARRAY['id'],
  sample_by = 'id'
);
```
It would be difficult to maintain and keep up with every connector, hence the freedom of the `dict[str, str]` approach.
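The connector-agnostic idea can be sketched in Python: a free-form dict of property names to values is rendered verbatim into a `WITH` clause (the helper name and property values below are illustrative, not part of dbt-presto):

```python
# Sketch of rendering a free-form dict of table properties into a
# Trino WITH clause. Values are passed through verbatim (e.g. "'PARQUET'"
# keeps its SQL quotes), which is what keeps this connector-agnostic.
def render_with_props(props: dict) -> str:
    if not props:
        return ""
    body = ",\n  ".join(f"{key} = {value}" for key, value in props.items())
    return f"WITH (\n  {body}\n)"

props = {
    "format": "'PARQUET'",
    "partitioning": "ARRAY['bucket(series_reference, 2)']",
}
print(render_with_props(props))
```

Because the dict is rendered as-is, the same mechanism covers Kudu, Hive, ClickHouse, or any other connector's properties without the plugin needing to know about them.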
Thanks for the thorough response @mdesmet! I'm happy with `properties` as the config name, and I totally buy your rationale that there are too many connector-specific properties to state them all explicitly. That flexibility is worth the cost of losing finer-grained config inheritance + validation.
We don't currently have a suite of custom tests for Presto-specific functionality in this plugin. The code change looks straightforward, and most of the logic lives inside Jinja macros, so users have the ability to override/reimplement if they need to. If this is something you've tested locally, and verified that it's working for the cases you're interested in, I'd be ok with merging this for inclusion in v0.21.
Great, it's tested locally and working. Testing it automatically is definitely possible by including a few Docker images; that's how I've tested it, see https://github.com/innoverio/modern-data-stack
Very cool repo! I hadn't seen it before, thanks for linking to it.
I'm thinking more along the lines of custom dbt integration tests to automate in CI, such as these ones for dbt-spark. That leverages a testing framework we've played around with, but haven't all the way locked down. If we find we need automated testing for more complex dbt-presto features, we should set up similar scaffolding.
I'll backport onto 0.21.latest now, so that this feature will become available in the next v0.21 prerelease.
hello @mdesmet.
Talking about this, do you think we should have two kinds of properties: the current one and something like `table_properties`?
In the macro implementation, we would then merge those two sets of properties before building the final one. This would help solve the issue of common vs. individual settings while still letting users use different formats for different connectors.
For example, we could have `transactional: true` set at the project level while `partitioned_by` is set at the table level of the config file, or in each model. On the other hand, increasing the number of property keywords might make users more confused.
@bachng2017 : Can you create a new issue for further discussion?
I have the following thoughts:
Fixes #53
By adding `with_props` in the model config, we enable users to add properties when creating tables. As Presto/Trino supports a lot of different connectors, a `Dict[String, String]` type provides flexibility for now.
The following query is executed on Trino