TobikoData / sqlmesh

Efficient data transformation and modeling framework that is backwards compatible with dbt.
https://sqlmesh.com
Apache License 2.0
1.65k stars 148 forks source link

Support automatic table properties for Redshift/Snowflake and Incremental by Time models #1391

Open eakmanrq opened 1 year ago

eakmanrq commented 1 year ago

Snowflake should define the time columns as part of the cluster_by key. Redshift should define it as part of it's sort_key key. We need to make sure that if a user defines these in the table properties that it gets merged properly against what is being automatically defined by the model.

Also note that we need to establish canonical terms for these things to make them consistent across engines. Proposal: Partition By:

Sort By:

Dist By:

So for example with partition by time the reasoning for the engine adapter would be this: "If I support partition by, then use that for the time column. If not, use Sort By. If I support neither then do nothing"

erindru commented 3 months ago

For Redshift, do you see any problem exposing DISTKEY as partition_by and removing the "Dist By" concept altogether since it only applies for Redshift?

PiotrKatvitski commented 1 month ago

For Redshift, do you see any problem exposing DISTKEY as partition_by and removing the "Dist By" concept altogether since it only applies for Redshift?

Dist By (aka sharding, distribution) is also supported by other DB engines, such as Greenplum and Clickhouse. Some databases provide it through PARTITION BY syntax (e.g. partition by hash), but it has vastly different use cases and deserves to be called a different thing.

By distinguishing partitioning from sharding we'll also encourage engineers using sqlmesh to find out the difference between those concepts and make the right choice.