ClickHouse / dbt-clickhouse

The Clickhouse plugin for dbt (data build tool)
Apache License 2.0
245 stars 107 forks source link

Standard `table` materialization does not work with ReplicateMergeTree engine #95

Closed mharrisb1 closed 10 months ago

mharrisb1 commented 2 years ago

Not sure if this is related to #14 since I didn't explore the attached PR, but I wanted to call out this problem specifically.

It is my understanding that the way dbt-clickhouse handles table materializations is to:

  1. Rename table (if exists) to {{ this }}__dbt_tmp or something similar (I don't have the logs in front of me)
  2. Create a new table with name {{ this }}
  3. Drop temp table

With the Replication tables, this results in an error because there already exists a table with the replica path of /clickhouse/tables/{shard}/{database}/{table_name}.

I am currently getting around this using an incremental materialization with a config like:

{{ config(
    materialized="incremental",
    engine="ReplicatedMergeTree('/clickhouse/tables/{database}/{table}', '{replica}')",
    partition_by=[...],
    order_by=[...],
    unique_key=[...],
    pre_hook="truncate table if exists {{ this }} on cluster '...'",
    inserts_only=true
) }}

I'm not sure how we could handle a simple table materialization using the current rename-create-drop strategy since I don't believe you can update the replica path after creation and I don't believe there is a random value substitution provided which could potentially allow a replica path like /clickhouse/tables/{shard}/{database}/{table_name}__{random_uuid}

genzgd commented 2 years ago

Some of this may be solved by using the Replicated database engine. See https://github.com/ClickHouse/ClickHouse/issues/31661. I'll be doing some testing with Replicated databases over the next couple of weeks.

genzgd commented 2 years ago

Tests work with with Replicated database engine so explicit support for ReplicatedMergeTree engines will probably not be a priority.

mharrisb1 commented 2 years ago

For anyone that stumbles across this issue and is still using ReplicatedMergeTree engine like us, here is a better workaround:

Project config

# dbt_project.yml

vars:
  # Creates a unique ID for this dbt run using the run's timestamp
  run_id: "{{ run_started_at.strftime('%Y%m%d%H%M%S%f') }}"

  # Use this engine template for all models materialized as `table` that use `ReplicatedMergeTree` engine
  # Zookeeper path will use default layout but the table name is suffixed with the run's ID
  engine_template: >
    ReplicatedMergeTree('/clickhouse/tables/{shard}/{database}/{table}_{{ var("run_id") }}', '{replica}')

Model config

-- my_table.sql

{{ config(
    materialized="table",
    engine=var("engine_template"),
    partition_by=[...],
    order_by=[...],
    unique_key=[...]
) }}

SELECT ...
FROM ...
tema-popov commented 1 year ago

@mharrisb1 Is run_id scheme somehow differs from using standard {uuid} clickhouse have? i.e. engine = "ReplicatedReplacingMergeTree('/clickhouse/tables/{layer}-{shard}/{database}/{table}/{uuid}', '{replica}')"

mharrisb1 commented 1 year ago

@tema-popov I wasn't aware of that template. That would be a much better solution than mine. I will test it out and update this issue if it works. Thanks!

genzgd commented 1 year ago

@mharrisb1 Do you know if this is fixed by the latest update + the {uuid} macro?