ClickHouse / dbt-clickhouse

The Clickhouse plugin for dbt (data build tool)
Apache License 2.0
255 stars 113 forks source link

Question: What determines when a `materialized_view` is fully recreated #288

Closed the4thamigo-uk closed 5 months ago

the4thamigo-uk commented 5 months ago

I'm seeing my materialized views being completely recreated from scratch and fully bakfilled, more than necessary. What logic determines when this happens?

the4thamigo-uk commented 5 months ago

hmm, it always seems to fully repopulate the entire table every time I run...

the4thamigo-uk commented 5 months ago

I suggested an idea as a PR for discussion.

tema-popov commented 5 months ago

I have a little bit more broad issue with materialised views which includes the topic of the question.

I use material views over very large tables, so when I try to run a new material view model in dbt, it tries to populate it by running an insert into query.

In my case, it is very large so I Clickhouse fails to run it. For some reason, I expected that it would just create an mv and target table without populating it.

The current logic is also potentially prone to handling data, which will come moments in time between the populate query and creating materialized view, as both operations are not transactional and the populate query takes some time to run.

The potential workaround for me right now is to avoid this big populate query by adding additional conditions into model SQL that would filter only "fresh data".

The overall issue looks very similar to already existing challenges in incremental models which are handled by is_incremental logic. Maybe we should use the same approach both in incremental and materialised_view models?

It would include the issue with unsupported --full-refresh logic, as it is a part of expected incremental model behavior

Maybe I would even expect the materialized view to be a special subtype of the incremental model, where "increments" are handled by created mat view instead of dbt runs.

the4thamigo-uk commented 5 months ago

@tema-popov I also have similar concerns about missing out some data. With the --full-refresh repopulation switched off, there is still a small time gap between dropping the MV and recreating it, in which time data might be missed.

the4thamigo-uk commented 5 months ago

Maybe https://clickhouse.com/docs/en/sql-reference/statements/alter/view should be used to alter the MV query without interrupting the ingeation process?

the4thamigo-uk commented 5 months ago

@tema-popov The PR for this is now merged. Maybe you want to raise a new issue for the problem you are facing?

tema-popov commented 5 months ago

@the4thamigo-uk sure, it is a far broader topic, I agree it should be discussed separately.