dataform-co / dataform

Dataform is a framework for managing SQL based data operations in BigQuery
https://cloud.google.com/dataform/docs
Apache License 2.0
835 stars 161 forks source link

BigQuery: Support WRITE_TRUNCATE as writeDisposition mode #1359

Open dali-rmw opened 2 years ago

kolina commented 1 year ago

Hi, @dali-rmw! Can you clarify in which scenarios you'd like to use this setting? If I understand correctly from the docs passing writeDisposition=WRITE_TRUNCATE for BQ jobs allows rewriting a destination table for query results. But in Dataform you can create an action and set the type field of the query to table

config { type: "table" }
SELECT 1 AS TEST

In this case the output table for such action will be rewritten on every run so it is basically an equivalent thing. More details here.

If it's not that you want feel free to describe your scenario :)

bigrogerio commented 1 year ago

hi @dali-rmw! i use dataform in BigQuery/GCP too.

As @kolina said, when you define the config snippet as table, BigQuery always understands this as a writeDisposition=WRITE_TRUNCATE param.

If you want to append data to an existing table, you have to use theincremental type in the config snippet. This would be the same as writeDisposition=WRITE_APPEND param.

Best regards.

joajen commented 1 year ago

I have a scenario where WRITE_TRUNCATE and config{type:"table"} are not exactly the same thing.

I have a BQ materialized view which uses a particular table table_A as one of its sources. Table_A gets its data overnight via a script which uses write_truncate to overwrite the data in the table with new data.

Now, when I try that with Dataform config{type: "table"} it actually does a REPLACE of Table_A.

The problem is that materialized views do not like to have their source tables changed. If you try to select from that materialized view you get Materialized view project_A.schema_A.mat_view references table project_A.schema_A.table_A which was deleted and recreated. The view must be deleted and recreated as well.

While I could drop and create the materialized view, this particular one is expensive to run as a one off - we end up not really getting the benefits of the materialized view.

andres-lowrie commented 1 year ago

@joajen This is an interesting use case I'm curious would the expected output be that dataform create the materialized view and pass along options similar to these to BigQuery api

https://cloud.google.com/bigquery/docs/materialized-views-manage#refresh

or are you talking about that table_A upstream dependency of the mat view and would like dataform to write_trunc that table instead of create or replace it?

joajen commented 1 year ago

@andres-lowrie Thank you. The second point. I want the upstream dependency (ie base table table_A) to be write_truncate.

BQ has limitations on the materialized views - if you delete and recreate a base table the materialized view refresh will fail. See the last point at https://cloud.google.com/bigquery/docs/materialized-views-intro#limitations

dali-rmw commented 10 months ago

Hello,

As mentioned by @joajen, the behavior of a table type involves the deletion and recreation of the table. The issue I'm facing is that I create my tables using Terraform. When I specify the type as "table" in Dataform, it results in Dataform deleting and recreating the table. Consequently, my Terraform state registers this as a deleted table and attempts to recreate it each time I run Terraform.

Ekrekr commented 6 months ago

This is a very reasonable request, but it's complex in its implementation, because we can either:

However these options would be best defined at the config per-table level, but:

I'll keep thinking about what the best solution for this would be.

Ekrekr commented 5 months ago

Note: this implementation for this is similar to the request in https://github.com/dataform-co/dataform/issues/1385, as it interacts with the API in the same way.