dataform-co / dataform

Dataform is a framework for managing SQL based data operations in BigQuery
https://cloud.google.com/dataform/docs
Apache License 2.0
838 stars 160 forks source link

Question: assertion query optimization #1736

Open federicojasson opened 4 months ago

federicojasson commented 4 months ago

Hello 👋

When one defines rowConditions, a view is created with the following query:

SELECT
  '<condition>' AS failing_row_condition,
  *
FROM dataset
WHERE NOT (<condition>)
UNION ALL
...

Reference: https://github.com/dataform-co/dataform/blob/f23795ab4424ac592a8ced9b49d11fa2790882ca/core/compilation_sql/index.ts#L42-L44

I was wondering if the * is really necessary here.

We're trying to follow BigQuery best practices regarding queries, and this forces us to define manual assertions.

I apologize if this has already been answered before, I couldn't find a related issue. Thank you!

BenBirt commented 4 months ago

It's definitely not necessary!

I think the reason it is there is to make debugging the assertion - when it fails - easier.

With a "normal" assertion (i.e. manually-defined), the user gets to choose what column(s) to select into the assertion's view. (Presumably those that are likely to help debug the issue.) Config-block assertions obviously don't have any way for the user to express that, so (I think) we chose to select all columns to aid debuggability.

I'm not against removing it, but do note that SELECT * in a view should not cost anything (because creating a view doesn't actually query any data).

federicojasson commented 4 months ago

Ohh I see, it's for debugging purposes.

I'm not against removing it, but do note that SELECT * in a view should not cost anything (because creating a view doesn't actually query any data).

It does query the data when the assertions are run, which I think is always the case for built-in assertions, correct?

But the only solution I can think of is to have a different query for running the assertions than for the view (used for debugging). I'm not sure how would that work or how hard is to implement that though.

BenBirt commented 4 months ago

It does query the data when the assertions are run, which I think is always the case for built-in assertions, correct?

D'oh, of course, you're absolutely right!

Thanks for the discussion - we will keep this in mind and see if we can figure out options here.

For now though, if you want to avoid SELECT *, I think your best bet is to use "standard" assertions. (And note that if you have common pattern(s), you can always extract those out using JavaScript.)