dbt-labs / dbt-bigquery

dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.
https://github.com/dbt-labs/dbt-bigquery
Apache License 2.0

[Feature] Implement batch metadata freshness using `INFORMATION_SCHEMA.TABLE_STORAGE` instead of `client.get_table` #1239

Open jtcohen6 opened 5 months ago

jtcohen6 commented 5 months ago

Is this your first time submitting a feature request?

Describe the feature

Move from running a query per table to performing a batch operation.

Running a single query against `INFORMATION_SCHEMA.TABLE_STORAGE` appears to be the most efficient way to calculate freshness for all sources in a project that do not define a custom `loaded_at_field`. Evaluate whether there is a Python client API for this, or whether we need to run the query ourselves.
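As a rough illustration (not the spiked macro itself), a batch query of this shape returns a last-modified timestamp for every candidate source table in a single scan; the project, region qualifier, and schema names below are placeholder assumptions:

```sql
-- Illustrative only: one scan of TABLE_STORAGE covers every source table,
-- instead of one client.get_table() API call per table.
select
    table_schema,
    table_name,
    storage_last_modified_time
from `my-gcp-project`.`region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
where table_schema in ('raw_jaffle_shop', 'raw_stripe')
```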

Copying from https://github.com/dbt-labs/dbt-bigquery/issues/938#issuecomment-2109370037:

I did some further spiking, and after some tweaking I think we have a promising path forward in dbt-bigquery, without requiring any changes to the base dbt-adapters spec or dbt-core.

An initial naive implementation of `bigquery__get_relation_last_modified` (https://github.com/dbt-labs/dbt-bigquery/pull/1234/commits/3b28625102d8ce813f16ab4740c5ba2c7fa0015d), following the same approach we take in dbt-snowflake and dbt-redshift, was an improvement on the current SDK-based implementation; however, the query time still scaled linearly with the number of selected sources. From some simple testing using a project with 50 sources, the batch implementation took ~9s, while the non-batch SDK took ~20s. With 100 sources, the batch implementation took ~12.5s, while the non-batch SDK took ~28s.

However, filtering within the metadata macro by relation schema name, in addition to filtering for exact schema & table name matches (https://github.com/dbt-labs/dbt-bigquery/pull/1234/commits/51243e531e30d204c5c5f8cf83b395ed7d1e90ea), does the trick in making the query time constant! Both a project with 50 sources and one with 100 sources took just under 5s to complete `dbt source freshness`.
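To make the two filtering strategies concrete, here is a rough sketch of what such a macro could look like: the exact schema-and-table predicates mirror the dbt-snowflake / dbt-redshift approach, and the `table_schema in (...)` prefilter is the addition that keeps the query time constant. This is an illustration under assumptions (hard-coded `region-us`, the first relation's project, the column aliases), not the code in the linked commits:

```sql
{% macro bigquery__get_relation_last_modified(information_schema, relations) -%}
  {%- call statement('last_modified', fetch_result=True) -%}

    select
        table_schema as schema,
        table_name as identifier,
        storage_last_modified_time as last_modified,
        {{ current_timestamp() }} as snapshotted_at
    -- simplification: assumes all selected relations live in one project and region
    from `{{ relations[0].database }}`.`region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
    where
        -- schema prefilter: the added predicate that keeps query time constant
        table_schema in (
            {%- for relation in relations -%}
                '{{ relation.schema }}'{%- if not loop.last %}, {% endif -%}
            {%- endfor -%}
        )
        -- exact schema & table matches, as in dbt-snowflake / dbt-redshift
        and (
            {%- for relation in relations -%}
                (table_schema = '{{ relation.schema }}'
                 and table_name = '{{ relation.identifier }}')
                {%- if not loop.last %} or {% endif -%}
            {%- endfor -%}
        )

  {%- endcall -%}
  {{ return(load_result('last_modified')) }}
{%- endmacro %}
```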

And from https://github.com/dbt-labs/dbt-bigquery/issues/938#issuecomment-2109531126:

Other considerations for this approach:

  • BigQuery has a character length limit for queries, so templating many thousands of source table names into the query risks hitting this limit. (This is still better than loading lots of metadata into memory about tables we don't care about as a result of not pushing the predicates down into BQ.) We don't have an existing pattern for running parameterized queries.
  • Querying INFORMATION_SCHEMA.TABLE_STORAGE is not free. In our spike, with a few thousand sources, each query scanned 2GB — not a lot, but potentially more in very large datasets. By comparison, the current approach in main (Python SDK) is free.
  • I believe this would require one information schema query per BQ project configured with sources (not one per region). We have a (slightly janky) existing pattern for interpolating the region into the query, which we can pursue if we're looking to put this code into production; it's not necessary for the investigation to answer the question, "Is this substantially better?" (See the sketch after this list.)
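For illustration of that interpolation (a sketch under assumptions, not dbt-bigquery's existing pattern), the per-project query could be assembled roughly like this, with the region hard-coded as a stand-in for whatever lookup the production version would use:

```sql
{# Sketch only: one TABLE_STORAGE query per BigQuery project that contains sources.
   `region` is hard-coded here; a real implementation would interpolate it via the
   existing region-handling pattern mentioned above. #}
{%- set region = 'us' -%}
{%- for project in relations | map(attribute='database') | unique %}
  {%- call statement('last_modified_' ~ project, fetch_result=True) %}
    select table_schema, table_name, storage_last_modified_time
    from `{{ project }}`.`region-{{ region }}`.INFORMATION_SCHEMA.TABLE_STORAGE
    -- plus the same schema / table-name predicates as in the macro above
  {%- endcall %}
{%- endfor %}
```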

Describe alternatives you've considered

Existing implementation: non-batch, leveraging BigQuery API. This is likely preferable in cases where a project contains only a few sources.

Who will this benefit?

BigQuery users with lots of sources, who want to calculate freshness for them all at once

Are you interested in contributing this feature?

No response

Anything else?

Spike:

adamcunnington-mlg commented 4 months ago

We tested the branch (`pip install git+https://github.com/dbt-labs/dbt-bigquery.git@batch-metadata-benchmarking#egg=dbt-bigquery`) and can confirm a fantastic improvement!

1156 sources completed in 35 seconds. I ran it twice more and the time was very similar. The current implementation takes anywhere between 4 minutes and 12 minutes due to query volatility, but the new solution reduces this to a single query, which is not only a massive speed-up but also a huge reduction in volatility. This is a big win.

The only thing I've not yet fully validated is the accuracy of the reported time from the INFORMATION_SCHEMA table. The BQ docs state that "The data in the INFORMATION_SCHEMA.TABLE_STORAGE view is not kept in real time, and updates are typically delayed by a few seconds to a few minutes". We use the `source_status:fresher+` selector in our build jobs, so the impact of this is that we might not catch things that have changed immediately, but they would be picked up by our subsequent run. That's a potential small latency impact, but a reasonable price for the improvement.

adamcunnington commented 2 months ago

@jtcohen6 @MichelleArk could you please remind me what this feature is waiting on before it can be merged?