dbt-labs / dbt-bigquery

dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.
https://github.com/dbt-labs/dbt-bigquery
Apache License 2.0

[ADAP-912] Support Metadata Freshness #938

Closed peterallenwebb closed 8 months ago

peterallenwebb commented 1 year ago

Describe the feature

Support metadata-based freshness by implementing the new macro and feature flag described in https://github.com/dbt-labs/dbt-core/issues/7012.

Who will this benefit?

Everyone who wants faster freshness results from BigQuery!

adamcunnington-mlg commented 12 months ago

I understand this won't make 1.7, but is it committed to for 1.8? It would have such a huge impact on our (Medialab's) use of dbt.

Happy to help with info on where to find this in the BQ information schema tables. I submitted the original request, which has made it into 1.7 - but we just need the BQ support - thanks!

MichelleArk commented 5 months ago

I've done some spiking and rough benchmarking to see whether a batch implementation of metadata-based freshness would lead to performance gains here. My conclusion is that, given limitations of BigQuery's Python SDK, there isn't currently a way to implement a batch strategy that improves performance for metadata-based source freshness.

Details

The current implementation of metadata-based source freshness iterates over the selected sources and, for each one, fetches a BigQuery Table object and reads its modified attribute to build a freshness response.

It is significantly faster than using a loaded_at_field because the metadata-based strategy directly interacts with the BigQuery Python SDK instead of waiting for a costly query to finish running. However, it still requires 1 API call per selected source.
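
Roughly, the per-source pattern looks like this (a simplified sketch rather than the actual adapter code; the project, dataset, and table names are placeholders):

```python
# Sketch of the current per-source strategy: one get_table() call per
# selected source, reading Table.modified for the freshness timestamp.
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder list of selected sources as (project, dataset, table) tuples.
selected_sources = [
    ("my-project", "jaffle_shop", "raw_orders"),
    ("my-project", "jaffle_shop", "raw_customers"),
]

for project, dataset, table in selected_sources:
    table_ref = f"{project}.{dataset}.{table}"
    bq_table = client.get_table(table_ref)  # 1 API call per source
    last_modified = bq_table.modified       # timezone-aware datetime (or None)
    print(table_ref, last_modified)
```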

Batch Implementation -- Approach 1

A simple implementation would be to iterate over the selected sources. This works, but it is fundamentally the same approach as the current implementation, just called earlier during execution. It would still require the same 1 API call per selected source, and so does not lead to significant performance improvements. This was roughed in here: https://github.com/dbt-labs/dbt-bigquery/pull/1232/commits/29e533f33ecdce77b2b97cf43a7dc29728230d76

From my benchmarking (running 10 invocations of each implementation on 6 sources in jaffle-shop), the current (non-batch) implementation took an average of 2.57s while the batch implementation took an average of 2.62s. The fastest non-batch invocation took 2.35s, and the fastest batch invocation took 2.17s. I believe the variability here actually comes from variability in BigQuery's get_table API, and not from the difference in implementation strategies.

Batch Implementation -- Approach 2

Another implementation would be to fetch all the datasets available to the client upfront, and for each dataset, fetch all its tables. This would (in theory) only require 2 API calls per dataset the client has access to regardless of the number of sources selected. I've roughed this in here: https://github.com/dbt-labs/dbt-bigquery/pull/1232/commits/b1035e2ec4ac98bfc64f3fb717f88f55e0b85f81

However, the objects (DatasetListItem, TableListItem) returned from these two API calls are limited by design, and do not include critical metadata for this functionality (e.g. no modified). From the BQ docs:

A read-only table resource from a list operation. For performance reasons, the BigQuery API only includes some of the table properties when listing tables. Notably, schema and num_rows are missing.

We'd end up needing to fetch the Table object for each TableListItem, getting us back to at least 1 API call per source (details).
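
For reference, Approach 2 looks roughly like this (again a sketch, not the spiked code), which is where the limitation shows up:

```python
# Sketch of Approach 2: list datasets, then list tables per dataset.
# The listing calls are cheap, but TableListItem carries no `modified`
# timestamp, forcing one get_table() call per table anyway.
from google.cloud import bigquery

client = bigquery.Client()

for dataset_item in client.list_datasets():  # one (paged) call for all datasets
    for table_item in client.list_tables(dataset_item.dataset_id):  # one (paged) call per dataset
        # table_item is a TableListItem with no `modified` attribute, so we
        # are back to fetching the full Table object per source.
        full_table = client.get_table(table_item.reference)
        print(full_table.full_table_id, full_table.modified)
```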


If there are other implementation options that would get us to a constant number of API calls for the batch of sources, I'm super open to hearing them, but from my exploration so far it does not seem worthwhile to implement a naive batch strategy: it adds complexity with negligible performance gains.

adamcunnington-mlg commented 5 months ago

@MichelleArk this is a super terrifying conclusion - and it was batch support via BQ in my initial feature request that drove all of this work around source freshness, which has been much anticipated since I raised the issue 15 months ago. I have a team week planned for 3rd June where we are hoping to finally shift to the benefits of this whole endeavour.

In the linked FR, there is correspondence in the comments about how to get this exact information from the INFORMATION_SCHEMA table: [screenshot]

BQ docs are now updated: https://cloud.google.com/bigquery/docs/information-schema-table-storage#schema

Both of the approaches you investigated suffer from the same limitation: you're iterating through table metadata, which I think is the antithesis of a batch approach here. The whole point would be to query the single information schema table once.

:(
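
To illustrate what I mean, something along these lines (a rough sketch, not the exact query from the FR; the region qualifier is just an example) returns last-modified times for every table in one shot:

```python
# Illustrative only: a single query against the region-qualified
# TABLE_STORAGE view covers all tables in the project for that region.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    select
        table_schema,
        table_name,
        storage_last_modified_time
    from `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
"""
for row in client.query(sql).result():  # one query job, regardless of table count
    print(row.table_schema, row.table_name, row.storage_last_modified_time)
```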

MichelleArk commented 5 months ago

@adamcunnington-mlg -- thanks for linking the updated docs, and sorry I hadn't been aware of the INFORMATION_SCHEMA.TABLE_STORAGE option. I hear the disappointment re: not having this implemented yet as part of the recent 1.8 releases.

I did some further spiking using the INFORMATION_SCHEMA.TABLE_STORAGE approach, and after some tweaking I think we have a promising path forward in dbt-bigquery, without requiring any changes to the base dbt-adapters spec or dbt-core.

An initial naive implementation of bigquery__get_relation_last_modified (https://github.com/dbt-labs/dbt-bigquery/pull/1234/commits/3b28625102d8ce813f16ab4740c5ba2c7fa0015d), following the same approach we take in dbt-snowflake and dbt-redshift, was an improvement on the current SDK-based implementation; however, the query time still scaled linearly with the number of selected sources. From some simple testing using a project with 50 sources, the batch implementation took ~9s, while the non-batch SDK approach took ~20s. With 100 sources, the batch implementation took ~12.5s, while the non-batch SDK approach took ~28s (thank you @jtcohen6 for setting up the test sources!)

However, filtering within the metadata macro by relation schema name, in addition to filtering for exact schema & table name matches (https://github.com/dbt-labs/dbt-bigquery/pull/1234/commits/51243e531e30d204c5c5f8cf83b395ed7d1e90ea#), does the trick in making the query time constant! Both the 50-source and the 100-source project took just under 5s to complete dbt source freshness. I wonder if there are other such optimizations we could make to the query to further improve the query latency...
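
For illustration, the query ends up with roughly this shape (placeholder schema and table names; in dbt this is generated by the macro from the selected relations, shown here as plain SQL run via the SDK):

```python
# Rough shape of the filtered batch query described above (illustrative only).
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    select
        table_schema,
        table_name,
        storage_last_modified_time
    from `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
    -- schema-level filter lets BigQuery prune to the relevant datasets
    where table_schema in ('jaffle_shop', 'stripe')
    -- exact schema + table matches for the selected sources
      and (
            (table_schema = 'jaffle_shop' and table_name = 'raw_orders')
         or (table_schema = 'stripe' and table_name = 'payments')
          )
"""
rows = list(client.query(sql).result())  # one query job for the whole batch
```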

Note that this was all done while hard-coding the information_schema in the batch macro, and this will probably need to be revisited for correctness to ensure the correct information schema is used for each set of selected sources, which may lead to increased latency of the command (one query per information schema).


Based on the benchmarking in this spike, I'd expect that a project with ~1000 sources would take about 3 minutes to compute source freshness using the current (non-batch) implementation. @adamcunnington-mlg if you can share, does that line up with what you're seeing in a production project? It would be awesome to get a sense of the performance impact these changes would have on some real-world projects.

adamcunnington-mlg commented 5 months ago

@MichelleArk thanks for the fast response on this - it's much appreciated - and this sounds promising.

I'm not close to how the source freshness process actually works so please forgive any naivety here in my comments/questions:

Overall, I'd expect the source freshness time to be roughly as slow as 1 query of the information schema (which may be very loosely correlated with the size of that result, but should be relatively flat) plus the time for dbt internals to iterate through the tables and extract freshness information - linear, but small. So 3 minutes for 1000 tables surprises me; I'd be expecting more like 10-20 seconds. 3 minutes is definitely better than our current 15 minutes, but not quite what I was expecting. I suspect I'm missing something fundamental here, though, in how this process could work.

jtcohen6 commented 5 months ago

Thank you for the spike and the write-up @MichelleArk!

Other considerations for this approach:

@adamcunnington-mlg Seconding Michelle, I would ask that you please try testing these out in your actual project with many many sources:

  1. The current approach in dbt-bigquery==1.8.0 for metadata-based freshness (Python SDK). While this does scale linearly with the number of sources, each source takes ~0.5s to execute, and you can run with very high concurrency (e.g. --threads 50). Simply define freshness on your sources without defining a loaded_at_field, and for each source freshness execution, dbt will make an API call to calculate its freshness.
  2. The approach outlined in batch-metadata-benchmarking, which uses the information schema table and may scale better:
    python3 -m venv env
    source env/bin/activate
    pip install git+https://github.com/dbt-labs/dbt-bigquery.git@batch-metadata-benchmarking#egg=dbt-bigquery
    dbt source freshness

If you could provide us with concrete numbers for both approaches, that would help me a lot in deciding on the appropriate next step here — to switch this behavior for everyone (strictly better), to consider implementing it as another configurable option (with both pros & cons relative to the v1.8 behavior), or to put it down for good ("do nothing").