SBurwash opened 1 year ago
FYI, dbt-bigquery emits information about slot_millis. As far as I can tell, we technically surface this information already.
When I execute `dbt run`, it prints the following to the log:
```
24 of 26 OK created sql incremental model <model name> ............... [CREATE TABLE (0.0 rows, 7.6 KB processed) in 3.15s]
3 of 4 OK created sql incremental model <model name> ........... [SCRIPT (4.8 MB processed) in 3.69s]
```
I would like to have a test that collects the bytes processed and detects anomalies over it. It would help me track my model creation costs and find bugs.
As noted in (almost) duplicate issue #617, we also want this for test-runs. A wrongly configured test could rack up the BigQuery bill.
I would be interested in that feature as well!
I guess the PR https://github.com/elementary-data/dbt-data-reliability/pull/146/ could serve as a blueprint for this implementation.
Maybe in a similar way, we could extract the bytes_scanned from the run_results.json.
Here is a real example; only the model and package names are anonymized:
```json
"results": [
  {
    "status": "success",
    "timing": [
      {
        "name": "compile",
        "started_at": "2023-02-02T11:50:38.273113Z",
        "completed_at": "2023-02-02T11:50:38.286318Z"
      },
      {
        "name": "execute",
        "started_at": "2023-02-02T11:50:38.286789Z",
        "completed_at": "2023-02-02T11:51:21.805887Z"
      }
    ],
    "thread_id": "Thread-1",
    "execution_time": 43.534241676330566,
    "adapter_response": {
      "_message": "SCRIPT (495.1 GB processed)",
      "code": "SCRIPT",
      "bytes_processed": 531637247175
    },
    "message": "SCRIPT (495.1 GB processed)",
    "failures": null,
    "unique_id": "model.my_package.my_model_name"
  }
],
```
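Given that `adapter_response` shape, extracting the metric is fairly mechanical. Here is a minimal sketch (not Elementary's implementation; the file path, function names, and z-score threshold are all assumptions) of pulling `bytes_processed` per model out of `run_results.json` and flagging a run that deviates from its own history:

```python
import json
import statistics


def extract_bytes_processed(run_results_path):
    """Map each model's unique_id to the bytes_processed its adapter reported.

    Models whose adapter response lacks bytes_processed (e.g. non-BigQuery
    adapters) are simply skipped.
    """
    with open(run_results_path) as f:
        run_results = json.load(f)
    usage = {}
    for result in run_results.get("results", []):
        adapter_response = result.get("adapter_response") or {}
        if "bytes_processed" in adapter_response:
            usage[result["unique_id"]] = adapter_response["bytes_processed"]
    return usage


def is_anomalous(history, current, z_threshold=3.0):
    """Naive z-score check of the current run against past runs of one model."""
    if len(history) < 2:
        return False  # not enough history to judge
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > z_threshold
```

A real test would persist the per-model history between runs (e.g. in a warehouse table) instead of keeping it in memory, but the extraction step is the same.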
Hi @jups23,
One of the challenges here is that dbt adapters don't have a standard response: Snowflake returns the query_id and others don't, BigQuery returns the bytes_processed and others don't.
I do think that adding the query_id only for Snowflake, as we did, is the pragmatic approach, and I also see that it prompts Elementary users to ask dbt and the adapter developers to add this to their code. So maybe it will make an impact.
Anyway, if you want to contribute and add the BQ bytes_processed, I think there is already a lot of demand for it.
@Hadarsagiv @oravi any objections here?
Hi @jessicaoblira, This is a popular request and it would be awesome if you would add it, many users would be happy! I think what I would do is add the entire adapter response as a field, because there are different fields returned by each adapter. Later on we could think if we want to create a view with the flattened response for each adapter, or add fields to the main table that would be null if the adapter does not return the field.
This PR to add the query_id to the run_results could serve as an example of how you could implement this 😁
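To sketch the "store the whole adapter response, flatten later" idea from above (purely illustrative; the field lists and function names are assumptions, not Elementary's actual schema, and adapter responses vary between dbt versions):

```python
import json

# Per-adapter fields we might later promote to dedicated columns or a
# flattened view. Illustrative only; real adapter responses differ by version.
KNOWN_FIELDS = {
    "snowflake": ["query_id"],
    "bigquery": ["bytes_processed"],
}


def serialize_adapter_response(result):
    """Store the entire adapter response as a single JSON string field,
    so nothing is lost even for adapters we haven't mapped yet."""
    return json.dumps(result.get("adapter_response") or {})


def flatten_adapter_response(raw_json, adapter_type):
    """Pull out the adapter-specific fields; unknown adapters yield nothing,
    and missing fields come back as None (i.e. NULL in the main table)."""
    response = json.loads(raw_json)
    return {field: response.get(field) for field in KNOWN_FIELDS.get(adapter_type, [])}
```

The design trade-off is exactly the one described in the comment: the raw JSON column is lossless and adapter-agnostic, while the flattened view gives queryable columns that are simply NULL when an adapter doesn't report the field.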
@Maayan-s Can you tell me how we can implement the feature for BigQuery?
Is your feature request related to a problem? Please describe. Our goal is to obtain metrics that are pertinent to billing costs in a warehouse, e.g.:
Describe the solution you'd like As with the model run tab, we would like a tab which displays "number of bytes processed" or "estimated query cost" so that we can see which models cost us the most.
Being able to sum up these metrics by the number of runs over a time period would also be incredible.
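As a rough illustration of that kind of roll-up (the record layout, function name, and price constant are placeholders I made up; check BigQuery's current on-demand pricing before trusting any dollar figure):

```python
from collections import defaultdict

# Placeholder rate; verify against current BigQuery on-demand pricing.
USD_PER_TIB = 5.0
BYTES_PER_TIB = 2**40


def summarize_costs(runs):
    """Aggregate bytes processed per model across many dbt runs.

    `runs` is a list of (model_unique_id, bytes_processed) tuples collected
    from successive run_results.json files over the chosen time period.
    """
    totals = defaultdict(lambda: {"runs": 0, "bytes": 0})
    for unique_id, bytes_processed in runs:
        totals[unique_id]["runs"] += 1
        totals[unique_id]["bytes"] += bytes_processed
    for stats in totals.values():
        stats["estimated_usd"] = stats["bytes"] / BYTES_PER_TIB * USD_PER_TIB
    return dict(totals)
```

Sorting the summary by `estimated_usd` would surface the most expensive models, which is exactly the "which model costs us the most" view the request describes.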
Describe alternatives you've considered We are looking at having our own in-house alternative following these tutorials:
Additional context Slack thread where this was discussed: https://elementary-community.slack.com/archives/C03LPJ5EYU9/p1669213639995469
Would you be willing to contribute this feature? We are in contact with the BigQuery team to see if they have an out-of-the-box solution. We will keep you guys updated.
ELE-60