elementary-data / elementary

The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
https://www.elementary-data.com/
Apache License 2.0

[ELE-60] Obtain model run billing metrics #420

Open SBurwash opened 1 year ago

SBurwash commented 1 year ago

Is your feature request related to a problem? Please describe. Our goal is to obtain metrics that are relevant to billing costs in a warehouse, e.g.:

Describe the solution you'd like As with the model runs tab, we would like a tab that displays "number of bytes processed" or "estimated query cost" so that we can see which models cost us the most.

Being able to sum these metrics across the number of runs in a time period would also be incredible.

Describe alternatives you've considered We are looking at having our own in-house alternative following these tutorials:

Additional context Slack thread where this was discussed: https://elementary-community.slack.com/archives/C03LPJ5EYU9/p1669213639995469

Would you be willing to contribute this feature? We are in contact with the BigQuery team to see if they have an out-of-the-box solution. We will keep you guys updated.

ELE-60

yu-iskw commented 1 year ago

FYI, dbt-bigquery emits information about slot_millis. As far as I can tell, we could technically embed that information as well.

https://github.com/dbt-labs/dbt-bigquery/issues/194

rantibi commented 1 year ago

When I execute dbt run, it prints the following to the log:

24 of 26 OK created sql incremental model <model name>............... [CREATE TABLE (0.0 rows, 7.6 KB processed) in 3.15s]
3 of 4 OK created sql incremental model <model name> ........... [SCRIPT (4.8 MB processed) in 3.69s]

I wish to have a test that collects the bytes processed and detects anomalies in it. It would help me track my models' creation costs and find bugs.

bendiktv2 commented 1 year ago

As noted in (almost) duplicate issue #617, we also want this for test-runs. A wrongly configured test could rack up the BigQuery bill.

willi-mueller commented 1 year ago

I would be interested in that feature as well!

I guess the PR https://github.com/elementary-data/dbt-data-reliability/pull/146/ could serve as a blueprint for this implementation.

Maybe in a similar way, we could extract the _bytes_scanned from the run_results.json.

Here is a real example, just model and package names are anonymized:

 "results": [
        {
            "status": "success",
            "timing": [
                {
                    "name": "compile",
                    "started_at": "2023-02-02T11:50:38.273113Z",
                    "completed_at": "2023-02-02T11:50:38.286318Z"
                },
                {
                    "name": "execute",
                    "started_at": "2023-02-02T11:50:38.286789Z",
                    "completed_at": "2023-02-02T11:51:21.805887Z"
                }
            ],
            "thread_id": "Thread-1",
            "execution_time": 43.534241676330566,
            "adapter_response": {
                "_message": "SCRIPT (495.1 GB processed)",
                "code": "SCRIPT",
                "bytes_processed": 531637247175
            },
            "message": "SCRIPT (495.1 GB processed)",
            "failures": null,
            "unique_id": "model.my_package.my_model_name"
        }
    ],
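A sketch of that extraction step, assuming the `run_results.json` shape shown above (only some adapters, e.g. dbt-bigquery, include `bytes_processed` in `adapter_response`, so missing values are skipped):

```python
# Aggregate bytes_processed per node from a parsed dbt run_results.json dict.
def bytes_processed_by_node(run_results: dict) -> dict[str, int]:
    totals: dict[str, int] = {}
    for result in run_results.get("results", []):
        response = result.get("adapter_response") or {}
        processed = response.get("bytes_processed")
        if processed is not None:
            unique_id = result["unique_id"]
            totals[unique_id] = totals.get(unique_id, 0) + processed
    return totals

# Demo using the figures from the excerpt above; in a real project you would
# json.load the target/run_results.json artifact instead.
sample = {
    "results": [
        {
            "unique_id": "model.my_package.my_model_name",
            "adapter_response": {"code": "SCRIPT", "bytes_processed": 531637247175},
        }
    ]
}
print(bytes_processed_by_node(sample))
```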
Maayan-s commented 1 year ago

Hi @jups23, One of the challenges here is that dbt adapters don't have a standard response: Snowflake returns the query_id and others don't; BigQuery returns the bytes_processed and others don't.

I do think that adding the query_id only for Snowflake was the pragmatic approach, and I also see that it prompts Elementary users to ask dbt and the adapter developers to add this to their code. So maybe it will make an impact.

Anyway, if you want to contribute and add the BQ bytes_processed, I think it already has a lot of demand. @Hadarsagiv @oravi any objections here?

Maayan-s commented 1 year ago

Hi @jessicaoblira, This is a popular request and it would be awesome if you would add it, many users would be happy! I think what I would do is add the entire adapter response as a field, because there are different fields returned by each adapter. Later on we could think if we want to create a view with the flattened response for each adapter, or add fields to the main table that would be null if the adapter does not return the field.

This PR to add the query_id to the run_results could serve as an example of how to implement this 😁
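The "store the entire adapter response, flatten later" idea could look roughly like this. A minimal sketch in Python; the flattened field names below are assumptions, since real adapter responses vary by warehouse:

```python
import json

# Persist the raw adapter_response as a JSON string, then flatten a few
# known fields into columns that stay None (NULL in the table) whenever an
# adapter does not return that field. Field names here are illustrative.
KNOWN_FIELDS = ("query_id", "bytes_processed", "slot_ms", "rows_affected")

def flatten_adapter_response(raw_json: str) -> dict:
    response = json.loads(raw_json)
    return {field: response.get(field) for field in KNOWN_FIELDS}

# BigQuery-style response: bytes_processed is set, Snowflake-style
# fields such as query_id come back as None.
row = flatten_adapter_response('{"code": "SCRIPT", "bytes_processed": 531637247175}')
print(row)
```

Keeping the raw JSON alongside the flattened columns means new adapter fields can be backfilled later without re-running dbt.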

yu-iskw commented 1 year ago

@Maayan-s Can you tell me how we can implement the feature for BigQuery?