dbt-labs / dbt-bigquery

dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.
https://github.com/dbt-labs/dbt-bigquery
Apache License 2.0

Report bytes processed after running BigQuery SQL #14

Closed: watermanio closed this issue 2 years ago

watermanio commented 4 years ago

Describe the feature

As BigQuery bills users based on the volume of data processed, it's fairly important for users to be aware of this while developing models and pipelines. In the BQ UI this is fairly clearly presented after each query run, and the same information is available in the jobs.query API response.

You can find the totalBytesProcessed documented in the linked API documentation.
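
For reference, here's a minimal sketch (not dbt's internal code) of how the google-cloud-bigquery Python client exposes these figures once a job has finished; the public dataset is just a stand-in example:

from google.cloud import bigquery

client = bigquery.Client()

# Run any query; the finished job object carries the same byte counts the
# BQ UI shows after a run.
job = client.query(
    "SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` LIMIT 10"
)
job.result()  # wait for the query to finish

print("bytes processed:", job.total_bytes_processed)
print("bytes billed:   ", job.total_bytes_billed)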

It would be helpful to see this output in the JSON logging - example:

{"timestamp": "2020-01-30", "message": "13:25:37 | 24 of 27 OK created incremental model dbt_andy.customers............. [CREATE TABLE (450) in 2.82s]", "channel": "dbt", "level": 11, "levelname": "INFO", "extra": {"unique_id": "model.customers", "bytes_processed": 4882}}

And also at the end of a run with the default log formatter:

13:16:52 | Finished running 23 incremental models, 4 table models, 1 hook in 35.87s
13:16:52 | 750mb of data processed

Describe alternatives you've considered

If you have enabled Export to BigQuery within your Google Cloud billing project, it is possible to query how much data the service account your dbt user runs as has processed. This could then be queried and monitored outside of dbt, but it would be much more helpful to have an indication of query size in real time - this is an important optimisation when building BigQuery pipelines.

Who will this benefit?

This is very important for all users of BigQuery who have enough data stored to have a reasonable spend, but who are not so large that they have purchased commitments (flat fee / all you can eat) from Google. Even then, it's still best practice to pay attention to how much data you are processing...

tnightengale commented 4 years ago

I would really like to see something like this as well. I am investigating a strategy to monitor and alert our org of query costs, whilst using dbt.

drewbanin commented 4 years ago

Thanks for opening this issue @tnightengale! I'm super into this idea (see also dbt-labs/dbt#401)

The questions here have always been: 1) when does this code run, and 2) how is the information rendered?

I'm definitely into the idea of recording the bytes billed in the structured logs and CLI output.

I also really like the idea of fetching the estimated bytes billed from BQ before queries are run, but I'm not sure how/where they should be rendered out to users. Do y'all have any opinions on this?

watermanio commented 4 years ago

1) when does this code run

@drewbanin I think the request for this information is already made roughly here, but the relevant bit of the query_job object is not rendered.

I think the structured log bit might be fairly straightforward - just printing the bytes_billed at the time of querying - but the total bytes billed during a dbt run would need to be stored and calculated somewhere new.

I love your idea of estimated bytes billed - this is also available through the BigQuery API by making a dry_run request - I see there is already a very basic WIP PR for that here: https://github.com/fishtown-analytics/dbt/pull/1567
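
For reference, a rough sketch of that dry-run estimate using the google-cloud-bigquery Python client - the compiled-model path below is a hypothetical example, not anything dbt provides:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical path to a compiled dbt model - substitute your own.
with open("target/compiled/my_project/models/customers.sql") as f:
    compiled_sql = f.read()

# A dry run plans the query without executing it (and is free), and reports
# the bytes it would scan.
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(compiled_sql, job_config=config)

print("estimated bytes processed:", job.total_bytes_processed)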

ryantuck commented 4 years ago

I can't tell if this is possible already, but it would be helpful to get this information prior to a run, to estimate the cost of a query the way I can via the UI.

This would essentially mimic the --dry_run flag available in the bq CLI.

My workaround currently is copying/pasting compiled SQL into the UI, which feels less-than-elegant.

drewbanin commented 4 years ago

@ryantuck this is pretty doable if the run is for a single model in a built-out dataset, but it becomes really challenging in the more general case. What happens if:

  1. You're running dbt for the first time
  2. You changed two models, A and B, with A ---> B. A has been changed to return 1 trillion rows, and B is selecting out of A

How can dbt leverage BQ's dry run capabilities to return realistic byte counts in cases like these? Seems to me like we'd either need to acknowledge the major caveats, or add some severe guardrails to make sure people don't misinterpret the outputs.

ryantuck commented 4 years ago

@drewbanin oh, that's a good call-out. Two technical solutions come immediately to mind:

  1. Limit the estimate-cost function to exactly one model. This seems fine, but subpar.
  2. Make the estimate-cost function assume all models are ephemeral, which would then provide a compiled SQL query that goes fully back to existing sources so BQ can figure out how much you need to scan. This seems like the better approach.

Haven't thought those through too much, so I might be missing something crucial :)

ekhaydarov commented 4 years ago

Would it be possible to get the run time of the job too, as a second dimension? Happy to help contribute.

ekhaydarov commented 4 years ago

Actually, I'm not sure it should be part of dbt's remit. Since this is isolated to BigQuery, it seems like a preference feature of a particular data warehouse. It's actually pretty straightforward to get your data out if you have Cloud Billing audit logs enabled. Here's an example query by Google

You can then just put a model in dbt to build this table, and you're free to query it as much as you want and set alerts on top of it if runtime slots exceed x seconds, bytes billed exceed a threshold, etc.

jml commented 4 years ago

At least for us switching to dbt, not having this is a step backwards. We have cloud billing audit logs enabled, but we very much value being able to see the cost of a job in the same place as the output of the job itself.

hui-zheng commented 4 years ago

Using the audit log is also our current approach. Actually, BigQuery recently released job metadata via INFORMATION_SCHEMA, which makes this kind of query even more accessible: https://cloud.google.com/bigquery/docs/information-schema-jobs
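
For anyone taking that route, a rough sketch of the kind of query involved, run through the Python client; the region-us qualifier and the one-day window are assumptions to adjust for your project:

from google.cloud import bigquery

client = bigquery.Client()

# The region qualifier must match where your jobs run; "region-us" is an
# assumption here. Filter user_email down to your dbt service account.
sql = """
    SELECT
      user_email,
      SUM(total_bytes_processed) AS bytes_processed,
      SUM(total_bytes_billed)    AS bytes_billed
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
      AND job_type = 'QUERY'
    GROUP BY user_email
    ORDER BY bytes_billed DESC
"""

for row in client.query(sql).result():
    print(row.user_email, row.bytes_processed, row.bytes_billed)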

tuanchris commented 3 years ago

I've got this script to parse the log file and print the summed totalBytesProcessed. Would love to see this at the end of a run.

import re

# Pull the human-readable "<size> processed" figures out of a dbt log.
with open('./log.txt', 'r') as f:
    logs = f.read()

bytes_scanned = re.findall(', (.*?) processed', logs)

# dbt prints decimal units (e.g. "1.2 GB"), so use powers of 10 to convert.
units = {"Bytes": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}

def parse_size(size):
    """Turn a string like '1.2 GB' into a byte count."""
    number, unit = [string.strip() for string in size.split()]
    return int(float(number) * units[unit])

def sizeof_fmt(num, suffix='B'):
    """Format a byte count using binary (KiB/MiB/...) units."""
    for unit in ['', 'Ki', 'Mi', 'Gi', 'Ti', 'Pi', 'Ei', 'Zi']:
        if abs(num) < 1024.0:
            return "%3.1f%s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f%s%s" % (num, 'Yi', suffix)

total_bytes = list(map(parse_size, bytes_scanned))
print(sizeof_fmt(sum(total_bytes)))

jtcohen6 commented 3 years ago

Given that this information has been available in the results object for several versions now (within adapter_response.bytes_processed), I think this one could be a good first issue.

I'm going to transfer this into the dbt-bigquery repo. However, this will require an adjustment to the shared method print_results_line, defined in core/dbt/task/run.py, to enable an override (or additional information) provided by the adapter.
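
In the meantime, since the per-model figure already lands in the run artifact, here is a small sketch of summing it after a run; it assumes a dbt version recent enough to write adapter_response into target/run_results.json:

import json

# Each result entry carries the adapter_response dict; on BigQuery it
# includes bytes_processed for the model's query.
with open("target/run_results.json") as f:
    results = json.load(f)["results"]

total = sum(
    r.get("adapter_response", {}).get("bytes_processed", 0) or 0
    for r in results
)
print(f"total bytes processed: {total:,} ({total / 10**9:.2f} GB)")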

github-actions[bot] commented 2 years ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

elyobo commented 1 year ago

Still seems worth doing, bot - please reopen.

Similar to the Python script above, here's a quick hack in JS (it expects the log to be piped to it, e.g. save it as summary.mjs and run node summary.mjs < log.txt).

#!/usr/bin/env node

/**
 * Hack to extract data from `dbt` log data and provide summary stats.
 *
 * Capture the dbt logs and then pipe them to this, e.g.
 *
 * ```sh
 * dbt run | tee dbt.log
 * ./bin/dbt-log-summary.mjs < dbt.log
 * ```
 */

import fs from 'fs'

// Matches dbt's per-model result line, capturing the model name, row count,
// bytes processed, and duration (each with its unit).
const PROCESSED_LINE_RE = /created sql.*model ([a-zA-Z0-9_\.]+).*\(([0-9\.]+)\s*([a-zA-Z]*) rows, ([0-9\.]+)\s*([a-zA-Z]+) processed\).*in ([0-9\.]+)\s*([a-zA-Z]*)/

const COST_PER_TiB = 5

const multipliers = {
  '': 1,
  k: 1_000,
  m: 1_000_000,
  s: 1,
  KB: 1_000,
  MB: 1_000_000,
  GB: 1_000_000_000,
  TB: 1_000_000_000_000,
  KiB: 1_024,
  MiB: 1_024 ** 2,
  GiB: 1_024 ** 3,
  TiB: 1_024 ** 4,
}

const multiply = x =>  multipliers[x]
const sum = arr => arr.reduce((acc, v) => acc + v, 0)
const round = (value, places) => Math.round((value + Number.EPSILON) * (10 ** places)) / (10 ** places)
const format = (value, places = 2) => round(value, places).toLocaleString('en-AU')
const simpleFormat = num => num.toLocaleString('en-AU', { notation: 'compact' })

// Pipe from stdin expected (file descriptor 0); readFileSync is synchronous
const stdin = fs.readFileSync(0, 'utf8')
const lines = stdin.split('\n')
const processed = lines
  .map(line => {
    const [
      , model, _rows, rowsUnits, _data, dataUnits, _duration, durationUnits,
    ] = line.match(PROCESSED_LINE_RE) || []
    if (!model) return

    const multiples = [rowsUnits, dataUnits, durationUnits].map(multiply)
    const [rows, data, duration] = [_rows, _data, _duration].map(Number).map((v, i) => v * multiples[i])

    return {
      model,
      _rows, rowsUnits, rows,
      _data, dataUnits, data,
      _duration, durationUnits, duration,
      line,
    }
  })
  .filter(Boolean)

if (!processed.length) {
  console.warn(`No data extracted from ${lines.length} line(s)!`)
  console.log(stdin)
  process.exit(1)
}

const assert = key => (item) => {
  const val = item[key]
  if (Number.isNaN(val) || val === undefined) {
    throw new Error(`${key} is invalid in ${JSON.stringify(item)}`)
  }
}

;['data', 'rows', 'duration'].map(key => processed.map(assert(key)))

const totalData = sum(processed.map(p => p.data))
const totalRows = sum(processed.map(p => p.rows))
const totalDuration = sum(processed.map(p => p.duration))

console.log(`Appx. cost: USD $${round(totalData / multipliers.TiB * COST_PER_TiB, 2)}`)
console.log(`Total data: ${simpleFormat(totalData)} bytes (${format(totalData / multipliers.TiB)} TiB)`)
console.log(`Total rows: ${simpleFormat(totalRows)}`)
console.log(
  `Total duration: ${simpleFormat(totalDuration)} seconds (${simpleFormat(Math.floor(totalDuration / 60))} minute(s), ${round(totalDuration % 60, 0)} second(s))`,
  '\n* execution time (some concurrent, not start to end time)'
)

console.log('\nmost data')
console.log(processed.sort((a, b) => b.data - a.data).slice(0, 10).map(r => r.line).join('\n'))

console.log('\nmost rows')
console.log(processed.sort((a, b) => b.rows - a.rows).slice(0, 10).map(r => r.line).join('\n'))

console.log('\nlongest duration')
console.log(processed.sort((a, b) => b.duration - a.duration).slice(0, 10).map(r => r.line).join('\n'))

elyobo commented 1 year ago

There's also no logging at all of this data usage from tests, which is a potential cost iceberg (especially with BQ, and with some of the tests having had SELECT * behaviour at times); exposing that information would be fantastic as well. I don't know if that's also hidden in the internals somewhere, as @jtcohen6 mentions is the case for the run output (I guess it would be, if it's being collected in the adapter?).