dbt-labs / dbt-bigquery

dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.
https://github.com/dbt-labs/dbt-bigquery
Apache License 2.0
217 stars 153 forks source link

[CT-264] [CT-262] [Feature] Support BigQuery labels for seeded models #125

Closed darrylng closed 2 years ago

darrylng commented 2 years ago

Is there an existing feature request for this?

Describe the Feature

Current models support +labels: config as follows:

models:
  my_project:
    snowplow:
      +labels:
        domain: clickstream

However, seeded models do not currently support this, at least not that I am aware. Having the +labels: config under seeds: do not currently result in labels being set when running dbt seed to the database.

Describe alternatives you've considered

I have written a macro that is always run as a post-hook for all seeded models.

{% macro set_labels_to_seed_models(model_ref) -%}
    {% if execute %}
        {{ log("set_labels_to_seed_models: About to add labels to model: " ~ model_ref) }}
        {% set set_label_to_model_query %}
            ALTER TABLE {{ model_ref }}
            SET OPTIONS (
                labels=[("domain", "clickstream")]
            )
        {% endset %}
        {%- set result = run_query(set_label_to_model_query) -%}
    {% endif %}

{%- endmacro %}

And in dbt_project.yml, I attach the macro as a post-hook to all models in seeds config:

seeds:
  my_project:
    snowplow:
        +post-hook: "{{ set_labels_to_seed_models(this) }}"

Who will this benefit?

BigQuery users who need to seed data into the database but also need labels applied.

Are you interested in contributing this feature?

Yes, if this is a feature that is wanted.

Anything else?

No response

jtcohen6 commented 2 years ago

@darrylng Thanks for opening! I'm going to transfer this issue over to the dbt-bigquery repo, since that's where the code change will need to happen. There are a few pieces to this, which I'll try to walk through below.

Generally, labels are added as part of the create table as + create view as statements, via the options clause:

https://github.com/dbt-labs/dbt-bigquery/blob/d5d94f9ca5ab6e7380d705420f5681bd84e0b2a9/dbt/adapters/bigquery/impl.py#L732-L734

However, when BigQuery creates a seed, it doesn't use a create table as statement. (The bigquery__create_csv_table macro exists, since it's called by the seed materialization, but it doesn't do anything.) Instead, we use the load_dataframe method, which ultimately calls the BigQuery Python client's load_table_from_file method.

https://github.com/dbt-labs/dbt-bigquery/blob/d5d94f9ca5ab6e7380d705420f5681bd84e0b2a9/dbt/include/bigquery/macros/materializations/seed.sql#L2-L4

https://github.com/dbt-labs/dbt-bigquery/blob/d5d94f9ca5ab6e7380d705420f5681bd84e0b2a9/dbt/adapters/bigquery/impl.py#L668-L687

After that method is called, we do run some alter statements, if persist_docs is enabled, for the purposes of persisting descriptions as table-level and column-level comments:

https://github.com/dbt-labs/dbt-bigquery/blob/d5d94f9ca5ab6e7380d705420f5681bd84e0b2a9/dbt/include/bigquery/macros/materializations/seed.sql#L15-L18

So I think the approach here could look like:

  1. Calling alter table {{ model }} set {{ bigquery_table_options(config, model, temporary) }} within bigquery__load_csv_rows, after load_dataframe
  2. Finding a way to pass labels (and other table options) within the Python code of load_dataframe

Personally, I much prefer the first option! I think this could be a straightforward addition.

I don't see any great integration tests in the plugin today for validating that labels have been applied to models configured with labels, but it should be as simple as a query to the information schema to verify the presence of the expected label: https://cloud.google.com/bigquery/docs/information-schema-tables#table_options_view

With that, I'm going to transfer and mark this one a good first issue.

darrylng commented 2 years ago

Awesome! Thanks @jtcohen6 for your response, I will try and tackle it when I have some time.

darrylng commented 2 years ago

@jtcohen6 I have opened a PR to address this issue, will appreciate your review and feedback. I also noticed that the .changes and changie.yaml files are not committed to this repo, so the CHANGELOG.md was manually edited instead of generated by changie.