dbt-labs / dbt-codegen

Macros that generate dbt code
https://hub.getdbt.com/dbt-labs/codegen/latest/
Apache License 2.0

`generate_model_yaml` does not escape descriptions, leading to invalid and potentially unsafe YAML code #142

Closed wircho closed 6 months ago

wircho commented 10 months ago

Describe the bug

The generate_column_yaml macro simply appends a description to the generated YAML code in this line of code:

    {% do model_yaml.append('        description: "' ~ column_desc_dict.get(column.name | lower,'') ~ '"') %}

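This description is usually an arbitrary upstream description that may contain double quotes ("), which can easily break the YAML. A description containing quotes can also be used to generate potentially unsafe YAML, since everything after the first inner quote is emitted outside the quoted string.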
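To make the failure mode concrete, here is a minimal Python sketch that mimics the string concatenation in the macro line above. It is not the macro itself: the helper name `naive_description_line` and both sample descriptions are hypothetical and exist only for illustration.

```python
def naive_description_line(description: str) -> str:
    # Mirrors the macro's approach: wrap the raw text in double quotes
    # without escaping anything (hypothetical helper, not part of codegen).
    return '        description: "' + description + '"'


# Case 1: an ordinary description that happens to contain double quotes.
broken = naive_description_line(
    'Some complex description containing "double quotes".'
)
print(broken)
# The quoted scalar now ends at the first inner quote, and the remaining
# text is left dangling outside the string.

# Case 2: a crafted description (assumed, for illustration) that closes the
# string early and starts a new line containing an extra key.
injected = naive_description_line('harmless"\n        injected_key: "value')
print(injected)
injected_lines = injected.splitlines()
```

In the second case the output is two well-formed lines, so nothing fails to parse, but the generated YAML contains a key that was never part of the description.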

Steps to reproduce

Step 1: Two models:

models/model1.sql

SELECT 1 AS a

models/model2.sql

SELECT * FROM {{ ref('model1') }}

Step 2: A yaml file:

models/model1_schema.yml

version: 2
models:
- name: model1
  columns:
  - name: a
    description: >
      Some complex description containing "double quotes".

Step 3: Reproducing the bug:

$ dbt run -s +model2
$ dbt run-operation codegen.generate_model_yaml --args '{"model_names": ["model2"], "upstream_descriptions": true}'

Optionally, write the output into a file models/model2_schema.yml and see that dbt parse fails due to a YAML parsing error.

Expected results

dbt run-operation codegen.generate_model_yaml should never generate invalid YAML.

Actual results

dbt run-operation codegen.generate_model_yaml sometimes generates invalid (or potentially dangerous) YAML.

Screenshots and log output

System information

The contents of your packages.yml file:

packages:
  - package: dbt-labs/codegen
    version: 0.11.0

Which database are you using dbt with?

The output of dbt --version:

Core:
  - installed: 1.5.2
  - latest:    1.6.5 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - bigquery: 1.5.3 - Update available!

The operating system you're using:

macOS 13.5.1

The output of python --version:

Python 3.11.1

Additional context

There is a simple fix for this: replace this line of code:

    {% do model_yaml.append('        description: "' ~ column_desc_dict.get(column.name | lower,'') ~ '"') %}

with this safer line of code:

    {% do model_yaml.append('        description: ' ~ (column_desc_dict.get(column.name | lower,'') | tojson)) %}

The tojson filter takes care of quoting and escaping the string, producing safe and valid YAML.
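The effect of the proposed change can be sketched in Python, using the standard library's `json.dumps` as a stand-in for Jinja's `tojson` filter. This is an assumption made for illustration: `tojson` additionally escapes a few HTML-sensitive characters, but for quotes, backslashes, and newlines the result should be equivalent. The helper name `safe_description_line` is hypothetical.

```python
import json


def safe_description_line(description: str) -> str:
    # json.dumps adds the surrounding double quotes and backslash-escapes
    # inner quotes and control characters, keeping the value on one line.
    return "        description: " + json.dumps(description)


original = 'Some complex description containing "double quotes".'
line = safe_description_line(original)
print(line)

# Recover the value to confirm nothing was lost in the escaping.
round_tripped = json.loads(line.split("description: ", 1)[1])

# A crafted value with an embedded newline stays inside a single scalar.
crafted_line = safe_description_line('harmless"\n        injected_key: "value')
print(crafted_line)
```

Because the escaped value is a single double-quoted scalar, a description can no longer terminate the string early or spill onto a new line.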

Are you interested in contributing the fix?

For sure. I'd love to submit a PR if you believe that's useful.