`generate_model_yaml macro` to take original yaml file as an additional argument

charlespicowski commented 2 years ago

Looking at the generate_model_yaml macro, I can see that it takes the following arguments:

- model_name
- upstream_descriptions

As I understand it, running

dbt run-operation generate_model_yaml --args '{"model_name": "dim_customer"}'

will produce a yaml similar to the following

version: 2

models:
  - name: dim_customer
    description: ""
    columns:
      - name: dim_customer_id
        description: ""

      - name: customer_id
        description: ""

      - name: customer_first_name
        description: ""

...

But what if there is already an existing yml file for dim_customer, such as...

...
  - name: dim_customer
    description: ""
    columns:
      - name: dim_customer_id
        description: "this is my description"
        meta:
           - something_else:
        tests_and_other_things...:
...

Would it be possible to carry these original descriptions, tests, and meta entries over into the new output produced by the generate_model_yaml macro?

Why someone would want this: most of the time you are not in fact generating a model yaml from scratch. There is already an existing one, and you have made changes to it. It's really a quality-of-life improvement, but it would greatly improve one's workflow and make it easier to keep up documentation best practices.

dbeatty10 commented 2 years ago

@charlespicowski if there were some way to left or full outer join two different YAML files into a new one, would that fill the need you are looking for?

Practically speaking, the "join" described above would probably be more like "merge file a.yml into file b.yml" or vice versa.
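
To make the join analogy concrete, here is a minimal Python sketch that treats each file's column list as a table keyed on `name`. The function name `join_columns` and its `how` parameter are hypothetical, and the inputs are assumed to be already-parsed lists of dicts rather than YAML files on disk.

```python
def join_columns(generated, existing, how="left"):
    """Join two column lists on "name".

    "left" keeps only columns present in `generated`;
    "outer" also keeps columns found only in `existing`.
    Non-empty values from `existing` override the generated ones.
    """
    if how not in ("left", "outer"):
        raise ValueError(f"unsupported join type: {how}")
    existing_by_name = {col["name"]: col for col in existing}
    merged = []
    for col in generated:
        old = existing_by_name.get(col["name"], {})
        # Only carry over values that are actually filled in.
        filled = {k: v for k, v in old.items() if v not in ("", None)}
        merged.append({**col, **filled})
    if how == "outer":
        generated_names = {col["name"] for col in generated}
        merged.extend(dict(col) for col in existing if col["name"] not in generated_names)
    return merged


generated = [
    {"name": "dim_customer_id", "description": ""},
    {"name": "customer_id", "description": ""},
]
existing = [
    {"name": "dim_customer_id", "description": "this is my description"},
    {"name": "dropped_column", "description": "no longer in the model"},
]

left = join_columns(generated, existing, how="left")
outer = join_columns(generated, existing, how="outer")
```

With a left join, a column that was removed from the model disappears from the output; with a full outer join it is kept, which may or may not be what you want for stale documentation.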

davesgonechina commented 1 year ago

I have a GitHub Action that uses yamlpath's yaml-merge with a deep array-of-hashes merge method, which basically does the left join that @dbeatty10 is talking about: the Action runs generate_source and then merges the old file in. I haven't quite gotten it ready to share publicly. It's not 100% pretty, but it works.

I'm also currently experimenting with changes to generate_source to do much the same thing, using the same method as generate_model_yaml. That method reads only the descriptions from the graph nodes and includes those values when appending the description: line to the yaml output. Expanding this to include the other four possible column fields (meta, data_type, quote, and tags) is possible, but you'd probably also want to skip empty fields (don't append meta if there's no existing value, for instance). The code gets pretty repetitive, since it writes yaml you've already got in the existing file, though I think it might be possible to figure out an elegant helper function or two that can handle all of it cleanly.
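
The helper idea might look something like the following Python sketch, which keeps a column's optional fields only when they carry a value. This is not codegen code (the macro itself is written in Jinja); `render_column` and `OPTIONAL_FIELDS` are hypothetical names, and the output only approximates what the macro prints.

```python
import json

# Optional column fields mentioned above, in the order they should be emitted.
OPTIONAL_FIELDS = ("data_type", "meta", "quote", "tags")


def render_column(column, indent=6):
    """Render one column as YAML lines, skipping optional fields that are empty."""
    pad = " " * indent
    lines = [
        f"{pad}- name: {column['name']}",
        # description is always emitted, even when blank
        f"{pad}  description: {json.dumps(column.get('description', ''))}",
    ]
    for field in OPTIONAL_FIELDS:
        value = column.get(field)
        if value in (None, "", [], {}):
            continue  # nothing to preserve, so do not append the key at all
        # JSON values are also valid YAML flow-style values, which avoids hand-rolled quoting
        lines.append(f"{pad}  {field}: {json.dumps(value)}")
    return "\n".join(lines)


print(render_column({"name": "customer_id", "description": "", "meta": {}, "tags": ["pii"]}))
```

Note that a field set to `false` (for example `quote: false`) is kept, because only missing, blank, or empty-container values are treated as empty.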

Another wrinkle is that it is difficult, maybe impossible, to preserve multi-line description block scalars (the > followed by text on multiple lines for readability) using the graph traversal approach of generate_model_yaml. Again, you're trying to reverse engineer yaml you've already got, so I'm considering pivoting back to working on the yamlpath GitHub Action rather than continuing to develop within dbt's macro ecosystem.

VDFaller commented 1 year ago

I'm imagining this as a separate macro, update_yaml_files, that would use a lot of the same functionality but would actually overwrite the file (or at least have the option to).

Caveat: I have no idea about the inner workings of this package, so I don't know if this is possible given the current code base. It also assumes that the yamls can be merged.

Describe the feature

args
- file_names: the list of files you want to update (could this default to None and just update all of them?)
- upstream_descriptions: the same as in generate_model_yaml
- overwrite: whether to overwrite the file in place (default false?)

It would be great if it could:

1. read all the models from the files selected
2. in DAG order, run generate_model_yaml against that single model
   - DAG order because I'm imagining updating a source description and then chaining that description through multiple models
3. merge the generated with what is currently in the yml
   - for exact collisions, generated always wins? unless generated is blank and in place is not?
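
The two less obvious parts of that list, DAG ordering and the collision rule, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `depends_on` is a made-up dependency map (in a real project it would have to come from dbt's graph), `resolve` and `merge_model` are hypothetical helpers, and the call to generate_model_yaml itself is not shown.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def is_blank(value):
    return value in ("", None)


def resolve(generated, existing):
    """Collision rule from the list above: generated wins unless it is blank and existing is not."""
    if is_blank(generated) and not is_blank(existing):
        return existing
    return generated


def merge_model(generated, existing):
    """Merge the scalar fields of one model; columns would need the same rule applied per column."""
    merged = dict(existing)  # keys found only in the existing file are kept as-is
    for key, value in generated.items():
        merged[key] = resolve(value, existing.get(key))
    return merged


# Hypothetical dependency map: model -> the models or sources it selects from.
depends_on = {
    "stg_customers": {"raw_customers"},
    "stg_orders": {"raw_orders"},
    "dim_customer": {"stg_customers"},
    "fct_orders": {"dim_customer", "stg_orders"},
}

# static_order() yields every node after its dependencies, i.e. in DAG order.
dag_order = list(TopologicalSorter(depends_on).static_order())

merged = merge_model(
    {"name": "dim_customer", "description": ""},
    {"name": "dim_customer", "description": "One row per customer", "access": "public"},
)
```

Processing models in `dag_order` means an updated upstream description is already merged by the time a downstream model is regenerated, which is what makes chaining descriptions through several models possible.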

github-actions[bot] commented 11 months ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 11 months ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

wjhrdy commented 9 months ago

I've created this. It is a separate Python script that merges generated yaml with existing yaml.

import yaml
from collections import OrderedDict
import sys
import re

def selective_quote_presenter(dumper, data):
    if re.search(r"[\'\{\}\[\],:]", data) or data == '':  # Check if the string contains characters that are hard for yaml to parse
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='"')
    return dumper.represent_scalar('tag:yaml.org,2002:str', data, style=None)

yaml.add_representer(str, selective_quote_presenter)

# Add a constructor for OrderedDict to the yaml module
def dict_constructor(loader, node):
    return OrderedDict(loader.construct_pairs(node))

yaml.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, dict_constructor)

# Add a representer for OrderedDict to the Dumper class
def dict_representer(dumper, data):
    return dumper.represent_mapping(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, data.items())

yaml.Dumper.add_representer(OrderedDict, dict_representer)

# Get the file names from command line arguments
generated = sys.argv[1]
base = sys.argv[2]
output_file = sys.argv[3]

# Load the contents of the first file
with open(generated, 'r') as file:
    data1 = yaml.load(file, Loader=yaml.Loader)

# Load the contents of the second file
with open(base, 'r') as file:
    data2 = yaml.load(file, Loader=yaml.Loader)

# merge the two dictionaries
for model2 in data2['models']:
    model1 = next((item for item in data1['models'] if item['name'] == model2['name']), None)
    if model1:
        merged_model = {**model1, **{k: model2[k] for k in model2.keys() - {'columns'}}}
        model1.update(merged_model)

        for column2 in model2.get('columns', []):
            column1 = next((item for item in model1.get('columns', []) if item['name'] == column2['name']), None)
            if column1:
                merged_column = {**column1, **column2}
                column1.update(merged_column)
            else:
                # setdefault so the append is kept even when model1 has no columns list yet
                model1.setdefault('columns', []).append(column2)
    else:
        data1['models'].append(model2)

model_keys_order = ['name', 'tags', 'description', 'docs', 'latest_version', 'deprecation_date', 'access', 'config', 'constraints', 'tests', 'columns', 'versions']
column_keys_order = ['name', 'data_type', 'description', 'meta', 'quote', 'constraints', 'tests', 'tags']

# After merging, sort the keys; keys not listed above sort last instead of raising ValueError
def key_rank(order):
    return lambda item: order.index(item[0]) if item[0] in order else len(order)

for model in data1['models']:
    model1 = OrderedDict(sorted(model.items(), key=key_rank(model_keys_order)))
    for column in model1.get('columns', []):
        column1 = OrderedDict(sorted(column.items(), key=key_rank(column_keys_order)))
        column.clear()
        column.update(column1)
    model.clear()
    model.update(model1)

# Serialize the merged data to a YAML string
output = yaml.dump(data1, Dumper=yaml.Dumper)

# Add a blank line before each "  - name:" entry, except the one directly after "columns:"
output = re.sub(r"(?<!columns:\n)(  - name:.*\n)", r"\n\1", output)

# Write the merged data back to the output file
with open(output_file, 'w') as file:
    file.write(output)

I call it from a justfile.

# generate model yml with all columns from a model sql file.
generated_default := 'generated'
target_default := 'dev'
dbt-generate-model-yaml model_name generated_folder=generated_default target=target_default:
    @if [ ! -d "{{generated_folder}}" ]; then \
        mkdir -p {{generated_folder}}; \
    fi
    @{{dbt}} run-operation --target {{target}}  codegen.generate_model_yaml --args '{"model_names": ["{{model_name}}"]}' > /tmp/{{model_name}}.tmpyml
    @awk '/models:/{p=1} p' /tmp/{{model_name}}.tmpyml > /tmp/temp{{model_name}} && mv /tmp/temp{{model_name}} {{generated_folder}}/{{model_name}}.yml
    @echo "Model {{model_name}} generated in {{generated_folder}}/{{model_name}}.yml"

# update yaml from generated schema
dbt-update-column-yaml folder=default_folder target=target_default:
    #!/usr/bin/env bash
    for yml_file in $(find {{folder}} -type f -name '*.yml' ! -name '_*'); do
        model_name=${yml_file##*/}
        model_name=${model_name%.yml}
        just dbt-generate-model-yaml $model_name generated {{target}}
        python merge_yaml.py generated/$model_name.yml $yml_file $yml_file
    done

dbt-labs / dbt-codegen

`generate_model_yaml macro` to take original yaml file as an additional argument #73
