Closed · charlespicowski closed this issue 11 months ago
@charlespicowski if there were some way to left or full outer join two different YAML files into a new one, would that fill the need you are looking for? Practically speaking, the "join" described above would probably be more like "merge file `a.yml` into file `b.yml`" or vice versa.
I have a GitHub Action that uses yamlpath's `yaml-merge` with a "deep array of hash" merge method to do essentially the left join @dbeatty10 is describing: the Action runs `generate_source` and then merges in the old file. I haven't quite gotten it ready to share publicly; it's not 100% pretty, but it works.
I'm also currently experimenting with changes to `generate_source()` to do much the same thing, using the same methods as `generate_model_yaml`. That method reads only the descriptions from the graph nodes and includes those values when appending the `description:` line to the YAML output. Expanding this to cover the other four possible column fields (`meta`, `data_type`, `quote`, and `tags`) is possible, but you'd also probably want to avoid appending empty fields (don't append `meta` if there's no existing value, for instance). The code gets pretty repetitive, since you're writing out YAML you already have in the existing file, though I think an elegant helper function or two could handle all of it cleanly.
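A helper along those lines (a hypothetical sketch, not part of dbt-codegen) could take a column dict and emit only the fields that actually have values:

```python
def non_empty_fields(column: dict,
                     fields=("description", "meta", "data_type", "quote", "tags")) -> dict:
    """Return only the fields with a truthy existing value, so empty
    entries are never appended to the generated YAML."""
    return {k: column[k] for k in fields if column.get(k)}

col = {"name": "id", "description": "primary key", "meta": {}, "tags": []}
print(non_empty_fields(col))  # {'description': 'primary key'}
```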
Another wrinkle: it is difficult, maybe impossible, to preserve multi-line description block scalars (the `>` followed by text wrapped over several lines for readability) using the graph-traversal approach of `generate_model_yaml`. Again, you're trying to reverse engineer YAML you already have, so I'm considering pivoting back to the yamlpath GitHub Action rather than continuing to develop within dbt's macro ecosystem.
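The style loss is easy to reproduce with plain PyYAML: once a folded block scalar is loaded, only the folded text survives, so a load/dump round trip drops the `>` indicator (preserving it would need a style-aware round-trip loader such as ruamel.yaml):

```python
import yaml

source = """\
description: >
  A long description
  spread over lines
"""
data = yaml.safe_load(source)
print(data["description"])  # the folded lines, joined with spaces
print(yaml.dump(data))      # no '>' indicator survives the round trip
```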
I'm imagining this as a separate macro, `update_yaml_files`, that would reuse a lot of the same functionality but would actually overwrite the file (or at least have the option to).
Caveat: I have no idea about the inner workings of this package, so I don't know whether this is possible with the current codebase. It also assumes the YAML files can be merged.
It would be great if it could run `generate_model_yaml` against that single model.
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
I've created this: a separate Python script that merges the generated YAML with the existing YAML.
import yaml
from collections import OrderedDict
import sys
import re


def selective_quote_presenter(dumper, data):
    # Quote strings containing characters that are hard for YAML to parse
    if re.search(r"[\'\{\}\[\],:]", data) or data == '':
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='"')
    return dumper.represent_scalar('tag:yaml.org,2002:str', data, style=None)


yaml.add_representer(str, selective_quote_presenter)


# Add a constructor for OrderedDict to the yaml module
def dict_constructor(loader, node):
    return OrderedDict(loader.construct_pairs(node))


yaml.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, dict_constructor)


# Add a representer for OrderedDict to the Dumper class
def dict_representer(dumper, data):
    return dumper.represent_mapping(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, data.items())


yaml.Dumper.add_representer(OrderedDict, dict_representer)

# Get the file names from command line arguments
generated = sys.argv[1]
base = sys.argv[2]
output_file = sys.argv[3]

# Load the freshly generated YAML
with open(generated, 'r') as file:
    data1 = yaml.load(file, Loader=yaml.Loader)

# Load the existing (base) YAML
with open(base, 'r') as file:
    data2 = yaml.load(file, Loader=yaml.Loader)

# Merge the existing data into the generated data; existing values win
for model2 in data2['models']:
    model1 = next((item for item in data1['models'] if item['name'] == model2['name']), None)
    if model1:
        merged_model = {**model1, **{k: model2[k] for k in model2.keys() - {'columns'}}}
        model1.update(merged_model)
        for column2 in model2.get('columns', []):
            column1 = next((item for item in model1.get('columns', []) if item['name'] == column2['name']), None)
            if column1:
                merged_column = {**column1, **column2}
                column1.update(merged_column)
            else:
                # setdefault attaches the list to the model before appending;
                # appending to get('columns', []) would silently discard the column
                model1.setdefault('columns', []).append(column2)
    else:
        data1['models'].append(model2)

model_keys_order = ['name', 'tags', 'description', 'docs', 'latest_version', 'deprecation_date', 'access', 'config', 'constraints', 'tests', 'columns', 'versions']
column_keys_order = ['name', 'data_type', 'description', 'meta', 'quote', 'constraints', 'tests', 'tags']


def order_index(key, order):
    # Unknown keys sort last instead of raising ValueError
    return order.index(key) if key in order else len(order)


# After merging, sort the keys into a stable order
for model in data1['models']:
    model1 = OrderedDict(sorted(model.items(), key=lambda i: order_index(i[0], model_keys_order)))
    for column in model1.get('columns', []):
        column1 = OrderedDict(sorted(column.items(), key=lambda i: order_index(i[0], column_keys_order)))
        column.clear()
        column.update(column1)
    model.clear()
    model.update(model1)

# Dump the merged data to a YAML string
output = yaml.dump(data1, Dumper=yaml.Dumper)

# Add a blank line before each line starting a new model entry
output = re.sub(r"(?<!columns:\n)( - name:.*\n)", r"\n\1", output)

# Write the merged data back to the output file
with open(output_file, 'w') as file:
    file.write(output)
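The key merge line is `{**model1, **model2}`: the right-hand (existing) dict wins on key conflicts, so hand-written descriptions survive regeneration while newly generated fields are still picked up. A minimal sketch of that precedence:

```python
generated_col = {"name": "customer_id", "data_type": "integer", "description": ""}
existing_col = {"name": "customer_id", "description": "Primary key.", "tests": ["unique"]}

# The later dict wins on conflicts, so the existing description overrides
# the generated one, while data_type is kept from the generated side
merged = {**generated_col, **existing_col}
print(merged["description"])  # Primary key.
print(merged["data_type"])    # integer
```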
I call it from a justfile:
# generate model yml with all columns from a model sql file.
generated_default := 'generated'
target_default := 'dev'

dbt-generate-model-yaml model_name generated_folder=generated_default target=target_default:
    @if [ ! -d "{{generated_folder}}" ]; then \
        mkdir -p {{generated_folder}}; \
    fi
    @{{dbt}} run-operation --target {{target}} codegen.generate_model_yaml --args '{"model_names": ["{{model_name}}"]}' > /tmp/{{model_name}}.tmpyml
    @awk '/models:/{p=1} p' /tmp/{{model_name}}.tmpyml > /tmp/temp{{model_name}} && mv /tmp/temp{{model_name}} {{generated_folder}}/{{model_name}}.yml
    @echo "Model {{model_name}} generated in {{generated_folder}}/{{model_name}}.yml"

# update yaml from generated schema
dbt-update-column-yaml folder=default_folder target=target_default:
    #!/usr/bin/env bash
    for yml_file in $(find {{folder}} -type f -name '*.yml' ! -name '_*'); do
        model_name=${yml_file##*/}
        model_name=${model_name%.yml}
        just dbt-generate-model-yaml $model_name generated {{target}}
        python merge_yaml.py generated/$model_name.yml $yml_file $yml_file
    done
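The `awk '/models:/{p=1} p'` step strips dbt's log preamble by printing only from the first line matching `models:` onward. The same filter, sketched in Python for clarity:

```python
def strip_preamble(text: str, marker: str = "models:") -> str:
    """Keep everything from the first line containing `marker` onward,
    mirroring awk '/models:/{p=1} p'."""
    lines = text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if marker in line:
            return "".join(lines[i:])
    return ""

raw = "12:00:00  Running with dbt...\nmodels:\n  - name: dim_customer\n"
print(strip_preamble(raw))  # the YAML without the log line
```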
Looking at the `generate_model_yaml` macro, I can see that it takes the following arguments: `model_name` and `upstream_descriptions`. As I understand it, running

dbt run-operation generate_model_yaml --args '{"model_name": "dim_customer"}'

will produce YAML similar to the following.
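For illustration (the model and column names here are assumptions for the example), the output of `generate_model_yaml` has roughly this shape: a `version: 2` document with the model's name, an empty description, and one entry per column.

```yaml
version: 2

models:
  - name: dim_customer
    description: ""
    columns:
      - name: customer_id
        data_type: integer
        description: ""
```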
But what if there is an already existing `yml` file for `dim_customers`, such as... Is it possible to carry the original descriptions, tests, and meta entries over into the new output produced by the `generate_model_yaml` macro?

Why someone would want this: most of the time you are not generating a model YAML from scratch; there is already an existing one that you have made changes to. It's really a quality-of-life improvement, but it would greatly improve one's workflow and make it easier to keep to documentation best practices.