dbt-labs / dbt-codegen

Macros that generate dbt code
https://hub.getdbt.com/dbt-labs/codegen/latest/
Apache License 2.0
459 stars 99 forks source link

Improving interface within a dbt repo using `just` #144

Closed wjhrdy closed 4 months ago

wjhrdy commented 10 months ago

Describe the feature

The interface for this package is clunky, so I abstracted it using a justfile. adding a step to generate all missing yml.

just dbt-generate-missing-yaml


# generate _*sources.yml for all tables in a schema.
dbt-generate-source database schema:
    {{dbt}} run-operation codegen.generate_source --args '{"schema_name": "{{schema}}", "database_name": "{{database}}"}'

# generate the model sql for a table defined in your sources yml to speed up renaming step.
dbt-generate-base-model source table:
    @{{dbt}} run-operation codegen.generate_base_model --args '{"source_name": "{{source}}", "table_name": "{{table}}"}'  > generated/stg_{{table}}.sql
    @awk '/with source as \(/{p=1} p' generated/stg_{{table}}.sql > temp && mv temp generated/stg_{{table}}.sql
    @echo "Model {{table}} generated in generated/stg_{{table}}.sql"

# generate model yml with all columns from a model sql file.
generated_default := 'generated'
dbt-generate-model-yaml model_name generated_folder=generated_default:
    @if [ ! -d "{{generated_folder}}" ]; then \
        mkdir -p {{generated_folder}}; \
    fi
    @{{dbt}} run-operation codegen.generate_model_yaml --args '{"model_names": ["{{model_name}}"]}' > /tmp/{{model_name}}.tmpyml
    @awk '/models:/{p=1} p' /tmp/{{model_name}}.tmpyml > /tmp/temp{{model_name}} && mv /tmp/temp{{model_name}} {{generated_folder}}/{{model_name}}.yml
    @echo "Model {{model_name}} generated in {{generated_folder}}/{{model_name}}.yml"

# generate model yml with all columns for all sql files without accompanying yml files.
# optionally accept a parameter for folder to search in
default_folder := 'models'
dbt-generate-missing-yaml folder=default_folder:
    @for sql_file in $(find {{folder}} -type f -name '*.sql'); do \
        yml_file=${sql_file%.sql}.yml; \
        if [ ! -f $yml_file ]; then \
            model_name=${sql_file##*/}; \
            model_name=${model_name%.sql}; \
            folder_name=$(dirname ${sql_file}); \
            just dbt-generate-model-yaml $model_name $folder_name; \
        fi; \
    done

# Clean up model references in your sql files by generating CTE aliases for all models referenced in a sql file.
dbt-generate-model-import-ctes model_name generated_folder=generated_default:
    @if [ ! -d "{{generated_folder}}" ]; then \
        mkdir -p {{generated_folder}}; \
    fi
    @{{dbt}} run-operation codegen.generate_model_import_ctes --args '{"model_name": "{{model_name}}"}'> /tmp/{{model_name}}.tmpsql
    @awk '/with .* as \($/{p=1} p' /tmp/{{model_name}}.tmpsql | sed 's/^.*with/with/' > /tmp/temp{{model_name}} && mv /tmp/temp{{model_name}} {{generated_folder}}/{{model_name}}.sql
    @echo "SQL {{model_name}} edited in {{generated_folder}}/{{model_name}}.sql"

Currently I'm parsing the stdout and saving to a folder called generated/ which is cumbersome. It would be nice to have a flag to save the output to a file.

Hope this justfile can help people using this repo.

wjhrdy commented 8 months ago

updated to the latest

wjhrdy commented 8 months ago

updated to the latest

gwenwindflower commented 6 months ago

hey @wjhrdy this is awesome, i totally agree with what you're doing here. i have a personal set of bash scripts, python scripts, and Task configs i use to wrap this repo because i agree the interface's original goal of not editing files directly and accomplishing all this with dbt itself leads to not the best workflow.

appreciate your patience on this, i'm doing some work to catch this repo up. if you're still interested in contributing this i'd love to help you get this across the finish line.

my biggest question for you is this: would you open to porting these flows to Task and perhaps collaborating together on a robust command set? i ask this because Task is extremely similar and spiritually has the same goals as just but is configured with YAML, which is more familiar to the average dbt user.

i'm also (i think) open to perhaps offering a plethora of workflow enhancements, bash scripts, python scripts, task, and just to allow people to choose what they prefer. though my immediate preference would be to unify around one approach for simplicity. open to discussing!

let me know if you're still interested in contributing this and what you think re the above!

wjhrdy commented 5 months ago

Hey @gwenwindflower, thanks so much for your response and engagement on this! I'm thrilled to hear you're supportive of the direction I'm going with this and agree that the current interface has room for improvement in terms of developer workflow.

I really appreciate you taking the time to catch up on this repo and offer to collaborate. To answer your question - I'm definitely still interested in contributing and would be happy to work together on a robust set of commands to enhance the dbt-codegen workflow.

Regarding Task vs Justfile - while I agree that YAML is generally more familiar to dbt users, I believe Justfile has some advantages that make it well-suited for this use case:

  1. Justfiles use a simple, intuitive syntax that is easy to read and write even for those less familiar with it. The commands are essentially just shell commands.

  2. Justfiles provide a flat namespace of commands, making it easy to see all available commands at a glance. With Task, the hierarchy of commands can sometimes make it harder to get that bird's-eye view.

  3. Justfiles have very little indirection - what you see is what gets executed. Task's YAML configuration layer can sometimes obscure what's actually happening.

  4. Justfiles are lightweight and have minimal dependencies - just having just installed is all you need. Task requires a Python environment which adds some overhead.

So while Task is certainly a viable option with advantages like YAML familiarity, I believe the simplicity, transparency and light weight of Justfiles makes them an ideal fit for optimizing dbt-codegen workflows.

That said, I'm certainly open to discussing the best approach, whether that's Just, Task, bash/python scripts or some combination! Having a curated set of workflow enhancements that help folks get the most out of dbt-codegen would be fantastic.

I'm excited to collaborate with you on this! Let me know what you think about the Just vs Task tradeoffs and how you'd like to proceed. Looking forward to working together to make dbt-codegen workflows even better!

gwenwindflower commented 5 months ago

Awesome, that all makes sense to me! These all seem like very good reasons for preferring just, and it's a very loosely held preference for me so as long as you feel strongly about this approach I'm down with it 😸. So let's focus on just for now and see how far we can go! I've written bash scripts for building all the sources and staging models within a target schema(s) while respecting existing files with one command, so we can definitely get at least that far with this.

So to start with I'll review the current state and drop any comments on where we might expand, I'll probably branch off this first and hack on it a bit myself as I tend to best understand code when I'm working with it, and I'll get back to you after that. Thanks again for the thought and work on this, wrapping codegen properly in the repo itself has been a pet project/on my todo list forever so I'm glad to finally be doing it!

gwenwindflower commented 5 months ago

@wjhrdy -- so i've got setup and got my head around the just basics for the most part; on a new branch which i'll push up and make into a draft PR for us to collaborate in shortly, but have some questions.

question 1: why is {{dbt}} a variable? when i run commands just complains that variable is not defined. when i just make it a regular command everything seems to work.

question 2: what is the purpose of all the temp folder renaming and moving here? why not just pipe directly into the file at the end?

@awk '/models:/{p=1} p' /tmp/{{model_name}}.tmpyml > /tmp/temp{{model_name}} && mv /tmp/temp{{model_name}} {{generated_folder}}/{{model_name}}.yml
wjhrdy commented 5 months ago
  1. The {{dbt}} variable is defined at the top of the Justfile as:

    dbt := 'poetry run dbt'

    This allows you to run dbt commands within the Poetry virtual environment. If you're getting an error that the variable is not defined, make sure you're running the commands using just, e.g., just dbt run. If you run dbt directly, it won't have access to the variables defined in the Justfile.

  2. The purpose of using temporary files and renaming is to ensure atomic writes and avoid race conditions or partial writes. Here's the specific line you mentioned:

    @awk '/models:/{p=1} p' /tmp/{{model_name}}.tmpyml > /tmp/temp{{model_name}} && mv /tmp/temp{{model_name}} {{generated_folder}}/{{model_name}}.yml

    This line does the following:

    • It uses awk to extract the relevant portion of the YAML file starting from the models: section and writes it to a temporary file /tmp/temp{{model_name}}.
    • Once the awk command finishes successfully, it moves the temporary file to the final destination {{generated_folder}}/{{model_name}}.yml using the mv command.

    The reason for using a temporary file and then moving it is to ensure that the final file is written completely and atomically. If you were to pipe directly to the final file and the command encountered an error or was interrupted, you could end up with a partially written or corrupted file. By writing to a temporary file first and then moving it, you guarantee that the final file is only updated when the entire operation has completed successfully.

    This pattern of using temporary files and atomic moves is a common practice in shell scripting to ensure data integrity and avoid issues with partial writes or race conditions.

Let me know if you have any further questions!

wjhrdy commented 5 months ago

Another huge enhancement i made to this justfile was the addition of a yaml merge script

https://github.com/dbt-labs/dbt-codegen/issues/73#issuecomment-1910941957

gwenwindflower commented 5 months ago

Ah, thank you! I think you may have accidentally deleted the definition of dbt in the most recent update, hence the confusion 😅

GTK re the tmp folder pattern, makes perfect sense, my shell scripting just got more robust!

Screenshot 2024-03-07 at 3 14 39 PM
gwenwindflower commented 5 months ago

Another huge enhancement i made to this justfile was the addition of a yaml merge script

#73 (comment)

Nice! I was literally just about to brainstorm about that, the other nice thing about the tmp file pattern is you've got this nice staging ground to process the file and do some more sophisticated things that require access to the whole file ideally so would be harder to do in a big pipe chain. Dig it!

jasnonaz commented 4 months ago

Closing via discussion here