brooklyn-data / dbt_artifacts

A dbt package for modelling dbt metadata. https://brooklyn-data.github.io/dbt_artifacts
Apache License 2.0

Implement and document usage for dbt Cloud #19

Closed: kgpayne closed this issue 3 years ago

kgpayne commented 3 years ago

A few questions have come through on the dbt Slack workspace about deployment on dbt Cloud, relating to artefact uploads. It would be useful to have a definitive method for deployment on dbt Cloud, in addition to the manual upload and CLI run-operation options.

As I understand it, the key issue revolves around the invocation_id generated by dbt and stored in run_results.json and manifest.json. It is the same for a given command; however, run-operation (used to call the macro that uploads artefacts) creates its own manifest.json, replacing the one from any previous operation 🤦‍♂️ This results in mismatched invocation_ids in run_results.json and manifest.json, breaking functionality that relies on joining the two artefacts.

For those calling run-operation via the CLI, the --no-write-json global flag is reported to solve this particular issue. It would also be possible to configure an env_var prefixed with DBT_ENV_CUSTOM_ENV_, which is injected into each artefact as per the docs. These vars would appear in stg_dbt__run_results_env_keys and could be used to correlate artefacts from the same 'run' (this is not implemented in dbt_artifacts, but could be if it were of use).
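As a minimal sketch of that correlation approach (the variable name and the upload macro name here are hypothetical, not part of the package): a wrapper script could generate one key for the whole run and let dbt copy it into every artefact it writes:

```shell
# Hypothetical wrapper: any env var prefixed with DBT_ENV_CUSTOM_ENV_
# is injected by dbt into the metadata of each artefact it writes, so
# run_results.json and the manifest.json rewritten by the later
# run-operation would share the same correlation key.
export DBT_ENV_CUSTOM_ENV_RUN_KEY="$(uuidgen)"

dbt run
dbt run-operation upload_dbt_artifacts  # upload macro name assumed
```

Downstream models could then join artefacts on this key instead of invocation_id.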

However, dbt Cloud supports neither --no-write-json nor custom env_vars 🤔 dbt Cloud does, however, provide env_vars with the job and run IDs, which could potentially be used for this purpose. I cannot determine whether these make it into the artefacts or not. If they don't, an alternative approach might be to inject them directly via the upload_artifacts copy query. Something like:

```sql
begin;

copy into {{ src_dbt_artifacts }} from
    (
        select
            $1 as data,
            $1:metadata:generated_at::timestamp_ntz as generated_at,
            metadata$filename as path,
            regexp_substr(metadata$filename, '([a-z_]+.json)') as artifact_type,
            -- dbt cloud env_vars
            {{ env_var('DBT_CLOUD_PROJECT_ID', 'not on cloud') }} as dbt_cloud_project_id,
            {{ env_var('DBT_CLOUD_JOB_ID', 'not on cloud') }} as dbt_cloud_job_id,
            {{ env_var('DBT_CLOUD_RUN_ID', 'not on cloud') }} as dbt_cloud_run_id,
            {{ env_var('DBT_CLOUD_RUN_REASON_CATEGORY', 'not on cloud') }} as dbt_cloud_run_reason_category,
            {{ env_var('DBT_CLOUD_RUN_REASON', 'not on cloud') }} as dbt_cloud_run_reason,
            {{ env_var('DBT_CLOUD_PR_ID', 'not on cloud') }} as dbt_cloud_pr_id,
            {{ env_var('DBT_CLOUD_GIT_SHA', 'not on cloud') }} as dbt_cloud_git_sha
        from @{{ src_dbt_artifacts }}
    )
    file_format = (type = 'JSON')
    on_error = 'skip_file';

commit;
```

Help, testing and comments on this are very welcome 🙏

john-jerome commented 3 years ago

> I cannot determine if these make it into the artefacts or not.

I believe they do (at least some of them): DBT_CLOUD_PROJECT_ID, DBT_CLOUD_RUN_ID, DBT_CLOUD_JOB_ID, DBT_CLOUD_RUN_REASON and DBT_CLOUD_RUN_REASON_CATEGORY are present in both manifest.json and run_results.json.

Edit: DBT_CLOUD_PR_ID and DBT_CLOUD_GIT_SHA are only there if it's a PR-triggered run.

john-jerome commented 3 years ago

I tested the solution you suggested and it seems to be working! One small thing to note: we need to be careful with quotes. All those env vars should be wrapped in single quotes (' '), so this is the code that worked for me:

```sql
begin;

copy into {{ src_dbt_artifacts }} from
    (
        select
            $1 as data,
            $1:metadata:generated_at::timestamp_ntz as generated_at,
            metadata$filename as path,
            regexp_substr(metadata$filename, '([a-z_]+.json)') as artifact_type,
            -- dbt cloud env_vars
            '{{ env_var('DBT_CLOUD_PROJECT_ID', 'not on cloud') }}' as dbt_cloud_project_id,
            '{{ env_var('DBT_CLOUD_JOB_ID', 'not on cloud') }}' as dbt_cloud_job_id,
            '{{ env_var('DBT_CLOUD_RUN_ID', 'not on cloud') }}' as dbt_cloud_run_id,
            '{{ env_var('DBT_CLOUD_RUN_REASON_CATEGORY', 'not on cloud') }}' as dbt_cloud_run_reason_category,
            '{{ env_var('DBT_CLOUD_RUN_REASON', 'not on cloud') }}' as dbt_cloud_run_reason
        from @{{ src_dbt_artifacts }}
    )
    file_format = (type = 'JSON')
    on_error = 'skip_file';

commit;
```

Also, the create_artifact_resources macro has to be updated accordingly:

```sql
create table if not exists {{ src_dbt_artifacts }} (
    data variant,
    generated_at timestamp,
    path string,
    artifact_type string,
    dbt_cloud_project_id string,
    dbt_cloud_job_id string,
    dbt_cloud_run_id string,
    dbt_cloud_run_reason_category string,
    dbt_cloud_run_reason string
);
```
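For a deployment where the source table already exists, the new columns could presumably be added in place rather than dropping and recreating the table. A sketch in Snowflake DDL (one statement per column, mirroring the definition above):

```sql
alter table {{ src_dbt_artifacts }} add column dbt_cloud_project_id string;
alter table {{ src_dbt_artifacts }} add column dbt_cloud_job_id string;
alter table {{ src_dbt_artifacts }} add column dbt_cloud_run_id string;
alter table {{ src_dbt_artifacts }} add column dbt_cloud_run_reason_category string;
alter table {{ src_dbt_artifacts }} add column dbt_cloud_run_reason string;
```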

The next step would probably be to update the corresponding models so that they now make use of dbt_cloud_run_id instead of invocation_id (where it makes sense). For now, I'll just move those models from the package to our project and modify them there, but it'd be nice to have this fix inside the package, too :)
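As a sketch of what that model change might look like (the staging model and column names here are assumed for illustration, not the package's actual models), the join could fall back to invocation_id so it still works for runs outside dbt Cloud:

```sql
select
    run_results.*,
    manifest.data as manifest_data
from stg_dbt__run_results as run_results
left join stg_dbt__manifest as manifest
    -- prefer the injected dbt Cloud run id; fall back to invocation_id
    -- for local/CLI runs where the column holds the 'not on cloud' default
    on coalesce(
        nullif(run_results.dbt_cloud_run_id, 'not on cloud'),
        run_results.invocation_id
    ) = coalesce(
        nullif(manifest.dbt_cloud_run_id, 'not on cloud'),
        manifest.invocation_id
    )
```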