cdp-ucsc / cdp-ucsc-dbt-codegen

UCSC's CDP forked project of dbt-labs dbt-codegen
https://hub.getdbt.com/dbt-labs/codegen/latest/
Apache License 2.0
0 stars 0 forks source link

Add a feature to handle setid for UCPath tables #1

Open insuhpak opened 3 weeks ago

insuhpak commented 3 weeks ago

Ideally I would like to avoid making a new macro specifically for setid'd UCPath tables.

(A general feature approach)

A feature could be added such that a where condition or conditions could be declared in the source.yml for each table and then surfaced in the generated model.

      - name: ps_jobcode_tbl
        description: jobcode validation table
        meta:
          effective_dated: true
          partition_columns: ["jobcode", "setid"]
          final_where_condition: ["setid in ('UCSHR')"]

The declared final_where_condition would surface in the final select statement of the staging model. For example,

-- Dummy Line

/*  The partition columns are: ['jobcode', 'setid']
*/

with
    source as (
        select * from {{ source(  'ucp', 'ps_jobcode_tbl' ) }}
        where
            _fivetran_deleted != true
            and dml_ind != 'D'

    ),

    valid_to as (
        -- Group the records so that each group belonging to one effdt gets the same valid_to date
        select
            jobcode,
            setid,
            effdt,
            coalesce((lag(to_date(effdt), 1) over (
                partition by
                    jobcode,
                    setid
                order by effdt desc
            ) - 1), '2099-12-31') as valid_to

        from source
        group by
            1,
            2,
            3
    ),

    final as (
        select
            source._fivetran_deleted,
            source._fivetran_id,
            source._fivetran_synced,
            source.avail_telework,
            source.barg_unit,
            source.can_noc_cd,
            source.comp_frequency,
            source.company,
            source.cr_bt_dtm,
            source.cr_bt_nbr,
            source.currency_cd,
            source.descr,
            source.descrshort,
            source.directly_tipped,
            source.dml_ind,
            source.eeo1code,
            source.eeo4code,
            source.eeo5code,
            source.eeo6code,
            source.eeo_job_group,
            source.eff_status,
            source.effdt,
            source.eg_academic_rank,
            source.eg_group,
            source.encumb_sal_amt,
            source.encumb_sal_optn,
            source.encumber_indc,
            source.flsa_status,
            source.function_cd,
            source.grade,
            source.ipedsscode,
            source.job_family,
            source.job_function,
            source.job_sub_func,
            source.jobcode,
            source.key_jobcode,
            source.last_update_date,
            source.lastupddttm,
            source.lastupdoprid,
            source.manager_level,
            source.med_chkup_req,
            source.ods_vrsn_nbr,
            source.posn_mgmt_indc,
            source.reg_region,
            source.reg_temp,
            source.retro_percent,
            source.retro_rate,
            source.sal_admin_plan,
            source.sal_range_currency,
            source.sal_range_freq,
            source.sal_range_max_rate,
            source.sal_range_mid_rate,
            source.sal_range_min_rate,
            source.setid,
            source.setid_salary,
            source.std_hours,
            source.std_hrs_frequency,
            source.step,
            source.survey_job_code,
            source.survey_salary,
            source.trn_program,
            source.union_cd,
            source.upd_bt_dtm,
            source.upd_bt_nbr,
            source.us_occ_cd,
            source.us_soc_cd,
            source.workers_comp_cd,

            -- New Objects 
            to_date(source.effdt) as valid_from,
            valid_to.valid_to as valid_to,
            case
                when valid_from > {{ var("current_date_pst") }} then 'future'
                when valid_to.valid_to < {{ var("current_date_pst") }} then 'past'
                when valid_to.valid_to >= {{ var("current_date_pst") }} then 'current'
            end as current_record_desc,
            case
                when current_record_desc = 'current' then true
                when current_record_desc in ('future', 'past') then false
            end as is_current_record

        from source

        left outer join valid_to
            on source.jobcode = valid_to.jobcode
                and source.setid = valid_to.setid
                and source.effdt = valid_to.effdt
    )

select * from final
where -- <-- THIS WHERE CONDITION WOULD BE VARIABLE AND ONLY POPULATE WHEN THE final_where_condition KEY IS POPULATED
    setid in ('UCHSR')

(A UCPath setid specific approach)

If we choose to make a setid specific UCPath macro, then the meta information declared can be more specific (naming wise).

      - name: ps_jobcode_tbl
        description: jobcode validation table
        meta:
          effective_dated: true
          partition_columns: ["jobcode", "setid"]
          ucsc_setid: ["UCSHR"]

We could also have more freedom in where the condition is set. For example we can apply the condition at the source CTE,

-- Dummy Line

/*  The partition columns are: ['jobcode', 'setid']
*/

with
    source as (
        select * from {{ source(  'ucp', 'ps_jobcode_tbl' ) }}
        where
            _fivetran_deleted != true
            and dml_ind != 'D'
            and setid in ('UCSHR') -- <-- THIS LINE WOULD NEED TO BE HARCODED IN THE MACRO. THE IN LIST VALUES WOULD BE THE ONLY VARIABLE PORTION.
    ),

    valid_to as (
        -- Group the records so that each group belonging to one effdt gets the same valid_to date
        select
            jobcode,
            setid,
            effdt,
            coalesce((lag(to_date(effdt), 1) over (
                partition by
                    jobcode,
                    setid
                order by effdt desc
            ) - 1), '2099-12-31') as valid_to

        from source
        group by
            1,
            2,
            3
    ),

    final as (
        select
            source._fivetran_deleted,
            source._fivetran_id,
            source._fivetran_synced,
            source.avail_telework,
            source.barg_unit,
            source.can_noc_cd,
            source.comp_frequency,
            source.company,
            source.cr_bt_dtm,
            source.cr_bt_nbr,
            source.currency_cd,
            source.descr,
            source.descrshort,
            source.directly_tipped,
            source.dml_ind,
            source.eeo1code,
            source.eeo4code,
            source.eeo5code,
            source.eeo6code,
            source.eeo_job_group,
            source.eff_status,
            source.effdt,
            source.eg_academic_rank,
            source.eg_group,
            source.encumb_sal_amt,
            source.encumb_sal_optn,
            source.encumber_indc,
            source.flsa_status,
            source.function_cd,
            source.grade,
            source.ipedsscode,
            source.job_family,
            source.job_function,
            source.job_sub_func,
            source.jobcode,
            source.key_jobcode,
            source.last_update_date,
            source.lastupddttm,
            source.lastupdoprid,
            source.manager_level,
            source.med_chkup_req,
            source.ods_vrsn_nbr,
            source.posn_mgmt_indc,
            source.reg_region,
            source.reg_temp,
            source.retro_percent,
            source.retro_rate,
            source.sal_admin_plan,
            source.sal_range_currency,
            source.sal_range_freq,
            source.sal_range_max_rate,
            source.sal_range_mid_rate,
            source.sal_range_min_rate,
            source.setid,
            source.setid_salary,
            source.std_hours,
            source.std_hrs_frequency,
            source.step,
            source.survey_job_code,
            source.survey_salary,
            source.trn_program,
            source.union_cd,
            source.upd_bt_dtm,
            source.upd_bt_nbr,
            source.us_occ_cd,
            source.us_soc_cd,
            source.workers_comp_cd,

            -- New Objects 
            to_date(source.effdt) as valid_from,
            valid_to.valid_to as valid_to,
            case
                when valid_from > {{ var("current_date_pst") }} then 'future'
                when valid_to.valid_to < {{ var("current_date_pst") }} then 'past'
                when valid_to.valid_to >= {{ var("current_date_pst") }} then 'current'
            end as current_record_desc,
            case
                when current_record_desc = 'current' then true
                when current_record_desc in ('future', 'past') then false
            end as is_current_record

        from source

        left outer join valid_to
            on source.jobcode = valid_to.jobcode
                and source.setid = valid_to.setid
                and source.effdt = valid_to.effdt
    )

select * from final

The reason final_where_condition is restricted to the final select statement is so devs have more freedom in what kind of conditions they can specify. Since it is going to be added to the final select statement then can apply conditions on fields that were added during the staging process.

I prefer the final_where_condition method because this can be used in other models and not just for UCPath setid tables.

For example for our effseq'd models we could final_where_condition = ["setid in ('UCHSR ')", "is_max_effseq_of_effdt = true"].

.
.
.
select * from final
    where
        setid in ('UCHSR')
        and is_max_effseq_of_effdt = true
magants commented 3 weeks ago

I like the options Insuh. I do like handling setid at the source CTE (feels appropriate there), but I also like the final_where_condition tag being able to handle other conditions.

Is the reason the ucsc_setid meta tag wouldn't work in the general feature approach due to the fact that setid doesn't exist in other projects (besides Student)? If it doesn't exist couldn't we just not populate it and then the macro wouldn't take it into consideration?

Let's bring this up at the next AE meeting. I'm interested in what the others have to say.