how to handle staging models that depend on multiple sources in fct_staging_directories

github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

graciegoheen commented 1 year ago

Just to give more context on this...

In fct_model_directories we're specifying staging_models that have an incorrect file path based on their upstream source dependencies. However, if a staging model is dependent on multiple sources (against our best practices, and will cause a warning for other checks in this package, but alas it could still happen) then there will be multiple records for the staging model potentially indicating multiple different values for change_file_path_to. This means that you may be unable to resolve the warnings from this check in the case where your staging model depends on 2 different sources (defined in 2 different directories).

staging_models as (
    select  
        child,
        child_resource_type,
        child_model_type,
        child_file_path,
        child_directory_path,
        child_file_name,
        parent_source_name
    from all_dag_relationships
    where parent_resource_type = 'source'
    and child_resource_type = 'model'
    and child_model_type = 'staging'
),

-- find all staging models that are NOT in their source parent's subdirectory
inappropriate_subdirectories_staging as (
    select distinct -- must do distinct to avoid duplicates when staging model has multiple paths to a given source
        child as resource_name,
        child_resource_type as resource_type,
        child_model_type as model_type,
        child_file_path as current_file_path,
        'models{{ directory_pattern }}' || '{{ var("staging_folder_name") }}' || '{{ directory_pattern }}' || parent_source_name || '{{ directory_pattern }}' || child_file_name as change_file_path_to
    from staging_models
    where child_directory_path not like '%' || parent_source_name || '%'
),

Curious what should / could be a solution here?

BradCr commented 11 months ago

A possible solution I came up with would be to create another CTE that identifies staging models that reference more than one source, and then do a listagg() on the recommended paths to roll them up into a single column with a message that mentions the offending staging model should be split into separate staging models that each reference a single source. The error mess could look like this:

More than one source. Split into separate staging models in: models/staging/sales/ AND models/staging/finance/

Would your expectation be that the current suggested path still appears for the staging model (in fct_model_directores.change_file_path_to) in addition to the new message about splitting the sources? If not, then the inappropriate_subdirectories_staging CTE can be modified to JUST skip over staging models who have more than one source, and the multi-source models would have to be identified somewhere else (in a new model, perhaps). Or would you want to suppress the individual errors until the multiple sources get resolved and only present the multi-error at first? I don't think that would be ideal because then a user will fail the test, fix it, and then fail the test a second time in order to see both sets of errors.

SQL Here are the two CTEs I propose, the output of which can unioned in the final CTE at the end:

--Add this to get all the stage models with count of parent source. Next step will be to count which ones have more than one source parent
staging_by_parent_source_count as (
    select
        count(distinct parent_source_name) as resource_count
        ,child as resource_name
        ,child_resource_type as resource_type
        ,child_model_type as model_type
        ,child_file_path as current_file_path
    from staging_models
    group by child, child_resource_type, child_model_type, child_file_path
),

--Added this CTE to listagg() the multuple suggested paths and advise the user to split the source file into those two places.
multiple_sources_staging_to_split as (
    select 
        staging_models.child as resource_name,
        staging_models.child_resource_type as resource_type,
        staging_models.child_model_type as model_type,
        staging_models.child_file_path as current_file_path,
        'More than one source. Split into separate staging models in: ' ||
          listagg('models/' || 'staging' || '/' || staging_models.parent_source_name || '/', ' AND ')
          within group(order by 'models/' || 'staging' || '/' || staging_models.parent_source_name || '/') as change_file_path_to
    from staging_models
    join staging_by_parent_source_count on staging_models.child = staging_by_parent_source_count.resource_name and
                                           staging_models.child_resource_type = staging_by_parent_source_count.resource_type and
                                           staging_models.child_model_type = staging_by_parent_source_count.model_type and
                                           staging_models.child_file_path = staging_by_parent_source_count.current_file_path
    where staging_by_parent_source_count.resource_count > 1
    group by staging_models.child, staging_models.child_resource_type, staging_models.child_model_type, staging_models.child_file_path
),

unioned as (
    select * from inappropriate_subdirectories_staging
    union all
    select * from innappropriate_subdirectories_non_staging_models
    union all --Added union all to append these results
    select * from multiple_sources_staging_to_split
)

select * from unioned;

I also found a small typo in the comments at the top of the model's query:

Where the second comment line says, "should be in nested in" there's an extra "in".

-- For staging models: The files should be in nested in the staging folder in a subfolder that matches their source parent's name.

BradCr commented 8 months ago

@graciegoheen I found this issue last year and had put down my thoughts, but since I focused on the too_many_joins change first I hadn't come back to this until now.

This is another issue I would like to take on. I've edited my original notes for clarity, and I want to get your input on how you would like to see the error messages.

graciegoheen commented 8 months ago

Hey! Thanks for being down to tackle this one.

More than one source. Split into separate staging models in: models/staging/sales/ AND models/staging/finance/

I like this suggestion. Here's what I'm thinking...

Scenario 1 If a model depends on multiple source tables, but those tables are within the same source...

#  models/staging/jaffle_shop/sources.yml

sources:
  - name: jaffle_shop
    tables:
      - name: orders
      - name: customers

# models/staging/stripe/my_model.sql

select * from {{ source('jaffle_shop', 'orders') }}
union all
select * from {{ source('jaffle_shop', 'customers') }}

we should only flag that my_model.sql should be moved to models/staging/jaffle_shop/.

Scenario 2 If a model depends on multiple source tables, AND those tables are within different sources...

#  models/staging/jaffle_shop/sources.yml

sources:
  - name: jaffle_shop
    tables:
      - name: orders

#  models/staging/stripe/sources.yml

sources:
  - name: stripe
    tables:
      - name: payments

# models/staging/bamboo/my_model.sql

select * from {{ source('jaffle_shop', 'orders') }}
union all
select * from {{ source('stripe', 'payments') }}

we should flag that my_model.sql should be split to separate models in models/staging/jaffle_shop/ and models/staging/stripe/.

What do you think? Any scenarios I missed?

BradCr commented 8 months ago

Happy to help!

Yeah, that sounds good. The only other scenario I can think of would be a combination of the two, but I think individually handling each scenario type will give the correct suggestions in the end. Let me put those two checks together and make sure they're good. Then I'll create something that breaks both best practices with a single model, to see what the output looks like.

BradCr commented 7 months ago

There's an issue when I try to build the model with my new CTEs. It looks like duckdb doesn't support listagg(). Here's the error message I get when I try to build.

Runtime Error in model fct_model_directories (models\marts\structure\fct_model_directories.sql) Parser Error: Unknown ordered aggregate "listagg".

My environment has duckdb version 0.9.2, dbt-duckdb version 1.7.1 and dbt-core version 1.7.5 .

Do you know if there's a similar function to listagg on duckdb? I can look into other commands that would work to bring multiple warnings into a single string. I also just posted on the dbt Slack #db-duckdb channel, to ask if anyone has suggestions.

graciegoheen commented 7 months ago

@BradCr Try out the dbt.listagg cross-db macro - docs here.

We use it a couple other places in this package (see here).

Though you might have to do some special things with sorting. Additionally, this only works on tables for redshift - so make sure it's materialized as a table (see here).

BradCr commented 7 months ago

Thanks, @graciegoheen! I also played around with the list_agg and you're right, the "within group" ordering bit is what was throwing off duckdb. I commented that out and it built fine. I'll work out how to use the macro.

I'll take a look at the docs for the macro and materializing the test model as a table.

BradCr commented 7 months ago

I'm running into an issue trying to get the dbt.listagg() to work on Duckdb. Not sure if it's something I'm doing.

In the CTE before the one I'm adding, I create the paths into a column called list_agg_string. Here is the code I put in the model to create the change_file_path_to for the output:

     'More than one source. Split into separate staging models in: ' ||
    {{ dbt.listagg(measure="list_agg_string", delimiter_text="' AND '", order_by_clause="list_agg_string") }}  as change_file_path_to

This compiles to:

    string_agg(
        list_agg_string
        , ' AND '
        list_agg_string
        )  as change_file_path_to

There is a comma missing after the delimiter ' AND ', and before the order by column of list_agg_string. That's one issue.

The second issue is, if I add the comma manually in the compiled SQL and run that in Duckdb, it tells me Duckdb doesn't support three variables in the stringagg command:

SQL Error: java.sql.SQLException: Binder Error: No function matches the given name and argument types 'string_agg(VARCHAR, VARCHAR, VARCHAR)'. You might need to add explicit type casts.
    Candidate functions:
    string_agg(VARCHAR) -> VARCHAR
    string_agg(VARCHAR, VARCHAR) -> VARCHAR

If I remove the order_by_clause from the Jinja code, it outputs None as the order by, but still without the comma between the delimiter and the order by.

    string_agg(
        list_agg_string
        , ' AND '
        None
        )  as change_file_path_to

BradCr commented 1 month ago

I figured out the problem from the comment above. I was putting this into the model:

,'More than one source. Split into separate staging models in: ' ||
{{ dbt.listagg(measure="list_agg_string", delimiter_text="' AND '", order_by_clause="list_agg_string") }} as change_file_path_to

but the order_by_clause was supposed to be order_by_clause="order by list_agg_string". I hadn't included the order by inside the quotes.

I got the model built and next I need to work on the integration tests.

dbt-labs / dbt-project-evaluator

how to handle staging models that depend on multiple sources in fct_staging_directories #111