Open graciegoheen opened 2 years ago
Just to give more context on this...
In fct_model_directories
we're specifying staging_models that have an incorrect file path based on their upstream source dependencies. However, if a staging model is dependent on multiple sources (against our best practices, and will cause a warning for other checks in this package, but alas it could still happen) then there will be multiple records for the staging model potentially indicating multiple different values for change_file_path_to
. This means that you may be unable to resolve the warnings from this check in the case where your staging model depends on 2 different sources (defined in 2 different directories).
staging_models as (
select
child,
child_resource_type,
child_model_type,
child_file_path,
child_directory_path,
child_file_name,
parent_source_name
from all_dag_relationships
where parent_resource_type = 'source'
and child_resource_type = 'model'
and child_model_type = 'staging'
),
-- find all staging models that are NOT in their source parent's subdirectory
inappropriate_subdirectories_staging as (
select distinct -- must do distinct to avoid duplicates when staging model has multiple paths to a given source
child as resource_name,
child_resource_type as resource_type,
child_model_type as model_type,
child_file_path as current_file_path,
'models{{ directory_pattern }}' || '{{ var("staging_folder_name") }}' || '{{ directory_pattern }}' || parent_source_name || '{{ directory_pattern }}' || child_file_name as change_file_path_to
from staging_models
where child_directory_path not like '%' || parent_source_name || '%'
),
Curious what should / could be a solution here?
A possible solution I came up with would be to create another CTE that identifies staging models that reference more than one source, and then do a listagg()
on the recommended paths to roll them up into a single column with a message that mentions the offending staging model should be split into separate staging models that each reference a single source. The error mess could look like this:
More than one source. Split into separate staging models in: models/staging/sales/ AND models/staging/finance/
Would your expectation be that the current suggested path still appears for the staging model (in fct_model_directores.change_file_path_to
) in addition to the new message about splitting the sources? If not, then the inappropriate_subdirectories_staging
CTE can be modified to JUST skip over staging models who have more than one source, and the multi-source models would have to be identified somewhere else (in a new model, perhaps). Or would you want to suppress the individual errors until the multiple sources get resolved and only present the multi-error at first? I don't think that would be ideal because then a user will fail the test, fix it, and then fail the test a second time in order to see both sets of errors.
SQL Here are the two CTEs I propose, the output of which can unioned in the final CTE at the end:
--Add this to get all the stage models with count of parent source. Next step will be to count which ones have more than one source parent
staging_by_parent_source_count as (
select
count(distinct parent_source_name) as resource_count
,child as resource_name
,child_resource_type as resource_type
,child_model_type as model_type
,child_file_path as current_file_path
from staging_models
group by child, child_resource_type, child_model_type, child_file_path
),
--Added this CTE to listagg() the multuple suggested paths and advise the user to split the source file into those two places.
multiple_sources_staging_to_split as (
select
staging_models.child as resource_name,
staging_models.child_resource_type as resource_type,
staging_models.child_model_type as model_type,
staging_models.child_file_path as current_file_path,
'More than one source. Split into separate staging models in: ' ||
listagg('models/' || 'staging' || '/' || staging_models.parent_source_name || '/', ' AND ')
within group(order by 'models/' || 'staging' || '/' || staging_models.parent_source_name || '/') as change_file_path_to
from staging_models
join staging_by_parent_source_count on staging_models.child = staging_by_parent_source_count.resource_name and
staging_models.child_resource_type = staging_by_parent_source_count.resource_type and
staging_models.child_model_type = staging_by_parent_source_count.model_type and
staging_models.child_file_path = staging_by_parent_source_count.current_file_path
where staging_by_parent_source_count.resource_count > 1
group by staging_models.child, staging_models.child_resource_type, staging_models.child_model_type, staging_models.child_file_path
),
unioned as (
select * from inappropriate_subdirectories_staging
union all
select * from innappropriate_subdirectories_non_staging_models
union all --Added union all to append these results
select * from multiple_sources_staging_to_split
)
select * from unioned;
I also found a small typo in the comments at the top of the model's query:
Where the second comment line says, "should be in nested in" there's an extra "in".
-- For staging models: The files should be in nested in the staging folder in a subfolder that matches their source parent's name.
@graciegoheen I found this issue last year and had put down my thoughts, but since I focused on the too_many_joins change first I hadn't come back to this until now.
This is another issue I would like to take on. I've edited my original notes for clarity, and I want to get your input on how you would like to see the error messages.
Hey! Thanks for being down to tackle this one.
More than one source. Split into separate staging models in: models/staging/sales/ AND models/staging/finance/
I like this suggestion. Here's what I'm thinking...
Scenario 1 If a model depends on multiple source tables, but those tables are within the same source...
# models/staging/jaffle_shop/sources.yml
sources:
- name: jaffle_shop
tables:
- name: orders
- name: customers
# models/staging/stripe/my_model.sql
select * from {{ source('jaffle_shop', 'orders') }}
union all
select * from {{ source('jaffle_shop', 'customers') }}
we should only flag that my_model.sql
should be moved to models/staging/jaffle_shop/
.
Scenario 2 If a model depends on multiple source tables, AND those tables are within different sources...
# models/staging/jaffle_shop/sources.yml
sources:
- name: jaffle_shop
tables:
- name: orders
# models/staging/stripe/sources.yml
sources:
- name: stripe
tables:
- name: payments
# models/staging/bamboo/my_model.sql
select * from {{ source('jaffle_shop', 'orders') }}
union all
select * from {{ source('stripe', 'payments') }}
we should flag that my_model.sql
should be split to separate models in models/staging/jaffle_shop/
and models/staging/stripe/
.
What do you think? Any scenarios I missed?
Happy to help!
Yeah, that sounds good. The only other scenario I can think of would be a combination of the two, but I think individually handling each scenario type will give the correct suggestions in the end. Let me put those two checks together and make sure they're good. Then I'll create something that breaks both best practices with a single model, to see what the output looks like.
There's an issue when I try to build the model with my new CTEs. It looks like duckdb doesn't support listagg()
. Here's the error message I get when I try to build.
Runtime Error in model fct_model_directories (models\marts\structure\fct_model_directories.sql) Parser Error: Unknown ordered aggregate "listagg".
My environment has duckdb version 0.9.2, dbt-duckdb version 1.7.1 and dbt-core version 1.7.5 .
Do you know if there's a similar function to listagg on duckdb? I can look into other commands that would work to bring multiple warnings into a single string. I also just posted on the dbt Slack #db-duckdb channel, to ask if anyone has suggestions.
Thanks, @graciegoheen! I also played around with the list_agg and you're right, the "within group" ordering bit is what was throwing off duckdb. I commented that out and it built fine. I'll work out how to use the macro.
I'll take a look at the docs for the macro and materializing the test model as a table.
I'm running into an issue trying to get the dbt.listagg()
to work on Duckdb. Not sure if it's something I'm doing.
In the CTE before the one I'm adding, I create the paths into a column called list_agg_string
. Here is the code I put in the model to create the change_file_path_to
for the output:
'More than one source. Split into separate staging models in: ' ||
{{ dbt.listagg(measure="list_agg_string", delimiter_text="' AND '", order_by_clause="list_agg_string") }} as change_file_path_to
This compiles to:
string_agg(
list_agg_string
, ' AND '
list_agg_string
) as change_file_path_to
There is a comma missing after the delimiter ' AND '
, and before the order by column of list_agg_string
. That's one issue.
The second issue is, if I add the comma manually in the compiled SQL and run that in Duckdb, it tells me Duckdb doesn't support three variables in the stringagg
command:
SQL Error: java.sql.SQLException: Binder Error: No function matches the given name and argument types 'string_agg(VARCHAR, VARCHAR, VARCHAR)'. You might need to add explicit type casts.
Candidate functions:
string_agg(VARCHAR) -> VARCHAR
string_agg(VARCHAR, VARCHAR) -> VARCHAR
If I remove the order_by_clause
from the Jinja code, it outputs None
as the order by, but still without the comma between the delimiter and the order by.
string_agg(
list_agg_string
, ' AND '
None
) as change_file_path_to
I figured out the problem from the comment above. I was putting this into the model:
,'More than one source. Split into separate staging models in: ' ||
{{ dbt.listagg(measure="list_agg_string", delimiter_text="' AND '", order_by_clause="list_agg_string") }} as change_file_path_to
but the order_by_clause
was supposed to be order_by_clause="order by list_agg_string"
. I hadn't included the order by
inside the quotes.
I got the model built and next I need to work on the integration tests.
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.