data-mie / dbt-profiler

Macros for generating dbt model data profiles
Apache License 2.0
81 stars 33 forks source link

`dbt_profiler.get_profile` doesn't seem to work with views in Redshift #86

Open stevenconnorg opened 6 months ago

stevenconnorg commented 6 months ago

It appears to be an issue in macro redshift__select_from_information_schema_columns in getting columns/data types from views.

Update to use pg_get_cols, e.g.:


select * from pg_get_cols('stg_final_view') 
cols(view_schema name, view_name name, col_name name, col_type varchar, col_num int);
stumelius commented 5 months ago

@stevenconnorg Thanks for reporting this and sorry for the late response. I'm okay with the pg_get_cols approach you described. Would you be willing to contribute the fix? I don't currently have access to a Redshift cluster and I'm therefore unable to test this. I'd appreciate your help :)

odikia commented 5 months ago

@stevenconnorg and @stumelius , what are your thoughts on adding additional support for external table references in addition to aforementioned views for Redshift? I have a local modification that I've made to dbt_utils package, that would likely also be required to dbt-profiler to take svv_external_schemas into account. Below is a local modification I had to make within a dbt_utils macro. Works well thus far.

with internalDbCTE as (
    select distinct
        table_schema as {{ adapter.quote('table_schema') }},
        table_name as {{ adapter.quote('table_name') }},
        {{ dbt_utils__vDGS.get_table_types_sql() }}
    from {{ database }}.information_schema.tables
    where table_schema ilike '{{ schema_pattern }}'
    and table_name ilike '{{ table_pattern }}'
    and table_name not ilike '{{ exclude }}'
)
-- following necessary for redshift external tables. May not work for other DBs
, externalDbCTE as (
    select distinct
        schemaname as {{ adapter.quote('schemaname') }},
        tablename as {{ adapter.quote('tablename') }},
        'EXTERNAL TABLE' as table_type
    from pg_catalog.svv_external_columns
    where redshift_database_name ilike '{{ database }}'
    and schemaname ilike '{{ schema_pattern }}'
    and tablename ilike '{{ table_pattern }}'
    and tablename not ilike '{{ exclude }}'
    )
select * from internalDbCTE
union all
select * from externalDbCTE

Here's an example of a call on my local DB

image

stumelius commented 5 months ago

@odikia Thank your for the suggestion. I don't have anything against adding support for external tables. Now, coming back to the issue of views not being included. In the query you provided, there's a union between information_schema.tables and pg_catalog.svv_external_columns - neither of these contain views, right? Do you know where we could get the view data? pg_get_cols works at the relation level and we'd have to loop over all the relations to get a complete picture - not ideal.

odikia commented 5 months ago

@stumelius , the naming convention may be wonky, but information_schema.tables does contain views! My query is just an example of a union that results in VIEW , BASE TABLE , and EXTERNAL TABLE information. If this union, or just information_schema.tables is okay given the local results in Redshift (pic I shared in this comment), then it will take a bit more testing given that the dtype differences between pg_catalog and information_schema may not function well downstream for dbt-profiler. I'll develop some use cases for each type of database object locally, to test out the union.

stumelius commented 5 months ago

@odikia Got it, thanks for clarifying! Feel free to open a PR when you have something and I'll have a look at it :)