fivetran / dbt_ad_reporting

Fivetran's ad reporting dbt package. Combine your Facebook, Google, Pinterest, LinkedIn, Twitter, Snapchat, Microsoft, TikTok, Reddit, Amazon, and Apple Search advertising metrics using this package.
https://fivetran.github.io/dbt_ad_reporting/#!/overview
Apache License 2.0
150 stars 56 forks source link

[Bug] Unable to Find Source Schemas on Database that is Different From the Target Database (Multi-Database Setup) #126

Open Kryt87 opened 1 day ago

Kryt87 commented 1 day ago

Is there an existing issue for this?

Describe the issue

Selecting the source database using relevant <connector>_database (reddit_ads_database etc.) variables in the dbt_project.yml file is not working as expected when the target database is not the same as the database where the data is stored. In this instance many "Please be aware" messages about being unable to find source tables in the database.

After reviewing the debug level logs and running a few tests, I observed that when running the staging models, a query against the Redshift information_schema.tables table is run but is pointing to the information_schema schema on the target database and not the data source database. Which is why the tables cannot be found.

Relevant error log or model output

CLI Output:

Please be aware: The CAMPAIGN_REPORT table was not found in your REDDIT_ADS schema(s). The Fivetran dbt package will create a completely empty CAMPAIGN_REPORT staging model as to not break downstream transformations. To turn off these warnings, set the `fivetran__remove_empty_table_warnings` variable to TRUE (see https://github.com/fivetran/dbt_fivetran_utils/tree/releases/v0.4.latest#union_data-source for details).

Debugging Logs:

select
        table_catalog as database,
        table_name as name,
        table_schema as schema,
        'table' as type
    from information_schema.tables
    where table_schema ilike 'fivetran_reddit_ads'
    and table_type = 'BASE TABLE'
    union all
    select
      table_catalog as database,
      table_name as name,
      table_schema as schema,
      case
        when view_definition ilike '%create materialized view%'
          then 'materialized_view'
        else 'view'
      end as type
    from information_schema.views
    where table_schema ilike 'fivetran_reddit_ads'
01:29:19.613576 [debug] [Thread-3 (]: SQL status: SUCCESS in 0.133 seconds
01:29:19.614802 [debug] [Thread-3 (]: While listing relations in database=ingestion, schema=fivetran_reddit_ads, found: 

Expected behavior

Expected that the required source tables could be identified in the ingestion database. There is no issue if the target database is the same as the target database.

Possible solution

The redshift__list_relations_without_caching appears to be what is used to list the tables in a schema. Not sure what needs to happen to get it pointing at the correct database.

dbt Project configurations

vars:
  linkedin_ads_database: ingestion
  linkedin_ads_schema: fivetran_linkedin_ads
  pinterest_database: ingestion
  pinterest_schema: fivetran_pinterest_ads
  reddit_ads_database: ingestion
  reddit_ads_schema: fivetran_reddit_ads

Package versions

What database are you using dbt with?

redshift

How are you running this dbt package?

dbt Core™

dbt Version

Core:

Plugins:

Additional Context

No response

Are you willing to open a PR to help address this issue?

fivetran-joemarkiewicz commented 1 day ago

Hi @Kryt87 thanks for opening this issue, but sorry to hear you're running into this problem.

Quick question before diving into what could be happening here. What is the name of your source database you would expect the package to be reading from as opposed to the target? In your reddit_ads_database variable I see you have defined ingestion, and then I can see in the log that ingestion was used for the query where no campaign_report table was found.

While listing relations in database=ingestion, schema=fivetran_reddit_ads

Is the ingestion database the source or is there a different database you would like the package to read from?

Kryt87 commented 1 day ago

Hi Joe, thanks for jumping on this so fast. The source database is called ingestion, where Fivetran is writing to and we want to read from. Our target database, where we want the models to reside is called prod.

fivetran-joemarkiewicz commented 1 day ago

Thanks for clarifying that! It does seem like from the logs that the package is reading from the ingestion database 🤔

Would you be able to confirm that in this case you do see the campaign_report table in the ingestion.fivetran_reddit_ads location in your Redshift destination?

Kryt87 commented 1 day ago

I can confirm that the campaign_report table exists in ingestion.fivetran_reddit_ads.

However, when I ran a few tests I was able to confirm that the packages was not reading from the ingestion database. In a dev target database I created the dummy table dbt_kurt_drew.fivetran_reddit_ads.test_reddit_ads. I then ran it again, where the debugging logs showed:

03:13:30.546025 [debug] [Thread-1 (]: While listing relations in database=ingestion, schema=fivetran_reddit_ads, found: test_reddit_ads
03:13:30.549303 [warn ] [Thread-1 (]: 

Please be aware: The CAMPAIGN_REPORT table was not found in your REDDIT_ADS schema(s).

This dummy table does not exist in the ingestion database.

fivetran-joemarkiewicz commented 1 day ago

Thanks for sharing. One last question before diving deeper - do you have any other configs (outside of what you already shared in the issue description) in your dbt_project.yml vars or models sections that could be causing the package to read from the incorrect location?

Kryt87 commented 1 day ago

Nothing that stands out to me 😞

fivetran-joemarkiewicz commented 11 hours ago

Thanks for checking @Kryt87, I was able to dig into this deeper and believe I may have found the culprit here.

I was able to recreate this error on Redshift. I did notice that this worked as expected on all other destinations which was a bit odd to me. However, for Redshift I saw the exact same behavior as you mentioned. I then went under the hood and was inspecting these lines which are run during the first query step of the Ad Reporting models and found that the proper source database.schema.table was being used. However, the relation.value would always be None and therefore result in the empty table warning you're seeing.

So it seems the package is interpreting the variables properly, but the actual relation.value is not working as expected. I then decided to hardcode a select * from database.schema.value in one of my models and I immediately saw the below error.

Database Error in model stg_reddit_ads__ad_tmp (models/tmp/stg_reddit_ads__ad_tmp.sql)
  Datasharing is not enabled on this cluster

This would make sense why the variables are working as expected, but then the actual relation is failing since datasharing isn't enabled on my end. Do you by chance have datasharing enabled on your end to facilitate this? I imagine if you're able to enable this then you'll be able see success when reading from one database and write to another.

Let me know if this helps address your issue!

Kryt87 commented 11 hours ago

I'm glad you are seeing the same error, I find Redshift can be very particular about this kind of thing. Regarding data sharing, it is enabled for the cluster. However, there are no data shares set up between these two databases and there shouldn't need to be as they are on the same cluster. Data sharing is primarily designed to facilitate sharing data across different Redshift clusters, which can be in different AWS accounts or regions.

When testing, I created both a staging model (view) that sourced the campaign_report and referenced it in an intermediate model. This all worked as expected, even though the source was on the ingestion database and the staging and intermediate models were created on a seperate dev databases.

fivetran-joemarkiewicz commented 10 hours ago

I also was confused why the data sharing was not enabled for the same cluster, but that's the error I was seeing.

It's also strange that you're able to query directly from the source in that staging model yet you don't see the error I saw. Could you share your test staging model view and the log result? I'm wondering if something may be causing issues at the adapter.get_relations level within dbt-redshift itself 🤔

Kryt87 commented 10 hours ago

I think you are on the right track with dbt-redshift.

The view is rather simple:

with source as (

    select * from {{ source('fivetran_reddit_ads', 'campaign_report') }}

),

renamed as (

    select * from source

)

select * from renamed

With the source being:


sources:
  - name: fivetran_reddit_ads
    description: ""
    database: ingestion

    tables:
      - name: campaign_report
        description: ""

debug.log - slimmed down debugging logs