fivetran / dbt_pinterest_source

Fivetran data models for Pinterest Ads built using dbt.
https://fivetran.github.io/dbt_pinterest_source/
Apache License 2.0
2 stars 7 forks source link

[Bug] <Column Combination Uniqueness> #20

Closed jarlisle closed 1 year ago

jarlisle commented 1 year ago

Is there an existing issue for this?

Describe the issue

I've upgraded from 0.5.0 to 0.7.0 and in 0.6.0 you've implemented some new tests. One of which is the unique combination of columns on the stg_pinterest_ads__keyword_report model for the following columns:

In our implementation, that errors out with 466K records. In checking our data it becomes unique when you add in pin_promotion_id. I haven't been able to find any documentation to verify my thought that it's not unique without pin_promotion_id, but I'm curious to know if you found documentation that specifies the 5 fields above as unique.

Relevant error log or model output

No response

Expected behavior

I expect the test to error out if it fails. I'm just curious if the test should include an additional field.

dbt Project configurations

name: 'mathis_brothers_dbt' version: '1.0.0' config-version: 2

profile: 'mathis'

model-paths: ["models"] analysis-paths: ["analyses"] test-paths: ["tests"] seed-paths: ["seeds"] macro-paths: ["macros"] snapshot-paths: ["snapshots"]

target-path: "target" # directory which will store compiled SQL files clean-targets: # directories to be removed by dbt clean

models: mathis_brothers_dbt: reporting: materialized: view +schema: reporting intermediate: +schema: intermediate marts: ga4: events: org_specific: +schema: custom_events staging: ga4: events: org_specific: +schema: stg_custom_events

admind_dbt_ga4: staging: +schema: staging ga4: stg_ga4__events: materialized: table stg_ga4event_params: materialized: table stg_ga4event_user_properties: materialized: table marts: ga4: aggregated_reports: +schema: reporting processed_events: +schema: reporting intermediate: +schema: intermediate events: admind_standard: +schema: custom_events builtin: +schema: standard_events ecommerce: +schema: ecommerce

vars: admind_dbt_ga4: ga4_enums: ui_name: 'Google Analytics 4' db_name: 'ga4' system_type: 'web_analytics' not_set_value: '(not set)'

ga4_sources:
  # Note this currently only supports a single ga4 source.
  - name: ga4
    ui_name: Mathis Brothers
    business_unit: Mathis Brothers
    schema: analytics_272698041
    # optional database if different than target
    # database: YOUR_DATABASE
ga4_marts:
# To inject org-specific logic into a mart:
# Build an intermediate table(s),
# add an entry here for the mart name,
# use the `source_model` parameter to specify the intermediate table
  ga4__events:
    source_model: int__ga4__events
  ga4__event__1000_dollar_survey:
    source_model: int__ga4__event__1000_dollar_survey

ga4_config:
  ga4_start_date: '20220920'
  ga4_timezone: 'America/Denver'
  ga4_client_timestamp_parameter: 'client_timestamp_millis'
  # GA4 adMind-Specific Config
  ga4_engagement_max_seconds_per_event: 15

  builtin_events:
    - name: click
      event_params:
        - name: outbound
        - name: link_classes
        - name: link_domain
        - name: link_url
        - name: click_element
        - name: link_id
        - name: click_region
        - name: click_tag_name
        - name: click_url
        - name: file_name
      user_properties: [ ]
    - name: file_download
      event_params:
        - name: file_extension
        - name: file_name
        - name: link_classes
        - name: link_domain
        - name: link_id
        - name: link_text
        - name: link_url
      user_properties: [ ]
    - name: first_visit
      event_params: [ ]
      user_properties: [ ]
    - name: page_view
      event_params: [ ]
      user_properties: [ ]
    - name: scroll
      event_params:
        - name: percent_scrolled
          key: int_value
      user_properties: [ ]
    - name: session_start
      event_params: [ ]
      user_properties: [ ]
    - name: user_engagement
      event_params: [ ]
      user_properties: [ ]
    - name: video_complete
      event_params: [ ]
      user_properties: [ ]
    - name: video_complete
      event_params:
        - name: video_current_time
          key: int_value
        - name: video_duration
          key: int_value
        - name: video_percent
          key: int_value
        - name: video_url
        - name: video_provider
        - name: vide_title
        - name: visible
      user_properties: [ ]
    - name: video_start
      event_params:
        - name: video_current_time
          key: int_value
        - name: video_duration
          key: int_value
        - name: video_percent
          key: int_value
        - name: video_url
        - name: video_provider
        - name: vide_title
        - name: visible
      user_properties: [ ]
    - name: view_search_results
      event_params:
        - name: search_term
        - name: unique_search_term
      user_properties: [ ]

  ecommerce_events:
    # Standard ecommerce events listed here will trigger creation of staging and marts with specific fields for
    # each event type as documented here https://developers.google.com/analytics/devguides/collection/ga4/reference/events
    # If desired, additional event_params and user_properties can be added as with other events.
    - name: add_payment_info
    - name: add_shipping_info
    - name: add_to_cart
    - name: add_to_wishlist
    - name: begin_checkout
    - name: purchase
    - name: refund
    - name: remove_from_cart
    - name: select_item
    - name: select_promotion
    - name: view_cart
    - name: view_item
    - name: view_item_list
    - name: view_promotion

  custom_events:
    # adMind Standard
    # - name: engagement
    #   event_params:
    #     - name: engagement_time_msec
    #       key: int_value
    #   user_properties: [ ]
    # - name: ip_address_received
    #   event_params:
    #     - name: ip_address
    #   user_properties: [ ]
    # Org Specific events go here
    - name: video_progress
      event_params:
      - name: video_url
      - name: video_title
      - name: video_current_time
        key: int_value
      - name: video_duration
        key: int_value
      - name: visible
      - name: video_provider
      - name: video_percent
        key: int_value
    - name: apply for financing
      event_params:
      - name: click text
    - name: zipCodeApply
      event_params:
      - name: tax
        key: double_value
      - name: shipping
        key: double_value
      - name: zip_code
        key: int_value

- name: $1000 Survey

event_params:

- name: URL

- name: User ID

key: int_value

  channel_grouping:
    - channel: Paid Search - Branded
      column_rules:
        - column: source
          regex_pattern: "google|bing|yahoo|msn|gemini"
        - column: medium
          regex_pattern: "cpc"
        - column: campaign
          regex_pattern: "[Bb]rand"
    - channel: Paid Search - Non-Branded
      column_rules:
        - column: source
          regex_pattern: "google|bing|yahoo|msn|gemini"
        - column: medium
          regex_pattern: "cpc"
    - channel: Paid Social
      column_rules:
        - column: source
          regex_pattern: "facebook|twitter|linkedin|yelp|yotpo|igshopping|tiktok|instagram|reddit"
        - column: medium
          regex_pattern: "cpc|paid"
    - channel: Organic Social
      column_rules:
        - column: source
          regex_pattern: "facebook|twitter|linkedin|yelp|yotpo|igshopping|tiktok|instagram|reddit"
    - channel: Organic Social
      column_rules:
        - column: referrer
          regex_pattern: facebook|twitter|linkedin|yelp|yotpo|igshopping|tiktok|t\.co|instagram|reddit
    - channel: Display - Retargeting
      column_rules:
        - column: source
          regex_pattern: "display|cpm|banner|google|tabola"
        - column: campaign
          regex_pattern: "remarketing|retargeting"
    - channel: Display - Marketing
      column_rules:
        - column: medium
          regex_pattern: "display|cpm|banner"
        - column: campaign
          regex_pattern: "display"
    - channel: Email
      column_rules:
        - column: medium
          regex_pattern: email|Email
    - channel: Organic Search
      column_rules:
        - column: referrer
          regex_pattern: search|duckduckgo\.com|start\.att\.net|ecosia\.org|msn\.com|dogpile\.com|wowway.com|cox\.net|hidemyhistory\.co|start\.me|yandex|bing|google|360\.cn|alice\.com|aliceadsl\.fr|alltheweb\.com|altavista\.com|aol\.com|ask\.com|search\.aol\.fr|alicesuche\.aol\.de|baidu|biglobe|centrum\.cz|cnn\.com\/SEARCH|daum\.net|ecosia|ekolay|eniro\.se|globo\.com\/busca\/|go\.mail\.ru|goo\.ne\.jp|haosou\.com|kavsir\.no|mynet|najdi|naver\.com|szukaj\.onet\.pl|rakuten\.co\.jp|search-results|sesam|seznam|so\.com\/s|sogou|startsiden|szukacz|buscador\.terra\.com|voila\.fr|wp\.pl|yahoo|yam\.com|duckduckgo
    - channel: Organic Search
      column_rules:
        - column: medium
          regex_pattern: "organic"
        - column: source
          regex_pattern: "google|bing|yahoo|msn|gemini|duckduckgo"
    - channel: Direct
      column_rules:
        - column: source
          regex_pattern: "direct"
    - channel: Referral
      column_rules:
        - column: referrer
          regex_pattern: ".{1}"  # hacky not null regex

  columns_to_keep_on_all_event_dim_tables:
    - event_id
    - event_name
    - event_timestamp
    - event_value_in_usd
    - user_id
    - user_pseudo_id
    - ga_session_id
    - ga_session_number
    - privacy_info_analytics_storage
    - privacy_info_ads_storage
    - privacy_info_uses_transient_token
    - device_category
    - device_mobile_brand_name
    - device_mobile_marketing_name
    - device_mobile_model_name
    - device_operating_system
    - device_operating_system_version
    - device_language
    - device_is_limited_ad_tracking
    - device_advertising_id
    - device_vendor_id
    - device_browser
    - device_browser_version
    - geo_continent
    - geo_country
    - geo_region
    - geo_city
    - geo_sub_continent
    - stream_id
    - platform
    - business_unit
    - page_title
    - page_url
    - page_referrer
    - page_protocol
    - page_hostname
    - page_port
    - page_path
    - page_query
    - page_fragment
    - page
    - traffic_source_name
    - traffic_source_medium
    - traffic_source_source
  parameters_from_url:
    - event_parameter: gclid
      url_parameter: gclid
      user_property: gclid
      min_length: 55
    - event_parameter: msclkid
      url_parameter: msclkid
      user_property: msclkid
      min_length: 32
    - event_parameter: twclid
      url_parameter: twclid
      user_property: twclid
      min_length: 0
    - event_parameter: li_fat_id
      url_parameter: li_fat_id
      user_property: li_fat_id
      min_length: 0
    - event_parameter: fbclid
      url_parameter: fbclid
      user_property: fbclid
      min_length: 0
    - event_parameter: rdt_cid
      url_parameter: rdt_cid
      user_property: rdt_cid
      min_length: 0
    - event_parameter: utm_source
      url_parameter: utm_source
      user_property: utm_source
      min_length: 0
    - event_parameter: utm_medium
      url_parameter: utm_medium
      user_property: utm_medium
      min_length: 0
    - event_parameter: utm_campaign
      url_parameter: utm_campaign
      user_property: utm_campaign
      min_length: 0
    - event_parameter: utm_content
      url_parameter: utm_content
      user_property: utm_content
      min_length: 0
    - event_parameter: utm_term
      url_parameter: utm_term
      user_property: utm_term
      min_length: 0
  persisting_dimensions:
    last_known:
      general:
        - event_parameter: gclid
          user_property: gclid
          destination_column: gclid
          prefer: user_property
        - event_parameter: msclkid
          user_property: msclkid
          destination_column: msclkid
          prefer: user_property
        - event_parameter: twclid
          user_property: twclid
          destination_column: twclid
          prefer: user_property
        - event_parameter: li_fat_id
          user_property: li_fat_id
          destination_column: li_fat_id
          prefer: user_property
        - event_parameter: fbclid
          user_property: fbclid
          destination_column: fbclid
          prefer: user_property
        - event_parameter: rdt_cid
          user_property: rdt_cid
          destination_column: rdt_cid
          prefer: user_property
      org_specific: [ ]
    session:
      general: [ ]
      org_specific: [ ]
    hit:
      general:
        - event_parameter: ga_session_id,
          destination_column: system_session_id
        - event_parameter: ga_session_number,
          destination_column: system_session_number
        - event_parameter: session_engaged,
          destination_column: system_session_is_engaged
      org_specific:
        - event_parameter: user_interaction
          destination_column: is_user_interaction
        - event_parameter: optimizely_experiment_id
          destination_column: optimizely_experiment_id
        - event_parameter: gtm_container_version
          destination_column: gtm_container_version
        - event_parameter: gtm_container_id
          destination_column: gtm_container_id
        - event_parameter: link_domain
          destination_column: link_domain
        - event_parameter: link_classes
          destination_column: link_classes
        - event_parameter: link_text
          destination_column: link_text
        - event_parameter: form_id
          destination_column: form_id

Package versions

packages:

What database are you using dbt with?

bigquery

dbt Version

1.3

Additional Context

No response

Are you willing to open a PR to help address this issue?

fivetran-joemarkiewicz commented 1 year ago

Hi @jarlisle thanks so much for opening this issue.

That is interesting that you are experiencing this issue. Our understanding was that keywords were unique to the ad group and not to the pin promotion. That being said, this may be the case for some pinterest ad accounts for which we may have not accounted.

Would you be able to confirm if the keywords in your pinterest account are tied to an individual pin promotion id? If so, then this may be an adjustment that we should make to our testing philosophy.

jarlisle commented 1 year ago

@fivetran-joemarkiewicz In the data, there is not a 1:1 relationship between the keyword and the pin_promotion_id. Does that answer your question?

Edit: In our data the keyword is not unique to the ad group. There is a 1:many relationship between the ad group and the keyword.

fivetran-joemarkiewicz commented 1 year ago

Hi @jarlisle that does answer my question, thanks for the insight!

The ad group to keyword relationship is alignment with my understanding; however, I am still trying to fully understand the pin promotion relationship to keyword and how we should handle this within our test. 🤔

While I continue digging on my side, I think it would be a good exercise to see what adjusting the test to include pin_promotion_id would look like for your data. I just made a few changes to the dbt_pinterest_source and dbt_pinterest packages to include this change. If you have a moment, would you be able to use the following in place of your hub package dependency of pinterest and confirm what the results are on your end?

packages:
# Momentarily comment out for testing purposes and use the git version to install
#   - package: fivetran/pinterest
#     version: [">=0.7.0", "<0.8.0"]
  - git: https://github.com/fivetran/dbt_pinterest.git
    revision: testfix/keyword-report
    warn-unpinned: false 
jarlisle commented 1 year ago

@fivetran-joemarkiewicz, that worked without error. In investigating this I had a hunch that pin_promotion_id needed to be part of the test, but I wasn't able to confirm that anywhere in pinterest's docs.

fivetran-joemarkiewicz commented 1 year ago

Thanks so much for the confirmations @jarlisle! This is incredibly helpful and possibly something we will want to move forward with.

FYI I will be on PTO until next Tuesday. I will likely move forward with integrating this into a patch release then. Unless I come across evidence that informs this is not the right approach. Thanks!

fivetran-joemarkiewicz commented 1 year ago

Hi @jarlisle I just wanted to share an update here that I am going to have a member on my team take over this issue going forward. I am glad to see the test adjustment in the above branch is working for you! However, I want us to get a better understanding of the behavior of keywords before adjusting the test for all our users.

We hope to get a solid path forward with an official release by the end of next week!

fivetran-catfritz commented 1 year ago

@jarlisle we have released version v0.7.1, which addresses this issue. Closing this issue!