fivetran / dbt_hubspot

Data models for Hubspot built using dbt.
https://fivetran.github.io/dbt_hubspot/
Apache License 2.0
33 stars 39 forks source link

BUG - email_events_joined macro joins on email address which is not unique in contact #58

Closed d0m-1n-1c closed 2 years ago

d0m-1n-1c commented 2 years ago

Are you a current Fivetran customer? Dominic, Data Engineer at Rezdy

Describe the bug The macro email_events_joined has the below join

    from events_joined
    left join contacts
        on events_joined.recipient_email_address = contacts.email

We have duplicate emails in contacts (stg_hubspot__contact). These emails have different contact ids but the email address is the same. Although an edge case there doesn't appear to be a rule against it

Steps to reproduce

  1. Have two different hubspot contacts with the same email address (these will be loaded into stg_hubspot__contact) in hubspot
  2. Make sure there are emails for the contact
  3. Just run and test the dbt_hubspot package normally it should create duplicate events in the specific email type tabes such as hubspot__email_event_dropped hubspot__email_event_opens However the original email table stg_hubspot__email_event will pass on it's unique test.

To get bad rows:

WITH duplicated_emails AS (
select
    email,
    count(*) as n_records

from `rezdy-dwh`.`dbt_dominic_stg_hubspot`.`stg_hubspot__contact`
where email is not null
group by email
having count(*) > 1
)

SELECT * 
FROM  `rezdy-dwh`.`dbt_dominic_hubspot`.`hubspot__email_event_bounce`
WHERE event_id IN (SELECT event_id FROM `rezdy-dwh`.`dbt_dominic_dbt_test__audit`.`unique_hubspot__email_event_bounce_event_id`)
AND recipient_email_address in (SELECT email FROM duplicated_emails )

Expected behavior A join on contact id rather than the email address to produce a unique grain

Project variables configuration

not relevant to this issue

Package Version

  - package: fivetran/hubspot_source
    version: 0.4.2

  - package: fivetran/hubspot
    version: 0.4.1

Warehouse

Additional context thats pretty much it ey

Screenshots

Please indicate the level of urgency Annoying but not critical, its creating a bunch of failing tests but we will likely fork the package and solve with a ROW_NUMBER() /QUALIFY to get the unique grain

Are you interested in contributing to this package?

fivetran-joemarkiewicz commented 2 years ago

Hi @d0m-1n-1c thanks so much for opening this issue!

We actually may have just what is needed to filter out those duplicate contacts! We found that HubSpot provides the functionality to "merge" contacts. However, we noticed this needs to take place in a downstream model. As such, we added a variables hubspot_contact_merge_audit_enabled that will remove these duplicate contacts.

This variable is false by default. If those duplicate contacts are in fact merged, then setting this variable to true should resolve your test failures. The README should help detail what this variable does.

vars:
   hubspot_contact_merge_audit_enabled: true

Let me know if this works. If it doesn't, we should do more investigating into why these duplicate records exist 👀

d0m-1n-1c commented 2 years ago

Hi @fivetran-joemarkiewicz thanks for getting back to me so quickly!

I completely glazed over that variable!!

I've implemented it and the number of duplicates has been reduced but the tests are still failing. It looks like there are two persisting issues:

  1. Hubspot contacts that are not yet merged which we can manually handle
  2. A few contacts that are merged but still appear inside our contact_merge_audit table with values in the vid_to_merge column.

This means that the logic introduced with the variable isn't filtering them out

select 
        contacts.*
    from contacts

    left join contact_merge_audit
        on contacts.contact_id = contact_merge_audit.vid_to_merge

    where contact_merge_audit.vid_to_merge is null

Could this be an issue with the hubspot connector itself?

fivetran-joemarkiewicz commented 2 years ago

Hi @d0m-1n-1c,

After thinking about this one for a while, I am pretty sure the variable logic for filtering out merged contacts should capture all appropriate contacts. That being said, I am not entirely sure why there are still a few that are still slipping through the cracks.

I think this may be a better question geared for our Connector Support team. They will have a better idea on how to address this issue if it truly is at the connector level.

fivetran-joemarkiewicz commented 2 years ago

Hi @d0m-1n-1c

I am closing this issue as opening the support ticket would ideally have solved this issue. Please feel free to open this issue if you are still seeing the error.