fivetran / dbt_hubspot

Data models for Hubspot built using dbt.
https://fivetran.github.io/dbt_hubspot/
Apache License 2.0
33 stars 39 forks source link

BUG - Duplicate Contacts from Merged Contacts #46

Closed fivetran-joemarkiewicz closed 3 years ago

fivetran-joemarkiewicz commented 3 years ago

Are you a current Fivetran customer?

Issue identified by Marion Pavillet

Describe the bug

Our integration to the datawarehouse is done with fivetran. Deletions are caught properly with a deleted flag that is correctly ignored by the package. Everything runs smoothly except for some tests related to contacts being unique in the email_events aggregated table:

unique_hubspot__email_event_clicks_event_id.sql
unique_hubspot__email_event_delivered_event_id.sql
unique_hubspot__email_event_opens_event_id.sql
unique_hubspot__email_event_sent_event_id.sql
unique_hubspot__email_event_status_change_event_id.sql
unique_hubspot__email_sends_event_id.sql

I did some investigation and it is due to contacts being merged in Hubspot. These contacts are not flagged as either deleted or inactive. Thus, there are duplicates in the email address (which is used in a JOIN clause and leads to duplicates downstream - specifically this SQL hubspot__email_event_sent.sql).

with base as (
    select *
    from database.schema_stg_hubspot.stg_hubspot__email_event_sent
), events as (
    select *
    from database.schema_stg_hubspot.stg_hubspot__email_event
), contacts as (
    select *
    from database.schema_stg_hubspot.stg_hubspot__contact
), events_joined as (
    select 
        base.*,
        events.created_timestamp,
        events.email_campaign_id,
        events.recipient_email_address,
        events.sent_timestamp as email_send_timestamp,
        events.sent_by_event_id as email_send_id
    from base
    left join events
        using (event_id)
), contacts_joined as (
    select 
        events_joined.*,
        contacts.contact_id
    from events_joined
    left join contacts
        on events_joined.recipient_email_address = contacts.email ---THIS IS WHERE IT IS MESSED UP
)
select *
from contacts_joined

Expected behavior

The merged contact is the contact that is used in transformations and not the older contacts.

DylanBaker commented 3 years ago

@fivetran-joemarkiewicz Hey. This makes sense. What's weird about the Hubspot data is that the information about the merge is on the contact that has been merged into, not on the contact that has been merged. There's a column called merged_vids which is an array of the contact IDs that have been merged into a given contact.

I think to filter these out we would need to grab the unique set of IDs stored in merged_vids and then filter any of those IDs out from the table. This could be done in the staging model, though we may want to leave the grain of that untouched? If that's the case, I think we could do it in an intermediate model in the transform package.

fivetran-joemarkiewicz commented 3 years ago

Thanks for looking into this @DylanBaker. I was not aware of the merged_vids column previously, but looking into it I see how it stores both the merged contact and contacts that have been merged into it.

My one curiosity is that on the dataset I have available it seems that the merged contacts no longer exist within the contact table? I wonder why the customer is still seeing these merged contacts within their dataset while I see no record of the merged contacts?

I do agree with you however, that this should be performed within an intermediate model in the transform package to leave the grain of the staging model untouched.

fivetran-joemarkiewicz commented 3 years ago

This feature has been integrated within the latest v0.4.1 release of the dbt_hubspot package. FYI the feature is disabled by default and you may enable the contact merge auditing by setting the hubspot_contact_merge_audit_enabled to true.