fivetran / dbt_hubspot

Data models for Hubspot built using dbt.
https://fivetran.github.io/dbt_hubspot/
Apache License 2.0
32 stars 38 forks source link

Deal Pipeline Stage Label is duplicated #75

Closed moseleyi closed 1 year ago

moseleyi commented 1 year ago

Is there an existing issue for this?

Describe the issue

Here is an extract for a deal that has two different stages:

image

The stage_ids are properly mapped in stg_hubspot__deal_pipeline_stage

But then, hubspot__deal_stages shows the same id for both stages

image

Relevant error log or model output

No response

Expected behavior

The first row should have Pipeline_stage_label = New and Deal_pipeline_stage_id = 6309924

dbt Project configurations

vars:
  hubspot_source:
    hubspot_database: FIVETRAN
    hubspot_schema: hubspot
  hubspot_email_event_print_enabled: false
  hubspot_email_event_forward_enabled: false
  hubspot_contact_merge_audit_enabled: true
  hubspot__pass_through_all_columns: true

Package versions

packages:

What database are you using dbt with?

snowflake

dbt Version

1.1.1

Additional Context

No response

Are you willing to open a PR to help address this issue?

fivetran-joemarkiewicz commented 1 year ago

Hi @moseleyi thanks so much for opening this Issue!

This definitely seems like a behavior that is not intended and should be addressed. I did some querying on my side as well and found the same results you are mentioning where the same stage is applied to each stage in the final model for the deal.

I looked a bit closer and found that we seem to filter out inactive stages in our staging model. https://github.com/fivetran/dbt_hubspot_source/blob/3888cde8d1d87ef65e406a9511509bb2af5128d1/models/stg_hubspot__deal_pipeline_stage.sql#L37

This in turn ensures that each stage in the hubspot__deal_stages only references the active stage (which I imagine is not the desired outcome). It seems that removing that where clause in the staging model could resolve this issue. Do you see any issue on your end with including inactive stages in the downstream transformations?

moseleyi commented 1 year ago

Since it's a historical table, the active flag is completely redundant, as we want to see all past stages regardless of their current status.

I ran the models without the WHERE clause you mentioned. Suddenly the deal ID has disappeared from hubspot__deal_stages AND hubspot__deals, even though it's still there in stg_hubspot__deal_stage.

I'm not sure this is it because when I look at stg_hubspot__deal_pipeline_stage_tmp all the stages have FALSE in _fivetran_deleted:

image

moseleyi commented 1 year ago

Any news? It's critical bug as it stops being able to analyse deal stages

fivetran-joemarkiewicz commented 1 year ago

Hi @moseleyi thanks for getting back with these details. I was able to look into this some more and found that removing the filter didn't have the same effect you are mentioning where the deal_id seems to be removed. Would you be able to swap your packages.yml dependency on the hubspot package for the one below:

packages:
  - git: https://github.com/fivetran/dbt_hubspot.git
    revision: bugfix/deleted-stages
    warn-unpinned: false

Let me know if this resolves the issue on your end!

moseleyi commented 1 year ago

Still seeing the same results:

hubspot.deal_stages image

stg_hubspot__deal_stage image

moseleyi commented 1 year ago

This code:

 select
        deal_stage.deal_id || '-' || row_number() over(partition by deal_stage.deal_id order by deal_stage.date_entered asc) as deal_stage_id,
        deal_stage.deal_stage_name, 
        deal_stage._fivetran_start as date_stage_entered,
        deal_stage._fivetran_end as date_stage_exited,
        deal_stage._fivetran_active as is_stage_active,
        deals_enhanced.*

    from deal_stage

    left join deals_enhanced
        on deal_stage.deal_id = deals_enhanced.deal_id

Is a little bit weird. Why do we bring all the deals' data to deal stages? That's why my query returns the label only of the current one. I expected the pipeline_stage_label to be related to deal_stage_name not to the current deal's one

moseleyi commented 1 year ago

I added this:

SELECT
  deal_stage_id,
  deal_stage_name,
  date_stage_entered,
  date_stage_exited,
  deal_id,
  dps.pipeline_stage_label,
  dps.deal_pipeline_id,
  dp.pipeline_label
FROM final
  LEFT JOIN analytics.base_hubspot.stg_hubspot__deal_pipeline_stage dps ON dps.deal_pipeline_stage_id = final.deal_stage_name
  LEFT JOIN analytics.base_hubspot.stg_hubspot__deal_pipeline dp ON dp.deal_pipeline_id = dps.deal_pipeline_id
WHERE
  deal_id = 9631290514

To get the name of each stage, and here are the results for the same deal:

image

fivetran-jamie commented 1 year ago

hey @moseleyi - so yes you are correct in that we are only grabbing pipeline stage information for the deal's current stage, which doesn't make sense for the hubspot__deal_stages model.

i've adapted your logic from the query above and joined in information for each stage if you wanna try out this branch again!

packages:
  - git: https://github.com/fivetran/dbt_hubspot.git
    revision: bugfix/deleted-stages
    warn-unpinned: false
moseleyi commented 1 year ago

Thanks @fivetran-jamie for the update. Because we bring all the data from Deals to Deal Stages, I'm still confused as to which field corresponds to the stages. pipeline_stage_label is stil a current stage because it's being taken from deals. Bringing all the data from parent to child table causes a lot of confusion especially since it's a historical table. There's no differentiation in hubspot__deal_stages as to which fields are historical (deal_stage_name, deal_stage_id) and which ones are current (pipeline_stage_label, deal_pipeline_stage_id).

Two things:

  1. I wouldn't bring all the data from parent to this table
  2. Pipeline Stage Label is still current one and doesn't correspond to deal_stage_id

That's because Deal Stage Name is still a numerical way of hubspot naming the stages. What we need is a label: image

fivetran-jamie commented 1 year ago

@moseleyi I completely agree -- having the parent deal data in there is confusing (and unnecessary, as one could just join with the hubspot__deals model)

i've removed the deal info (with the exceptions of deal_id and deal_name) from the bugfix/deleted-stages branch if you could re-run and check if things seem clearer/correct

moseleyi commented 1 year ago

Do I need to update the package information?

  - git: https://github.com/fivetran/dbt_hubspot.git
    revision: bugfix/deleted-stages
    warn-unpinned: false

When I do dbt build, I can't even see the latest code

image

fivetran-jamie commented 1 year ago

oh hm, if you're in dbt cloud i don't think you can dbt clean, but did you dbt deps after switching the hubspot package reference?

moseleyi commented 1 year ago

You need to change reference to the deal name field, it's dealname not deal_name from deals_enhanced

moseleyi commented 1 year ago

Apart from that, things are working as expected :)

fivetran-jamie commented 1 year ago

great! i'm a little confused about the dealname issue though...i believe we rename the column to deal_name in the staging model -- are you seeing an error somewhere?

moseleyi commented 1 year ago

Indeed I do see an error (after doing dbt deps and dbt build): image

When I look at Fivetran source hubspot.deal table it's property_dealname

Then it's renamed to to deal_name in stg_hubspot__deal but only in one of the IF statements

{% if var('hubspot__pass_through_all_columns', false) %}
        -- just pass everything through
        {{ 
            fivetran_utils.remove_prefix_from_columns(
                columns=adapter.get_columns_in_relation(ref('stg_hubspot__deal_tmp')), 
                prefix='property_') 
        }}
    from base

{% else %}
        -- just default columns + explicitly configured passthrough columns
        _fivetran_synced,
        is_deleted,
        deal_id,
        cast(deal_pipeline_id as {{ dbt_utils.type_string() }}) as deal_pipeline_id,
        cast(deal_pipeline_stage_id as {{ dbt_utils.type_string() }}) as deal_pipeline_stage_id,
        owner_id,
        portal_id,
        property_dealname as deal_name,
        property_description as description,
        property_amount as amount,
        property_closedate as closed_at,
        property_createdate as created_at

        --The below macro adds the fields defined within your hubspot__deal_pass_through_columns variable into the staging model
        {{ fivetran_utils.fill_pass_through_columns('hubspot__deal_pass_through_columns') }}

        -- The below macro add the ability to create calculated fields using the hubspot__deal_calculated_fields variable.
        {{ fivetran_utils.calculated_fields('hubspot__deal_calculated_fields') }}

    from macro
{% endif %}

But my value for this variable is actually true (so it should be renamded) but if I look into stg_hubspot__deal the name is dealname.

Shouldn't the statement in the IF say {% if var('hubspot__pass_through_all_columns', true) %} ?

fivetran-jamie commented 1 year ago

Ohhhh great catch ... yeah we're renaming property_dealname (and property_closedate and property_createdate) inconsistently, depending on the value of hubspot__pass_through_all_columns

I've just pushed some code that renames these columns to deal_name, closed_at, and created_at regardless of your value for hubspot__pass_through_all_columns if you could re-install and try it out! thanks

moseleyi commented 1 year ago

Works now : )

moseleyi commented 1 year ago

@fivetran-jamie Do I need to wait for the release?

image

fivetran-jamie commented 1 year ago

ah yeah the release should come out tomorrow -- we were rolling out the source package release but uncovered a bug with dbt core that impacts some new functionality in the packages

moseleyi commented 1 year ago

Thanks!

fivetran-jamie commented 1 year ago

release is live so closing this one out!