fleetio / dbt-segment

Data models for Segment built using dbt (getdbt.com).
https://hub.getdbt.com/dbt-labs/segment/latest/
Apache License 2.0
10 stars 7 forks source link

Duplication of records in the stitched model when materialization defined for sessionization folder in dbt_project.yml #9

Closed efernandez1-2u closed 1 year ago

efernandez1-2u commented 1 year ago

Describe the bug

I was experiencing a duplication of records when running the segment models, specifically the segment_web_sessions__stitched model which then affected the others downstream. I was able to resolve this by removing a configuration I had made for the segment models in my dbt_project.yml but I dont quite understand how that resolved it.

Steps to reproduce

in my dbt_project.yml file I added this block to define where I wanted the models target for delivery.

dbt_segment: enabled: true +tags: [dbt_segment]

base: +tags: [] +schema: ngwt_cmbd

sessionization: materialization: incremental +tags: [] +schema: ngwt_cmbd`

the materialization: incremental line seemed to be the problem because when I removed it, no duplicates occurred in either full refreshes or regular incremental runs. I just fail to understand how that had the ability to cause an issue. The materialization is already in the model definition so would it matter if its defined in either place or both, doesnt only one superceded the other? And in this case it was the same materialization so how would it have changed things.

Expected results

expected that models would upsert/merge any records to avoid duplication

Actual results

on full refreshes, no duplicates, any incremental runs after that would duplicate records it already had found. seems it would append instead of merge

Screenshots and log output

System information

The contents of your packages.yml file: `packages:

The output of dbt --version:

1.5

The operating system you're using: Mac OS The output of python --version: Python 3.9.12

Additional context

Are you interested in contributing the fix?

0adamjones commented 1 year ago

Hi @efernandez1-2u! Apologies for the delay here. We're materializing our models as views, and to your point that model has the incremental config at the model-level which would override anything from dbt_project.yml. I tried running something like:

select 
    session_id, count(*) 
from 
    segment_web_sessions__stitched 
group by 1 
having count(2) > 1

I didn't get any results. How are you identifying the presence of duplicates?

Also, we're using a Snowflake warehouse and don't have access to a Redshift environment. I don't believe we'd be able to write a fix if there is an issue and it's specific to Redshift. This is the first issue that's been submitted since we adopted this package from dbt, and I'm not 100% familiar with all of the ins and outs of this package but I'll do my best to help! 😄

efernandez1-2u commented 1 year ago

I used a query very similar to the one you posted to prove there were duplicates. the duplicated records were those being pulled in from the 3 hour trailing window, new records did not egt duplicated but those that were already brought in that were being pulled again from the next runs 3 hour trailing window were duplicated if that makes sense. Also every time I ran the model the record count would increase by the number of records in the trailing window

chrishiste commented 1 year ago

I had a similar issue that was fixed by running dbt clean and dbt deps. Maybe it will work for you too

efernandez1-2u commented 1 year ago

appreciate the suggestion, I will give it a try

0adamjones commented 1 year ago

@efernandez1-2u Were you able to figure this one out?

0adamjones commented 1 year ago

@efernandez1-2u Going to close this one out soon if it's no longer an issue :)