fivetran / dbt_linkedin_pages

Fivetran data models for LinkedIn Company Pages built using dbt.
https://fivetran.github.io/dbt_linkedin_pages/
Apache License 2.0
2 stars 4 forks source link

[Bug] The linkedin_pages__post model doesn't handle post history correctly #3

Closed simon-stepper closed 2 years ago

simon-stepper commented 2 years ago

Is there an existing issue for this?

Current Behavior

We see failing tests for linkedin_pages__posts.sql for a non-unique combination of columns of ugc_post_id & source_relation. image

Expected Behavior

The model is filtering to always use the latest version of this post and hence doesn't produce duplicated rows.

Steps To Reproduce

No response

Relevant log output

10:07:37  Completed with 1 error and 0 warnings:
10:07:37  
10:07:37  Failure in test dbt_utils_unique_combination_of_columns_linkedin_pages__posts_ugc_post_id__source_relation (models/linkedin_pages.yml)
10:07:37    Got 2 results, configured to fail if != 0
10:07:37  
10:07:37    compiled SQL at target/compiled/linkedin_pages/models/linkedin_pages.yml/dbt_utils_unique_combination_o_321ceac5241ebed20b1e3af34d1381f6.sql

What database are you using dbt with?

snowflake

Additional Context

The primary key for ugc_post_history is the combination of id & last_modified_time according to the ERD provided for the LinkedIn Company Pages connector. When posts get edited, they get a new row in this table, which can be identified by either the last_modified_time or the version_tag columns. image A similar solution to what is already used in int_linkedin_pages__latest_post.sql could be introduced to only use the most recent record for a post.

Are you willing to open a PR to help address this issue?

simon-stepper commented 2 years ago

Fivetran support confirmed that the raw data behaviour is as expected in https://support.fivetran.com/hc/en-us/requests/55700.

fivetran-joemarkiewicz commented 2 years ago

@NoToWarAlways thanks so much for this detailed bug report write up!

Based on the supporting information you provided, it is clear to me that the linkedin_pages__posts needs to filter for the most recent ugc_post_id & source_relation record. I just looked through our own data and recognized we have not edited any past posts which is most likely why this test hadn't failed on our end. Glad the test did it's job though in raising this issue!

I will review your PR above and let you know if I have any requested changes or comments before we merge. Thanks again so much. These contributions help make our packages better. We really appreciate it 💯