PR Overview
This PR will address the following Issue/Feature: #92
This PR will result in the following new package version: v0.13.0
This will require a full refresh following the upgrade. As such, this should be a breaking change.
Please detail what change(s) this PR introduces and any additional information that should be known during the review of this PR:
At the core of this change, we are adding the status_id field to the int_jira__field_history_scd and jira__daily_issue_field_history models. This is required because, as identified in the linked Issue, an incremental run currently tries to join the statuses using status_id; however, back in v0.9.0 we adjusted the package to produce the status name instead of the id. As a result, the join only works on a non-incremental run, where the data in the joined CTE still provides the status_id. Once an initial run is complete, the status field holds the name, so the join on status_id never succeeds.
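To make the failure mode concrete, here is a minimal sketch in plain Python rather than the package's actual SQL. The lookup table, its values, and the resolve_status helper are all hypothetical and exist only to illustrate the mismatch described above.

```python
# Hypothetical status lookup keyed by status_id (illustrative values only,
# not data from the package).
statuses = {"1": "Selected for Development", "2": "In Progress", "3": "Done"}


def resolve_status(field_value, lookup):
    """Mimic a left join on status_id: return the name, or None when nothing matches."""
    return lookup.get(field_value)


# Full refresh: the field history still carries the raw status_id.
full_refresh_value = "2"
# Incremental run: the previously built model already stores the status name.
incremental_value = "In Progress"

print(resolve_status(full_refresh_value, statuses))
print(resolve_status(incremental_value, statuses))
```

The first lookup finds a match because the value is an id; the second finds nothing because a name is being compared against ids.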
I originally attempted to simply adjust the join logic to join on status_id for non-incremental runs and on status_name for incremental runs. However, although Jira mentions that no two statuses will share the same name, that does not seem to be the case in the Jira data synced via our Fivetran connector; duplicates are in fact possible. I wonder if this has to do with deleted statuses that we do not capture. Regardless, joining on status_name would result in a fan out, as it is a many-to-many relationship.
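The fan out can be reproduced with a small, self-contained sketch using Python's built-in sqlite3 module. The table names, columns, and rows below are assumptions made up for illustration; they are not the package's models or real synced data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical status table where two ids share one name.
    create table status (status_id text primary key, status_name text);
    insert into status values
        ('1', 'In Progress'), ('2', 'In Progress'), ('3', 'Done');

    -- Hypothetical daily history that only stores the status name.
    create table field_history (issue_id integer, date_day text, status_name text);
    insert into field_history values
        (10049, '2020-11-11', 'In Progress'),
        (10049, '2020-11-12', 'Done');
""")

source_rows = conn.execute("select count(*) from field_history").fetchone()[0]

# Joining on the non-unique name duplicates the 'In Progress' day.
fanned_rows = conn.execute("""
    select count(*)
    from field_history
    left join status
        on field_history.status_name = status.status_name
""").fetchone()[0]

print(source_rows, fanned_rows)
```

With these sample rows the join returns more rows than the history table contains, which is the fan out described above.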
Therefore, to address this while still producing the status_name in the final model, I decided to bring in status_id as a default (and permanent) field. This way we can confidently join on status_id in downstream models while also producing the accurate status_name without risk of a fan out.
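As a rough illustration of why keeping status_id alongside the name avoids the problem, the sketch below joins on the unique id using Python's built-in sqlite3 module. The tables, columns, and rows are hypothetical stand-ins, not the package's actual models or SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical status table: names repeat, ids do not.
    create table status (status_id text primary key, status_name text);
    insert into status values
        ('1', 'In Progress'), ('2', 'In Progress'), ('3', 'Done');

    -- Hypothetical daily history that keeps the status_id on every row.
    create table field_history (issue_id integer, date_day text, status_id text);
    insert into field_history values
        (10049, '2020-11-11', '2'),
        (10049, '2020-11-12', '3');
""")

# Join on the unique id, then surface the name for the final model.
rows = conn.execute("""
    select
        field_history.issue_id,
        field_history.date_day,
        field_history.status_id,
        status.status_name
    from field_history
    left join status
        on field_history.status_id = status.status_id
    order by field_history.date_day
""").fetchall()

for row in rows:
    print(row)
```

Because status_id is unique in the lookup, each history row matches at most one status, so the row count is preserved and the name is still available.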
PR Checklist
Basic Validation
Please acknowledge that you have successfully performed the following commands locally:
[X] dbt compile
[X] dbt run --full-refresh
[X] dbt run
[X] dbt test
[X] dbt run --vars (if applicable)
I also ran with the daily_field_history variable set to join a few other fields just to ensure nothing was breaking with that logic.
Before marking this PR as “ready for review” the following have been applied:
[X] The appropriate issue has been linked and tagged
[X] You are assigned to the corresponding issue and this PR
[x] BuildKite integration tests are passing
Detailed Validation
Please acknowledge that the following validation checks have been performed prior to marking this PR as “ready for review”:
[X] You have validated these changes and assure this PR will address the respective Issue/Feature.
[X] You are reasonably confident these changes will not impact any other components of this package or any dependent packages.
[X] You have provided details below around the validation steps performed to gain confidence in these changes.
To validate this change, I particularly wanted to test the incremental strategy and its effects when running on new and stale data, as well as recreate the identified issue.
Recreate the issue
I recreated the issue by looking at a particular issue_id (10049) in the jira__daily_issue_field_history output model and comparing the results of a full refresh run and an incremental run on the latest version of the package.
Full Refresh run (looks good!)
Incremental run (Uh oh, looks like the incremental strategy is not working as intended)
New Data Tests
To begin, I looked at an individual issue_id (10049) from the jira__daily_issue_field_history output model and validated that the issue had 4 days and two previous statuses ("Selected for Development" and "In Progress") before being marked as Done on 2020-11-12. See the pic for details.
To then test the incremental strategy, I artificially limited the data (via the int_jira__issue_calendar_spine model) to 2020-11-10 and did a full refresh to reset the model to that data. See pics below.
I then advanced the date by one day to confirm that the "In Progress" status loaded correctly on the next non-full-refresh (incremental) run. See pics below.
As another test, I did the same but advanced the date by a few more days to check that it still worked. See pics below.
Woohoo it worked!! 🎉
Stale Data Tests
To check stale data, I removed my artificial date filter from the date spine and did a full refresh followed by a normal run to make sure everything looked good. See pics below.
Full refresh as of today
Normal run as of today (same results as expected)
Standard Updates
Please acknowledge that your PR contains the following standard updates:
Package versioning has been appropriately indexed in the following locations:
[X] indexed within dbt_project.yml
[X] indexed within integration_tests/dbt_project.yml
[X] CHANGELOG has individual entries for each respective change in this PR
[X] README updates have been applied (if applicable)
[X] DECISIONLOG updates have been applied (if applicable)
[X] Appropriate yml documentation has been added (if applicable)
dbt Docs
Please acknowledge that after the above were all completed the below were applied to your branch:
[ ] docs were regenerated (unless this PR does not include any code or yml updates)
I believe we should hold off on the docs since this will be batched together with other changes in the next release.
If you had to summarize this PR in an emoji, which would it be?
🧍♂️