[Bug] Incorrect Historical Value when Two Updates on the Same Day

jon-goldberg commented 1 year ago

Is there an existing issue for this?

[X] I have searched the existing issues

Describe the issue

int_jira__pivot_daily_field_history has an issue where there are two transactions on the same day one of which is null. The scenario is as follows.

An issue has two transactions on the same day changing the sprint. Its initially null then is set to an assigned sprint.
int_jira__daily_field_history changes Null values to the text 'is_null' case when field_value is null then 'is_null' else field_value end as field_value,
int_jira__pivot_daily_field_history chooses the max value for any sprints. The text 'is_null' is greater than the actual sprint id and is used.

I imagine this affects all other historical fields where there are two transactions on the same day. The code should be changed to select the last transaction on a day and use those values over others. Its not as simple as ignoring the nulls as those could be the last transaction in certain cases. This is the offending code block.

pivot_out as (

    -- pivot out default columns (status and sprint) and others specified in the var(issue_field_history_columns)
    -- only days on which a field value was actively changed will have a non-null value. the nulls will need to 
    -- be backfilled in the final jira__daily_issue_field_history model
    select 
        valid_starting_on, 
        issue_id,
        max(case when lower(field_id) = 'status' then field_value end) as status,
        max(case when lower(field_name) = 'sprint' then field_value end) as sprint

        ,
            max(case when lower(field_name) = 'due date' then field_value end) as due_date
        ,
            max(case when lower(field_name) = 'labels' then field_value end) as labels
        ,
            max(case when lower(field_name) = 'reporter' then field_value end) as reporter
        ,
            max(case when lower(field_name) = 'resolution' then field_value end) as resolution
        ,
            max(case when lower(field_name) = 'resolved' then field_value end) as resolved
        ,
            max(case when lower(field_name) = 'story points' then field_value end) as story_points
        ,
            max(case when lower(field_name) = 'team' then field_value end) as team
        from daily_field_history

    group by 1,2
),

Relevant error log or model output

No response

Expected behavior

The null value should be ignored if its an earlier transaction on that day. Only the last field_value should be selected.

dbt Project configurations

N/A

Package versions

N/A

What database are you using dbt with?

postgres

dbt Version

N/A

Additional Context

pivot_daily_field_history

Are you willing to open a PR to help address this issue?

[X] Yes.
[ ] Yes, but I will need assistance and will schedule time during our office hours for guidance
[ ] No.

fivetran-jamie commented 1 year ago

hey there, really appreciate your thoroughness! int_jira__daily_field_history, which the pivot model references, should be looking at only the last Sprint value for an issue for each day

i actually think the issue is that you have two different field ids for Sprint (customfield__10020 and customfield_10106), and the package partitions by the field_id instead of the field_name. Do you know what's up with that?

jon-goldberg commented 1 year ago

Our environment was customized over the years so I wouldn't be surprised if they changed a setting that resulted in two fields being used behind the scenes. With that said we only have one Sprint field I can see in the UI and its showing both of the toggled sprints.

Thanks for pointing me to that earlier table.

From a consistency perspective perhaps 'int_jira__daily_field_history' should be using the field name instead of the ID for the partition? Alternatively the pivot logic would need to be changed but I'd think these go hand in hand.

row_number() over ( partition by valid_starting_on, issue_id, field_id order by valid_starting_at desc ) as row_num

fivetran-jamie commented 1 year ago

i think for most most folks, partitioning by the field_id seems like the safer route 🤔

for your case however, i'd recommend overriding the int_jira__daily_field_history model to partition by field_name instead of the ID (you may also need to override downstream models that also partition by or join on field_id). we've got some instructions around how to do so here

jon-goldberg commented 1 year ago

Thanks Jamie! Your guidance on how the package was intended to work makes that easier. I'll make one last appeal to a broader fix but likely proceed to making a fix on our end.

I understand this situation might not occur regularly but it could happen to others. The risk is without rigorous testing it would go unnoticed and result in incorrect reporting. If it is discovered later than people will waste cycles trying to fix it.

As a data practitioner that subscribes to 'No Broken Windows' - this bothers me. There are multiple ways to solve the problem and I think this would ultimately be a fix that is relatively easy to implement now as we know about it while preventing future headache for others.

In any case- thank you for the support.

fivetran-jamie commented 1 year ago

Yeah that's very fair, as there have honestly been quite a few discussions around field_name vs field_id in Jira since we first released this package.

Next sprint (in a week), the team and I take some time to think this through. can't promise anything (except that we'll seriously explore it 🤠 ), but I want to find a solution that can address your (and probably others') issue without adding too much complexity to the package.

jon-goldberg commented 1 year ago

Hey @fivetran-jamie - Just curious where the team landed on this. Appreciate all of the support again. We'll likely be trying out of a few of the prebuilt transformation packages soon.

fivetran-jamie commented 1 year ago

hey hey -- so we will indeed be adding the option to use field_name instead of field_id partitions/joins! 🎉

this will probably be achieved through a jira_field_grain variable that you can set to field_name from your dbt_project.yml

Thanks again for pushing on this @jon-goldberg 🙌 In my investigations this past sprint, I realized that the package definitely assumes field names to be unique, which is recommended by Jira but clearly not always the case. I think this will be helpful to lots of folks!

Timeline-wise, I'll be working on this next sprint (starts tomorrow)

jon-goldberg commented 1 year ago

Wooo-hoo! Awesome. Thanks Jamie for the support 💪.

fivetran-jamie commented 1 year ago

hey hey @jon-goldberg -- i've got a working branch if you have time to test it out!

to do so, you'd replace the jira part of your packages.yml with

packages:
  - git: https://github.com/fivetran/dbt_jira.git
    revision: feature/switch-field-grain

and then in your dbt_project.yml, add the following (new) variable

vars:
  jira_field_grain: 'field_name' # default value is still field_id

FYI i am still recreating your use case (ie duplicate fields) in our own Jira sandbox environment, so would really appreciate you testing this out if you've got bandwidth!

jon-goldberg commented 1 year ago

@fivetran-jamie Apologies for the delay - we were pulled into some project related work. Philip and I just tested - this appears to have addressed the problems we faced before but it did cause a failing test for some of our duplicate fields.

I am guessing this is to be expected and we should remove the test. I spot checked a field of the problem issue ids and the final table matches what I see in Jira.

unique_int_jira__combine_field_histories_combined_history_id

Thank you so much for the help!

fivetran-jamie commented 1 year ago

no problem! oh huh that's interesting... we do use the combined_history_id as the unique_key for the incremental configuration, so we it should unique for incremental runs to be accurate and complete

currently, combined_history_id is a surrogate key hashed on var('jira_field_grain'),'issue_id', 'valid_starting_at'] but perhaps it NEEDS to include field_id regardless of your jira_field_grain value. i am thinking this because at this point in the DAG, we haven't rolled fields up to the name-grain yet (they're still at the ID-grain in this model)

cc @fivetran-joemarkiewicz

fivetran-jamie commented 1 year ago

@jon-goldberg could you try this out and see if everything still ties out (and if the test failure is removed)?

# packages.yml
packages:
  - git: https://github.com/fivetran/dbt_jira.git
    revision: releases/v0.13.latest

updated the branch FYI

fivetran-jamie commented 1 year ago

heads up we're planning to release these updates tomorrow if you @jon-goldberg (or anyone creeping on this thread 👁️ 👁️) have a moment today or tomorrow to retest 🌟

fivetran-joemarkiewicz commented 1 year ago

Hi All, the updates @fivetran-jamie highlighted above are now live in the latest version of the package (v.0.13.0)! Thank you all for being engaged in this discussion and helping to make a meaningful update to the package.

As such, I will close out this issue. Please feel free to reopen if the issue persists on your end. Thanks again!

jon-goldberg commented 1 year ago

Thanks Jamie and Joe! Philip didn't have time to update our codebase to use the new packages so I could test but he is planning to do so next week. Appreciate the support with this.

fivetran / dbt_jira