apache / incubator-devlake

Apache DevLake is an open-source dev data platform to ingest, analyze, and visualize the fragmented data from DevOps tools, extracting insights for engineering excellence, developer experience, and community growth.
https://devlake.apache.org/
Apache License 2.0
2.61k stars 530 forks source link

[Bug][Dora-Cahnge_lead_time_calculator] incorrect linking of commits to deployments when using azuredevops and webhook #7193

Closed wouldd closed 7 months ago

wouldd commented 8 months ago

Search before asking

What happened

I assume that the issue I'm seeing is somehow unique to me using the azuredevops_go plugin and the webhook for reporting deployments from octopus. However the 'bug' I've identified looks to me like it would effect everyone. So probably the real issue is somewhere else but all I can say is that for me the DORA metrics are all wildly out because it relates every commit that has ever been made against a repo to the most recent deploy. This means the cycle time calculation for each commit is showing as just the age of that commit until the most recent deploy. I believe the offending line is this one: https://github.com/apache/incubator-devlake/blob/cb69f35b5319f57dd3d6c8534121a6ca069cd175/backend/plugins/dora/tasks/change_lead_time_calculator.go#L215C2-L215C3 the commits_diffs table has 3 columns, commit_sha, old_commit_sha and new_commit_sha. For me I need the filter in this query to filter where the mergeSha = cd.new_commit_sha in order to work correctly. I've tested this in my setup and that change has the expected behaviour. However, I would obviously expect this to be broken for everyone doing DORA metrics at all, and it seems unlikely no one else noticed this. So I guess maybe the intention of that commits_diffs table is to have information in it differently and perhaps the azure plugin is populating it incorrectly?

What do you expect to happen

In my case I have lots of quite old 4+years minimum repos. but I've only just started feeding deploy events in. so I would expect only to have data for the relatively few new commits since this went into place. and I would expect the cycle time calculations for those to be right.

How to reproduce

configure a long standing repo with lots of historical commtis against it. create a webhook call for the most recent commit/commtis in teh repo and expect to see dora calculations for those commits only. However you should see that it links all of the old commits to the new deployment. Indeed when I run with a version in which I changed the above filter check then I get the behavior I expect. all the numbers make sense and the pr_cycle_time calculations show up numbers for only commits that have been reported as part of deployments

Anything else

Happens every time for all commits.

Version

master

Are you willing to submit PR?

Code of Conduct

Startrekzky commented 8 months ago

Hi @wouldd , I'm trying to understand what you have encountered. Correct me if I'm wrong:

  1. You collected commits and deployments from Azure DevOps Go plugin, not from other Git plugins or Azure DevOps python plugin.
  2. You found that DORA's change lead time was incorrect. The time was much longer than you expected.
  3. You found that the records in the commits_diff table were incorrect. All commits rather than new commits were associated with the latest deployments.
  4. You thought the third point might be a bug. This bug only impacts the data integrity of change lead time calculated from the raw data from the Azure DevOps Go plugin?
wouldd commented 8 months ago

@Startrekzky pretty much. 1) yes usig azure devops go plugin which is what is pulling commits/pr information. and using teh 'webhook' to post deployment information that includes the shas of the commits being deployed. 2) yes, change lead time had been calculated for all commits in a repo and the time calculated for all of them was essentiaially from their moment of commit until the date of the recent deployment (which in reality only deployed 3 commits) 3) the commits_diffs table seems to just be a relationship table between commits, I'm not super clear on what it represents, but yes the 'new' commit that was included in the deployment had an entyr in here linking it back to the first commit ever in the repository 4) as mentioned if I modify the query in the lead time calculator to filter on 'new commit sha' then I appear to get sensible values restricted to those commits that were actually part of my deploy.. The odd thing is that if hte query is just wrong and it really should be filtering on new_commit_sha then I would expect this bug to impact all users of any datasource and dora is just broken. But since I would assume it would be pretty obvious if it wasn't working for everyone, I must conclude that the shape of this table is not being populated by the azure plugin as expected by the system at large.

Startrekzky commented 8 months ago

Hi @wouldd This seems to be a problem related to the refdiff. I'm not sure if the refdiff calling logics are the same among all plugins. I'll forward it to @klesh to take a look.

klesh commented 8 months ago

I suspect that the merge_commit_sha might be wrongly picked. https://learn.microsoft.com/en-us/rest/api/azure/devops/git/pull-requests/get-pull-requests?view=azure-devops-rest-7.1&tabs=HTTP

image

Currently, both azuredevops plugins are using lastMergeCommit as the merge commit sha, but maybe it should be lastMergeTargetCommit ?

For Change Lead Time to be calculated correctly, the following requirements must be met:

  1. Deployment Commits (for a repo) could form a Linear History. It means any newer Deployment Commit should contain all older Deployment Commits
  2. pr.merge_commit_sha must be found in the Linear History

loop in @mr-ks

wouldd commented 8 months ago

@klesh that certainly sounds plausible. I'm trying to wrap my head around your point 1 about all deployment commits forming a linear history 'newer deployment commit should contain all older deployment commits' I'm not sure I understand why that would be desired? I would assume that for change cycle time you need 1) a deployment id with it's date of deployment, result, environment. 2) a set of commit shas that are linked to that deployment id as the vehichle that deployed them, 3) a lookup of all commits with their commit date and a link to the PR that merged them.

So you would process a deployment row, link it's deployment commits and join to their commit records and pr record. then calculate the various time deltas for commit->pr create->pr-merge->deployment to production and record that cycle tyime against those commits. So any commit that has no associated deployment would have no cycle time values (the vast majority of our commit history was deployed long ago and we won't be back porting that information just starting from now) dora calcs /boards would work on only those commits that have non-null values for cycle time I'm not sure I see in this why a deployment commit would be linked to older deployments at all?

klesh commented 8 months ago

@wouldd It is needed to determine which deployment a PR belongs to.

If I understand your assumption correctly, you are connecting the PR to deployment by commit dates? no, it won't work, because the deployment happened after a commit got committed not necessarily including the commit. For example, one may start committing to a feature branch for the next version while a deployment on the release branch for the current version is taking place.

wouldd commented 8 months ago

@klesh I don't think we need to relate things by date at all. I think you have a deployment record provided by the webhook where that record says 'these are teh commits that were deployed in this deployment' I think you have a PR record provided by the azure devops api that says 'these are the commits that were part of this PR' so from ad eployment you can select the commits it deployed and join them to the pr commits to find the pr(s) that they were merged in.

So between these you can link all the data required to caluclate cycle time. you know exactly which commits where in a deployment and can look up which PR those commits were associated with to reference those timings and look up what date that commit was originally made to calculate how long it took from that point to the point it was deployed.

Why would a deployment commit be linked to all previous deployment commits?

klesh commented 8 months ago

@klesh I don't think we need to relate things by date at all. I think you have a deployment record provided by the webhook where that record says 'these are teh commits that were deployed in this deployment' I think you have a PR

No, this is not the case. The webhook, or any other data sources would have only 1 commit sha for each repo per deployment stored in the cicd_deployment_commits table, so we have to calculate the commits between deployments, which is covered by https://github.com/apache/incubator-devlake/blob/49f276f558a2ea7124dfdc680bccbef27bf0b581/backend/plugins/refdiff/tasks/deployment_commit_diff_calculator.go

Let's assume that you have a release branch, you merge PR to it then deploy, at any point of time, you pick the latest commit of the branch, and search all its ancestors, you can find every single deployment commits up to the point, then you are guaranteed to have a correct PRs and deployments relationship. This is important, because one may merge a couple of PRs before a deployment.

Sorry, I shouldn't use the term Linear Commit History, it is not accurate, we just need all the deployment commits gets included in the history.

record provided by the azure devops api that says 'these are the commits that were part of this PR' so from ad eployment you can select the commits it deployed and join them to the pr commits to find the pr(s) that they were merged in.

That is the idea but not the whole story, in fact, commits of a PR morally don't end up in the deployment commit history., especially, when most projects tend to use Squash and Merge which no PR.commits would ever end up in the release branch. To address this, we use merge_commit_sha ( or squash_commit_sha or head_commit_sha) which would definitely appear in the commits history to represent if a PR is included in a deployment.

So between these you can link all the data required to caluclate cycle time. you know exactly which commits where in a deployment and can look up which PR those commits were associated with to reference those timings and look up what date that commit was originally made to calculate how long it took from that point to the point it was deployed.

Why would a deployment commit be linked to all previous deployment commits?

Let me know if I missed anything.

wouldd commented 7 months ago

@klesh Sorry for the slow response, I was on vacation for the last two weeks then getting ontop of other work stuff. I saw that you had put in a fix related to this issue and I tried applying that fix. however it does not help. the dora reports still calculates every commit ever agfainst my respository against the only deployment it has been told about. I suspect there is a fundamental presumption in this that this system will always have full information about every deployment from the start of the repository? however that is not going to be the case for us (or I assume most users). so we might have a 3 year old rep owith thousands of commits against it that were deployed historicaly hoever we have no way and no intention to try to tell devlake about those historical deployments. We just want information as it pertains to new deployments and the commits there. I saw you added a build url to the cicdbuild converter in your fix, which I assume was hoped would limit the scope of the queries to just things that had a populated build url? The other issue here is that our deployments are not happening from build pipelines. they're handled by a completely separate tool.(hence the use of the webhook)

unfortunately as it stands this makes the dora metrics unusable for us, and those would be the main thing we're interested in.

nicolavolpini commented 7 months ago

I might be encountering the same bug. If not, please feel free to delete this comment as I don't want to stray the discussion.

I created a project with a single Github repo, using the GitHub connection for commits and PRs. A webhook is used for deployments. The DORA Details - Lead Time for Changes board shows the correct PR averages but the 2. PR details table shows a ridiculously high change_lead_time (see screenshots). It appears to be associating all the PRs in the project to the same and currently only webhook commit (notice how the deployment_id is the same for all PRs).

List of PRs in board:

image

Correct averages: image

This is also visible in the project_pr_metric table, where all PRs match the same deployment_commit_id (in this case the webhook id).

I'm on 1.0.0b4.

Here's the sanitized CURL for the webhook:

      "deploymentCommits":[
        {
        "commit_sha":"<sha>",
        "repo_url":"https://github.com/<org>/<repo>"
        }
      ],
      "create_time":"2024-04-11T11:39:33+00:00",
      "start_time":"2024-04-11T11:39:33+00:00",
    "end_time":"2024-04-16T11:51:35+00:00",
    "result": "SUCCESS"
     }'

Thanks!

wouldd commented 7 months ago

@Startrekzky I'm curious about the characterisation of the bug priority you set being 'does not affect the functionality or isn't evident' from my point of view this renders one of the core dora metrics unusable. I would consider that to be a very evident impact on function and a key metric we would be interested in.

Startrekzky commented 7 months ago

Thank you for pointing this out, @wouldd . I mis-operated it. It's updated now.

klesh commented 7 months ago

I might be encountering the same bug. If not, please feel free to delete this comment as I don't want to stray the discussion.

I created a project with a single Github repo, using the GitHub connection for commits and PRs. A webhook is used for deployments. The DORA Details - Lead Time for Changes board shows the correct PR averages but the 2. PR details table shows a ridiculously high change_lead_time (see screenshots). It appears to be associating all the PRs in the project to the same and currently only webhook commit (notice how the deployment_id is the same for all PRs).

List of PRs in board: image

Correct averages: image

This is also visible in the project_pr_metric table, where all PRs match the same deployment_commit_id (in this case the webhook id).

I'm on 1.0.0b4.

Here's the sanitized CURL for the webhook:

      "deploymentCommits":[
        {
        "commit_sha":"<sha>",
        "repo_url":"https://github.com/<org>/<repo>"
        }
      ],
      "create_time":"2024-04-11T11:39:33+00:00",
      "start_time":"2024-04-11T11:39:33+00:00",
    "end_time":"2024-04-16T11:51:35+00:00",
    "result": "SUCCESS"
     }'

Thanks!

Was the deployment the earliest one? If so, it is expected.

klesh commented 7 months ago

I think the core of the problem is that the time ranges of deployments and PRs are not aligned. Maybe we can solve the problem by either of the following methods:

@wouldd @nicolavolpini

wouldd commented 7 months ago

@klesh the first option here is a complete non-starter. we have >1000 repos which in some cases have existed for over 10 years. we have no way to recreate the deployment history and no reason to do so. we're not intereested in looking back that far at metrics. I think this would be common to most real-world users, they won't be starting out with devlake in place, they will be adding it to probably quite mature teams with long histories which are incomplete. the system needs to be able to handle picking up from a postiion of only being told about new deployments even though there is a long commit history already.

I would say that the dora metrics needs to be restrcted to starting at the earliest commit that any deployment webhook call has explicitly linked to a deployment. This is why I though the sql in the linking logic query was simply wrong in my initial bug description.

I would assume @nicolavolpini is in a similar boat since those numbers showed 6 month lead times.

nicolavolpini commented 7 months ago

I think this would be common to most real-world users, they won't be starting out with devlake in place, they will be adding it to probably quite mature teams with long histories which are incomplete. the system needs to be able to handle picking up from a postiion of only being told about new deployments even though there is a long commit history already.

This is precisely the situation we are in.

In our case, we had to rely on Webhooks for deployments since our pipeline structure - although it uses Jenkins - makes it impossible to correlate Jenkins deployment SHAs to GitHub PR SHAs (we have an intermediate step where Jenkins is picking artifacts from a central repo, thus a different SHA is shown in the job).

On a side note: what I expected was for a deployment webhook to be bound only to the PR it relates to. This is at least what I assumed as a user approaching DevLake and DORA metrics for the first time. I think it's a natural assumption.

Startrekzky commented 7 months ago

Hi @nicolavolpini and @wouldd , @klesh and I have reached an agreement on the solution to this issue, which is not linking the old PRs to the first deployment pushed from webhook.

This solution will fix the issue you have faced. The con is that the very first deployment will not be linked to any PRs. So, there might be a very small number of PR's cycle time that can not be matched, but I think it's acceptable.

Does it make sense?

nicolavolpini commented 7 months ago

Hello @Startrekzky , thanks that would be a good compromise for me. Better than the wrong data for sure. I do not understand why the first PR would be missed but I trust your judgment on this :)

wouldd commented 7 months ago

@Startrekzky @nicolavolpini @klesh I assume this is just the easiest thing to do to scope the date range to be the first deployment notification for a repo thorough to 'now' but the first deployment notification would cut off all the actual commits/prs for that deployment which happen before it.

I agree with Nicola that it's certainly better than the current situation and would soon become irrelevant in the stats. so if that's the easiest route then i'm all for it. though I do wonder if you could just look up the date of the commits linked to the first deployment and use those to capture the first one?

Startrekzky commented 7 months ago

though I do wonder if you could just look up the date of the commits linked to the first deployment and use those to capture the first one?

Hi @wouldd , do you mean that even though we can't link all PRs merged (e.g. PR-1, PR-2, PR-3. PR-3 is the one that linked to the merge_commit that triggers deploy-1) between the first-pushed deployment (deploy-1) to DevLake and the earlier deployment that hasn't been pushed to DevLake, at least we can link PR-3 to deploy-1?

wouldd commented 7 months ago

right I guess I meant that where you said 'The con is that the very first deployment will not be linked to any PRs.' I thought that might be avoidable - in my case the deployment webhook is still being told which commits are associated and those commits can be associated to the pr they went through etc.

It's still not entirely clear to me what I am losing/missing by just updating that dal filter in the change_leadtime_calculator in the getDeploymentscommit method to be: dal.Where("pm.project_name = ? AND cd.new_commit_sha = ? AND dc.RESULT = ?", projectName, mergeSha, devops.RESULT_SUCCESS),

instead of cd.commit_sha.

it certainly appears to me that the stats start to make sense when I run that way image

Startrekzky commented 7 months ago

I thought that might be avoidable - in my case the deployment webhook is still being told which commits are associated and those commits can be associated to the pr they went through etc.

@wouldd I don't think it's entirely avoidable, but I agree that your proposal could associate the first deployment (pushed to DevLake) with ONE of the PRs merged between the first deployment and its prior deployment.

We link deployments to PR via the medium 'newly-added-commits-by-d1', but we don't know all of these commits, we only know the merge commit sha and what you were trying to propose was to find the PR of the merge commit sha.

If your use case is to trigger a deployment every time a PR gets merged, it will work, because every deployment only gets one PR deployed. However, not all users use this practice. For those who do not, d1 will only be linked to part of the PRs, and this is the cons I said.

I hope it makes sense.

abeizn commented 7 months ago

Already effective in v1.0.0-beta6. @nicolavolpini @wouldd

nicolavolpini commented 7 months ago

Thanks! This looks much better.

One thing: while the metrics look correct, the DORA Details - Lead Time for Changes board still shows several PRs associated to the same deployment webhook:

image

How come? Thanks

Startrekzky commented 6 months ago

Hi @nicolavolpini , is this the first deployment webhook you posted via webhook? Did this deployment deploy all the commits merged by those associated PRs? If so, the logic might be correct.

Also, please mention me when commenting, otherwise I won't be notified.