[feature request] Extend Istio dashboards to better measure process health

utako commented 5 years ago

I'm looking to add charts for Istio that track the following metrics so we can better assess project process health:

[x] Time it takes to triage an issue after it has been filed (assign someone)
[x] Time it takes to close an issue after it has been filed.
[x] Time it takes for the first comment on an issue (not the submitter)
[x] Time it takes between comments on an issue. (This is primarily used to track the rate of responses of P0 bugs)
[ ] Time it takes from an issue being fixed to an artifact being available for the customer which incorporates the fix (i.e. in a release).
[ ] Track how many issues are closed as fixed, duplicate, stale, or unknown.
[x] Time it takes to assign reviewers to a PR.
[x] Time it takes before the first human response to a PR being created.
[ ] The number of PRs experiencing a test flake (PRs with label: has-experienced-a-flake)
[ ] Time a PR is blocked waiting for feedback from the team.
[ ] Time a PR is blocked waiting for additional action from the submitter.
[ ] Reason why PR is closed (merged, stale, rejected)

As I've said, I'm happy to help out and contribute with a working DevStats deployment. Let me know what else I can provide to make this an easier process.

lukaszgryglicki commented 5 years ago

cc @dankohn

lukaszgryglicki commented 5 years ago

Currently, the easiest way of deploying DevStats is by using Kubernetes and Helm. This example uses an AWS EKS cluster, but we're now working on a bare-metal Kubernetes deployment; in that case all you would need is a cluster of at least 3 nodes, no matter what kind of nodes. If you want me to implement all those dashboards, I'll need detailed specs of what exactly every dashboard should show.

lukaszgryglicki commented 5 years ago

This metric is already implemented:

Time it takes before the first human response to a PR being created. I'm starting work on the remaining ones and will be asking for clarifications/examples during the process. Now working on: Time it takes to triage an issue after it has been filed (assign someone).

lukaszgryglicki commented 5 years ago

So for the first metric: some issues are self-assigned by the author in the same second they're created. My guess is that some users can do that, and we would rather measure issues that are assigned later (probably by someone else, a bot, or some other process), so I would skip those auto-assigned issues when calculating time percentiles. I'll leave the code that includes them commented out so you can decide to restore it.
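A minimal sketch of that filtering, not the actual DevStats query. The `created_at`/`assigned_at` field names, the nearest-rank percentile, and the choice of the 15th/50th/85th percentiles are all assumptions made for illustration.

```python
import math
from datetime import datetime, timedelta


def percentile(sorted_values, p):
    """Nearest-rank percentile of a sorted, non-empty list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def triage_percentiles(issues, skip_self_assigned=True):
    """Return (p15, p50, p85) hours from creation to first assignment.

    `issues` is a hypothetical list of dicts with `created_at` and
    `assigned_at` datetimes; it is not the DevStats schema.
    """
    hours = []
    for issue in issues:
        delay = issue["assigned_at"] - issue["created_at"]
        # Assumption: "same second" means a delay below one second.
        if skip_self_assigned and delay < timedelta(seconds=1):
            continue
        hours.append(delay.total_seconds() / 3600)
    if not hours:
        return None  # nothing left to measure after filtering
    hours.sort()
    return tuple(percentile(hours, p) for p in (15, 50, 85))
```

Passing `skip_self_assigned=False` plays the role of the commented-out variant: the same-second assignments are included again and pull the low percentiles toward zero.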

lukaszgryglicki commented 5 years ago

So, this is the first metric: Time it takes to triage an issue after it has been filed (assign someone). It includes documentation (at the bottom), and you can choose all repositories or any of them. You can choose the aggregating period: week, 7 Days moving average, month, etc. Let me know if that is OK for you.

lukaszgryglicki commented 5 years ago

Now WIP on this: Time it takes to close an issue after it has been filed.

lukaszgryglicki commented 5 years ago

We already have a dashboard for Time it takes to close an issue after it has been filed. So I'll start Time it takes for the first comment on an issue (not the submitter).

lukaszgryglicki commented 5 years ago

Dashboard for Time it takes for the first comment on an issue (not the submitter) ready.

lukaszgryglicki commented 5 years ago

Now WIP on a slightly more complex one: Time it takes between comments on an issue.

lukaszgryglicki commented 5 years ago

Time it takes between comments on an issue done.
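As a rough illustration of what this metric measures (a sketch only, not the dashboard's actual query): given the comment timestamps of a single issue, the gaps between consecutive comments can be computed as below. The function name and input shape are hypothetical.

```python
from datetime import datetime


def comment_gaps_hours(comment_times):
    """Hours between each pair of consecutive comments on one issue.

    `comment_times` is a hypothetical list of datetimes, in any order.
    """
    ordered = sorted(comment_times)
    return [
        (later - earlier).total_seconds() / 3600
        for earlier, later in zip(ordered, ordered[1:])
    ]
```

An issue with zero or one comment yields an empty list, so it contributes nothing to the percentiles. To track response rates on P0 bugs, the same function would be applied only to issues carrying the relevant priority label.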

Time it takes from an issue being fixed to an artifact being available for the customer which incorporates the fix (i.e. in a release) is not clear enough for me - I need more info on what exactly I should calculate.
Will now work on Time it takes to assign reviewers to a PR.

lukaszgryglicki commented 5 years ago

Time it takes to assign reviewers to a PR finished.

lukaszgryglicki commented 5 years ago

I'll research the remaining missing metrics, but ideally I would need a more detailed description of what exactly I should calculate for them (please provide details for all unfinished ones).

lukaszgryglicki commented 5 years ago

@utako :

Time it takes from an issue being fixed to an artifact being available for the customer which incorporates the fix (i.e. in a release). Needs detail how to detect that, time from issue close event to what exactly?
Track how many issues are closed as fixed, duplicate, stale, or unknown - take closed issues and then how exactly do we detect that an issue was fixed, duplicate, stale, or unknown?
The number of PRs experiencing a test flake (PRs with label: has-experienced-a-flake) - almost clear, just count such PRs having a given label has-experienced-a-flake in a given time (with repository group drop-downs)?
Time a PR is blocked waiting for feedback from the team. - time from PR open to what exactly? How can we detect that PR received "feedback from the team" or that it didn't receive feedback?
Time a PR is blocked waiting for additional action from the submitter - how to detect that PR is blocked? How to detect that it is blocked due to missing action from the submitter?
Reason why PR is closed (merged, stale, rejected) - merged - OK, but how to detect closed+stale or closed+rejected? Need that info to proceed.

lukaszgryglicki commented 5 years ago

I'm starting work on The number of PRs experiencing a test flake (PRs with a label: has-experienced-a-flake) - will calculate the number of PRs opened in a given aggregation period that ever had has-experienced-a-flake label.
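A minimal sketch of that count, assuming a monthly aggregation period. The `opened_on` and `labels_ever` fields are hypothetical stand-ins for whatever the real data source provides; this is not the DevStats SQL.

```python
from collections import Counter
from datetime import date


def labeled_prs_per_month(prs, label="has-experienced-a-flake"):
    """Count PRs per (year, month) of opening that ever carried `label`.

    `prs` is a hypothetical list of dicts with an `opened_on` date and a
    `labels_ever` set holding every label the PR has had at any time.
    """
    counts = Counter()
    for pr in prs:
        if label in pr["labels_ever"]:
            opened = pr["opened_on"]
            counts[(opened.year, opened.month)] += 1
    return counts
```

Because the label is a parameter, the same function works unchanged if a different label turns out to be the one actually in use.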

lukaszgryglicki commented 5 years ago

There is no such label as has-experienced-a-flake. This is the full list of all Istio labels found across the entire org; the only similar one is flaky-test, which I'll try to use instead:

 actions/merge-to-release-branch
 adapters
 api
 approved
 area/api management
 area/cli
 area/config
 area/enviroments
 area/environments
 area/networking
 area/networking/cni
 area/perf and scalability
 area/perf and scalibility
 area/policies and telemetry
 area/security
 area/security/aaa
 area/test and release
 area/user experience
 area/user question
 asks-to-engprod
 aspect
 attribute generation
 auth
 automated-release
 Bluemix
 broker
 bug
 build
 build-cop
 build & test infrastructure
 cla: human-approved
 cla-manual
 cla: no
 cla: yes
 cleanup
 closed/duplicate
 closed/wontfix
 close/not reproducible
 cloudfoundry
 Code Mauve
 code mauve/process
 config
 content
 core
 critical
 debuggability
 dependency-update
 dep-update
 dev productivity
 docs
 do-not-merge
 do-not-merge/hold
 do-not-merge/post-submit
 do-not-merge/release-note-label-needed
 do-not-merge/work-in-progress
 duplicate
 e2e & integration
 egress
 enhancement
 env/aws
 env/azure
 env/cloudfoundry
 env/gke
 environment-ansible
 env/knative
 Epic
 feature
 flag/untested
 flaky-test
 for 1.0
 GKE
 good first issue
 GUI (html/js/css)
 hackathon
 help wanted
 high-pri
 hold
 HTTP (L7) load balancing & routing
 Infra
 ingress
 ingress controller
 install
 internal-infra-bug
 invalid
 issue-moved-from-installer
 istio-auth
 istio-bug
 istioctl
 istio-networking
 istio_oncall
 kind/backport
 kind/blocking daily release
 kind/blocking release
 kind/bug
 kind/circleci
 kind/customer-issue
 kind/docs
 kind/fixit
 kind/important for release notes
 kind/need more info
 kind/prow
 kind/question
 kind/test-failure
 kind/testing gap
 kind/upgrade failure
 lgtm
 manager
 mesh-sync-agent
 minikube
 mixer
 mungegithub
 MVP
 needs-ok-to-test
 needs-rebase
 networking
 no stalebot
 ok-to-test
 oncall
 Openshift
 P0
 P1
 P2
 perf
 performance
 performance & scalability
 pilot
 platform
 platform adapters
 platform compatibility
 policies-and-telemetry
 #Post Submit/E2E are failing #Block PR merging #Only PRs related to fix this issue can be merged #Contact oncall for more info
 PostSubmit Failed/Contact Oncall
 priority/0
 priority/1
 priority/p0
 priority/p1
 production readiness
 prow
 proxy
 proxy agent
 proxy controller
 proxy injection & traffic capture
 question
 RBAC
 release-automation
 release-note
 release-note-action-required
 release-note-label-needed
 release-note-none
 retest-not-required-docs-only
 review/done
 routing
 routingrules
 security
 security-policy
 stale
 stale?
 steering-governance
 storage
 techdebt
 telemetry & logging
 test
 test-bug
 test-failure
 test-infra
 testing
 tracking
 translation-chinese
 usability
 ux
 UX (User)
 web site
 wg-config
 wg-environments
 wontfix

lukaszgryglicki commented 5 years ago

Also, it seems this label was only applied to a PR once; it is applied to issues instead. I'm not proceeding with this dashboard because this label was only applied to 8 issues across 3+ years, so all charts would just be flat. Please take a look at the query's SQL and possibly identify other labels or suggest a different approach. The remaining dashboards are not clear enough to start working on them. I'm waiting for feedback.

utako commented 5 years ago

Thanks for your patience @lukaszgryglicki and I appreciate you working on this.

You can choose the aggregating period: week, 7 Days moving average, month, etc. Let me know if that is OK for you.

This looks good.

Time it takes from an issue being fixed to an artifact being available for the customer which incorporates the fix (i.e. in a release). Needs detail how to detect that, time from issue close event to what exactly?

I expect this means the time from issue close to the date of a release that includes an associated PR. Does the GitHub API actually give you enough information to track this?
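One possible reading of this, as a sketch under a strong assumption: the first release published at or after the issue's close is treated as the one containing the fix. That ignores release branches and backports, and the function name and inputs are hypothetical.

```python
from bisect import bisect_left
from datetime import datetime


def time_to_release(closed_at, release_dates):
    """Time from issue close to the first release at or after it.

    Assumption: the first release published at or after `closed_at`
    contains the fix. Returns None if no such release exists yet.
    """
    ordered = sorted(release_dates)
    idx = bisect_left(ordered, closed_at)
    if idx == len(ordered):
        return None
    return ordered[idx] - closed_at
```

A stricter version would have to start from the merge time of the associated PR and check which release tag actually includes that commit, which needs more than issue and release timestamps.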

Track how many issues are closed as fixed, duplicate, stale, or unknown - take closed issues and then how exactly do we detect that an issue was fixed, duplicate, stale, or unknown?

There is a stale label, but I'm in the process of asking about the others.

The number of PRs experiencing a test flake (PRs with label: has-experienced-a-flake) - almost clear, just count such PRs having a given label has-experienced-a-flake in a given time (with repository group drop-downs)?

Yes. Hopefully we'll have added this label to more PRs in the future...

Time a PR is blocked waiting for feedback from the team. - time from PR open to what exactly? How can we detect that PR received "feedback from the team" or that it didn't receive feedback?

We have a list of codeowners, but I don't think this is something we can actually automate as part of the GitHub API... Would it be possible to note the time of the first comment by a repository member?
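A sketch of that last idea, assuming the set of repository members is already known from somewhere (for example a codeowners list). The function name, the `(timestamp, login)` comment tuples, and the `members` set are hypothetical.

```python
from datetime import datetime


def first_member_response(opened_at, author, comments, members):
    """Time from PR open to the first comment by a repository member.

    `comments` is a hypothetical list of (commented_at, login) tuples and
    `members` a set of member logins. Comments by the PR author are
    ignored even if the author is a member. Returns None when no member
    has commented yet.
    """
    for commented_at, login in sorted(comments):
        if login in members and login != author:
            return commented_at - opened_at
    return None
```

PRs for which this returns `None` are the ones still waiting for feedback from the team at the time of measurement.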

I've followed up in https://github.com/istio/istio/issues/13891 but will be waiting on @geeknoid to proceed.

lukaszgryglicki commented 5 years ago

OK, great so I'm waiting for the final feedback and then I'll get back to the remaining items - probably in September after my vacation (last week of August).

lukaszgryglicki commented 4 years ago

Closing due to inactivity, please reopen if needed.

cncf / devstats.archive

[feature request] Extend Istio dashboards to better measure process health #181