BigLep opened this issue 4 months ago
I started looking at OSSInsights as one possibility for getting these metrics. They have a lot of different off-the-shelf widgets that we can pull, but it is a bit unclear to me whether they can give us exactly the metrics outlined in this issue:
- PR response time
- PRs opened vs. merged per period (e.g., day)
- PR queue size at a given instant, sampled at a constant period (e.g., day)
Example of their Pull Request Lifecycle Widget:
Example of their Pull Request Completion Time last 28 days:
This is a tough one to do while also driving the incentives we want. I'd prefer to prioritise response time over other metrics because it's the closest thing to measuring good OSS project maintenance behaviour: you want to be responsive to contributors. The sad fact is that not all PRs can be merged, and it's not unreasonable for PRs to exist in a state of limbo for some period either. But we ought to have a clearer picture of what outcome we want to drive, because that will decide the incentives we want to put in place. If it's "completion time" then we could hit that nail with a stale bot that closes issues that don't get commented on; but those can be pretty hostile to contributors.
@rjan90 and @rvagg: I updated the issue description to be more specific about the metrics I think we should collect.
- PR response time (i.e., the duration from when @lotus-maintainers are notified until when they first reply)
- non-draft PR resolution time (i.e., the duration from when a PR is opened as non-draft or marked as "ready-for-review" until when it is merged, closed, or downgraded-to-draft)
- non-draft PRs opened vs. merged/closed/downgraded per period (e.g., day)
- non-draft PR queue size at a given instant sampled at a constant period (e.g., day)
I'm intentionally just focusing on non-draft PRs. I think that is the most important area to be responsive and drive to conclusion (i.e., merge, close, or downgrade to draft). For the timing metrics, I made it clear where the timer starts and when it ends.
I think the metrics I have there incentivize the behavior we want to see, but I'd welcome being challenged here, particularly around any loopholes or bad incentives.
I expect some custom development will be needed to make this happen (and I'd personally be excited to do it, but I will define the requirements more before getting to that point). I'm also assuming we won't game the system with asinine bots that auto-close because maintainers were slow.
1 and 2 are good; I'm less sure about 3 and 4, but seeing some numbers might help in making that case. We could probably come up with some GraphQL to get at these numbers, I suppose.
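For illustration, a rough sketch (untested, and just an assumption about how we'd wire it up) of a GraphQL query that pulls the raw timestamps behind 1 and 2, posted to the GitHub GraphQL endpoint with plain Python requests and a GITHUB_TOKEN environment variable:

```python
import os
import requests

QUERY = """
query($owner: String!, $repo: String!) {
  repository(owner: $owner, name: $repo) {
    pullRequests(last: 50, states: [OPEN, MERGED, CLOSED]) {
      nodes {
        number
        isDraft
        createdAt   # approximation of "ready for review" for PRs opened as non-draft
        mergedAt
        closedAt
        author { login }
        comments(first: 10) { nodes { createdAt author { login } } }
        reviews(first: 10) { nodes { submittedAt author { login } } }
      }
    }
  }
}
"""

resp = requests.post(
    "https://api.github.com/graphql",
    json={"query": QUERY, "variables": {"owner": "filecoin-project", "repo": "lotus"}},
    headers={"Authorization": f"bearer {os.environ['GITHUB_TOKEN']}"},
)
resp.raise_for_status()
prs = resp.json()["data"]["repository"]["pullRequests"]["nodes"]
# From these: response time = timestamp of first maintainer comment/review minus createdAt;
# resolution time = mergedAt/closedAt minus createdAt (non-draft PRs only).
```

A fuller version would probably also need timelineItems (e.g. READY_FOR_REVIEW_EVENT and CONVERT_TO_DRAFT_EVENT) to handle PRs that start life as drafts.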
We’ll ideally use off-the-shelf options rather than create our own.
Yeah, I know, but just because it was fun I had a go at looking at what these metrics might look like. The first two are fairly easily achievable with the GitHub API, so I put together a little program to collect and report on the last month's worth of data (well, 1 month from 5 days ago): https://github.com/rvagg/repo-health-metrics
The other two are trickier because you need to collect data on all historical PRs and their events to collate. But anyway, this is interesting and maybe there's some integration we could build with it - even a slack bot to report once a week on how we're going. Current output shows this:
https://github.com/filecoin-project/lotus/pull/12389 created by @Stebalien on 2024-08-14T18:25:06Z has had no official response
https://github.com/filecoin-project/lotus/pull/12384 created by @rvagg on 2024-08-14T05:49:36Z has had no official response
Average official response time: 20.9 hours
Average resolution time: 34.1 hours
Where "official" is either a maintainer interacted or the repo was closed or merged by any actor. (Best term I could come up with, but still doesn't seem ideal).
Not suggesting we continue to build this out, but it does show that you can easily get some of this stuff without too much work, and it could be easily integrated into some kind of bot - Slack, GitHub Actions, a regular cron run to store in a DB for reporting on a dashboard somewhere. All without needing to hook into some existing service.
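As a concrete example of the Slack angle, a minimal sketch (assuming an incoming-webhook URL in a SLACK_WEBHOOK_URL environment variable, and with numbers already computed by the collector rather than hard-coded):

```python
import os
import requests

def post_weekly_report(avg_response_hours, avg_resolution_hours, unanswered_urls):
    """Post a short weekly PR-health summary to Slack via an incoming webhook."""
    lines = [
        "*Lotus PR health, last 7 days*",
        f"Average official response time: {avg_response_hours:.1f} hours",
        f"Average resolution time: {avg_resolution_hours:.1f} hours",
    ] + [f"No official response yet: {url}" for url in unanswered_urls]
    requests.post(os.environ["SLACK_WEBHOOK_URL"], json={"text": "\n".join(lines)})

# Numbers would come from the collector, not be hard-coded like this:
post_weekly_report(20.9, 34.1, ["https://github.com/filecoin-project/lotus/pull/12384"])
```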
The over-time and state-of-the-universe metrics become trickier, which is where a large service might come in handy.
@rvagg - this is awesome! I'll have to hold back to not dive in more here, but I'll put down some thoughts now so I can get it out of my head and park it for a little :)
Regarding "I'm less sure about 3 and 4": the reasons I think these are useful are:
- non-draft PRs opened vs. merged/closed/downgraded per period (e.g., day): is there more coming in than we can currently handle (review backlog is growing) or are we burning through the review backlog?
- non-draft PR queue size at a given instant sampled at a constant period (e.g., day): it keeps the size of the backlog front and center (a sampling sketch follows this list). Culturally I don't think it's good that we have 39 open non-draft reviews. I assume multiple of these can be closed or switched back to draft, but kind of like "CI must be green before merge", I would want the signal to be "there is a non-draft PR open! Before I start a new thread, I need to do whatever I can to get that 'ball' across the line to get a 'score' (or call it back if it's not ready or important)!"
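The sample itself looks cheap to collect; a minimal sketch (assuming a GITHUB_TOKEN environment variable) that could run on a daily schedule:

```python
import os
import requests

def open_nondraft_pr_count(owner="filecoin-project", repo="lotus"):
    """Sample the current non-draft PR queue size via the GitHub search API."""
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": f"repo:{owner}/{repo} is:pr is:open draft:false", "per_page": 1},
        headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
    )
    resp.raise_for_status()
    return resp.json()["total_count"]

# Run from cron or a scheduled GitHub Action and append (date, count) to a datastore.
print(open_nondraft_pr_count())
```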
Regarding "we’ll ideally use off-the-shelf options rather than create our own":
While I do hope this is possible, I can imagine it won't be. In the last few years I have yet to see something that answered the specific, tight, clear metric I wanted measured without a lot of caveats. I do expect there will be some level of customization here to get what we want, and it's great to see it's not going to be too hard to get the data. (Before picking up this workstream in earnest, I do plan to spend a couple of hours seeing if there is anything off the shelf that gives the specific metrics we want, but I'm not expecting to find anything.)
Sketches of what I would want if building something custom
I'd love to have a bot make a persistent comment in each PR that states the stats (similar to how code coverage tools or github-mgmt have a persistent comment that is updated with the current state of the PR). I'm imagining something like:
* Contributor type: maintainer|non-maintainer
* Instant when marked ready for review: YYYY-MM-DDTHH:MM:SSZ
* Instant when first maintainer responded: YYYY-MM-DDTHH:MM:SSZ (link_to_comment)
* Maintainer first response duration: PnDTnHnMnS
* Instant when first maintainer approved: YYYY-MM-DDTHH:MM:SSZ (link_to_approval)
* Maintainer first approval duration: PnDTnHnMnS
* Instant when merged|closed: YYYY-MM-DDTHH:MM:SSZ
* First maintainer approval to closure duration: PnDTnHnMnS
* Ready for review to closure duration: PnDTnHnMnS
These values would get populated as more events happen over the lifecycle of the PR.
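A minimal sketch of the comment-upsert mechanics (REST API via Python requests; the hidden HTML marker is a hypothetical convention so the bot can find and update its own comment instead of posting new ones, and the field values are placeholders):

```python
import os
import requests

API = "https://api.github.com/repos/filecoin-project/lotus"
HEADERS = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}
MARKER = "<!-- pr-metrics-bot -->"  # hypothetical marker identifying the bot's comment

def upsert_metrics_comment(pr_number, field_lines):
    """Create or update the single persistent metrics comment on a PR."""
    body = MARKER + "\n" + "\n".join(field_lines)
    comments = requests.get(f"{API}/issues/{pr_number}/comments", headers=HEADERS).json()
    existing = next((c for c in comments if MARKER in c["body"]), None)
    if existing:
        requests.patch(f"{API}/issues/comments/{existing['id']}",
                       headers=HEADERS, json={"body": body})
    else:
        requests.post(f"{API}/issues/{pr_number}/comments",
                      headers=HEADERS, json={"body": body})

# Called from a workflow whenever a relevant PR event fires (placeholder values):
upsert_metrics_comment(12345, [
    "* Contributor type: non-maintainer",
    "* Instant when marked ready for review: 2024-08-14T18:25:06Z",
    "* Maintainer first response duration: P0DT3H12M0S",
])
```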
I like the idea of having the data in a persistent GitHub PR comment (in addition to a shared datastore like a Google Sheet) because it puts the metrics top-of-mind and makes it easy to spot-check for accuracy issues as part of the normal flow (e.g., "what? Why does this tool think it took us a week for a maintainer to respond when I responded on the first day?!").
In addition, to keep it simple, I'd imagine persisting this to a shared DB (maybe just a Google Sheet for now) with one row per PR. Having it in a Google Sheet makes it super low-barrier to start making charts or pivot tables to answer questions.
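A sketch of the "one row per PR" write, assuming gspread with a Google service-account credential; the sheet name, credential path, and column layout are all placeholders:

```python
import gspread

# Assumes a service account that has been shared on the target sheet (placeholders).
gc = gspread.service_account(filename="service-account.json")
ws = gc.open("lotus-pr-metrics").sheet1

# One row per PR: append when the PR closes, or find-and-update as events arrive.
ws.append_row([
    12345,                    # PR number (placeholder)
    "non-maintainer",         # contributor type
    "2024-08-14T18:25:06Z",   # instant marked ready for review
    "2024-08-14T21:37:06Z",   # instant of first maintainer response
    "PT3H12M",                # response duration (ISO 8601)
])
```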
I think it would also be ideal if CODEOWNERS for a repo was used as the source of truth for who is a maintainer, as that has other benefits (e.g., ensuring a CODEOWNER approves, ensuring CODEOWNERS get notified on PRs).
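Deriving the maintainer set from CODEOWNERS looks like a few lines; a rough sketch that ignores per-path rules and just collects every owner mentioned (the file path is an assumption):

```python
def maintainers_from_codeowners(path=".github/CODEOWNERS"):
    """Collect every @user or @org/team mentioned in CODEOWNERS as the maintainer set."""
    owners = set()
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()  # drop comments and blank lines
            if not line:
                continue
            # First token is the path pattern; the rest are owners like @user or @org/team.
            owners.update(tok.lstrip("@") for tok in line.split()[1:] if tok.startswith("@"))
    return owners

print(maintainers_from_codeowners())
```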
As we're discussing here, I'm excited to not try and boil the ocean, but to still get some public, meaningful metrics for us. This will be fun.
Done Criteria
Maintainers have public metrics they can look at to see:
- PR response time
- non-draft PR resolution time
- non-draft PRs opened vs. merged/closed/downgraded per period (e.g., day)
- non-draft PR queue size at a given instant sampled at a constant period (e.g., day)
All of these metrics will have dimensionality on whether the author was a maintainer or not so we can see how we treat contributors outside the team.
Pointers to these metrics should be in the Lotus readme or docs.
Why Important
In 2024Q2, maintainers “felt” they did a better job with PR responsiveness, but let's have the metrics in place so we can “know” whether it’s a problem and whether we improve or regress.
User/Customer
Maintainers
Notes