andymeneely / chromium-history

Scripts and data related to Chromium's history

Metric: Collaborator familiarity #107

Closed by andymeneely 10 years ago

andymeneely commented 10 years ago

Consider this scenario:

As of this code review we have Danielle, Chris, and Brian. I have worked with:

Ideas:

Need to get an idea of the spread: maximum entropy? std deviation? Max-Min?
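For illustration, here's a quick sketch of what those spread measures could look like over a set of pairwise familiarity counts (the numbers are from the three-person example later in this thread; this is just a sketch, not committed code):

```ruby
# Illustrative sketch: a few ways to measure the spread of pairwise
# familiarity counts for one code review.
counts = [15, 10, 2]

range   = counts.max - counts.min              # => 13
mean    = counts.inject(:+).to_f / counts.size # => 9.0
std_dev = Math.sqrt(counts.map { |c| (c - mean)**2 }.inject(:+) / counts.size)

# Normalized entropy: 1.0 means familiarity is spread evenly across pairs;
# values near 0 mean it is concentrated in one pair.
total   = counts.inject(:+).to_f
probs   = counts.map { |c| c / total }.reject(&:zero?)
entropy = -probs.map { |p| p * Math.log(p) }.inject(:+) / Math.log(counts.size)
```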

Future work question: do we take into account the nature of those collaborations? LGTMs? Owners? Contributors? Participants?

Make this into a method of CodeReview. Aggregate it in rake run:stats.

smt9020 commented 10 years ago

Just trying to gather my thoughts on this issue:

andymeneely commented 10 years ago

Yes, all different methods.

Yes, dates are definitely a part of this. Use the "created" field in CodeReview for both sides of that query. For example, suppose Bob did code reviews with Jane only in April, May, and June of 2011. Then for the April review they would have a count of 0 (they had never worked together before), for the May review the count would be 1, and for the June review it would be 2. Those counts would then go into min, max, total, and avg.
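A sketch of the date-sensitive count I have in mind, using the Bob/Jane example (the participants association and dev_name column are assumptions, not necessarily our actual schema):

```ruby
# Sketch: how many code reviews dev_a and dev_b were both on strictly
# before a given date. Assumes CodeReview has a "created" column and a
# participants association with a dev_name column (names are guesses).
def prior_co_reviews(dev_a, dev_b, before_date)
  CodeReview.joins(:participants)
            .where(participants: { dev_name: dev_a })
            .where('code_reviews.created < ?', before_date)
            .select { |cr| cr.participants.any? { |p| p.dev_name == dev_b } }
            .size
end

# For Bob and Jane's June 2011 review this returns 2:
# the April and May reviews they shared.
```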

For the third one, consider this example:

Chris initiates a code review and asks Danielle and Brian to join. He has prior experience with both of them.
Brian <--15--> Chris <--10--> Danielle
Total familiarity for this code review: 25
Max fam for this code review: 15
Min fam for this code review: 10
Avg fam for this code review: 12.5
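In code, the owner-centered aggregation could be as simple as this (a sketch; fams would be one prior co-review count per owner-reviewer pair):

```ruby
# Sketch: owner-centered familiarity stats for one code review.
def familiarity_stats(fams)
  total = fams.inject(:+)
  { total: total, max: fams.max, min: fams.min, avg: total.to_f / fams.size }
end

familiarity_stats([15, 10])
# => {:total=>25, :max=>15, :min=>10, :avg=>12.5}
```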

Now, one thing I am not taking into account is the cross-working of the reviewers; I'm just centering this on Chris. Who knows how many times Danielle and Brian have worked together directly? Should we take that into account too? How?

Note that I'm defining "worked with" as "co-reviewed", by the way. We could do contributors and participants as well, but let's get these metrics nailed down first.

andymeneely commented 10 years ago

By the way, we also need to think about how we aggregate this over a file. I'd be interested to see how Familiarity and Objectivity increase and decrease over time. Clearly, familiarity should increase if people keep reviewing the same work. But, I'm interested in seeing how often new people are integrated as well.

smt9020 commented 10 years ago

Most of my confusion, I think, stems from the fact that it would be centered on Chris. How would we choose which developer to center it on? I propose this alternative:

Prior Experience:
Brian <--15--> Chris <--10--> Danielle <--2--> Brian
Brian: 17
Chris:  25
Danielle:  12
Total familiarity for this code review: 27
Max fam for this code review: 15
Min fam for this code review: 2
Avg fam for this code review: 9

Now let's add in a fourth person:

Brian <--15--> Chris 
Brian <--2--> Danielle
Brian <--5--> Shannon
Chris <--10--> Danielle 
Chris <--0--> Shannon
Danielle <--7--> Shannon
Brian: 22
Chris: 25
Danielle: 19
Shannon: 12
Total Familiarity: 39
Max: 15
Min: 0
Avg: 9.75

These examples both make sense to me. Does this seem like a good approach?
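Roughly, in code (a sketch; pair_fam stands in for whatever prior co-review query we settle on):

```ruby
# Sketch: all-pairs familiarity over everyone on the review.
# pair_fam is assumed to return the prior co-review count for two devs.
def pairwise_familiarity_stats(people, pair_fam)
  pair_counts = people.combination(2).map { |a, b| pair_fam.call(a, b) }
  per_person  = people.map do |p|
    [p, (people - [p]).map { |o| pair_fam.call(p, o) }.inject(:+)]
  end
  { per_person: Hash[per_person],
    total: pair_counts.inject(:+),
    max:   pair_counts.max,
    min:   pair_counts.min,
    avg:   pair_counts.inject(:+).to_f / pair_counts.size }
end

# The four-person example above:
fams = { %w[Brian Chris]   => 15, %w[Brian Danielle]   => 2,
         %w[Brian Shannon] => 5,  %w[Chris Danielle]   => 10,
         %w[Chris Shannon] => 0,  %w[Danielle Shannon] => 7 }
pair_fam = ->(a, b) { fams[[a, b].sort] }
pairwise_familiarity_stats(%w[Brian Chris Danielle Shannon], pair_fam)
# => per_person: Brian 22, Chris 25, Danielle 19, Shannon 12;
#    total 39, max 15, min 0, avg 9.75 (matches the numbers above)
```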

andymeneely commented 10 years ago

Ok, so I'm thinking that implementing this is going to result in some slow queries. The current total_familiarity method takes over 60 seconds to complete for a single code review. There's no way we can use that in practice for analysis. So let's do this:

Add a field to the Participants table called reviews_with_owner. This field represents the number of prior reviews that a participant has had with the owner of the code review. Populate this field as part of a Consolidator so that we can take advantage of our indexes.

That way we can just compute the metric by doing sum/max/avg/whatever of already-queried numbers.
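For the consolidator pass, I'm picturing something along these lines (a sketch only; every model/column name except reviews_with_owner is a guess at our schema):

```ruby
# Sketch: one consolidation pass to fill participants.reviews_with_owner.
# Assumes CodeReview has owner and created columns, and Participant
# belongs_to :code_review and has dev_name -- all guesses except
# reviews_with_owner itself.
CodeReview.find_each do |review|
  review.participants.each do |part|
    count = Participant.joins(:code_review)
                       .where(dev_name: part.dev_name)
                       .where(code_reviews: { owner: review.owner })
                       .where('code_reviews.created < ?', review.created)
                       .count
    part.update_column(:reviews_with_owner, count)
  end
end
```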

We need a verify for this, don't forget!!

I'm thinking of doing something similar for #113 as well.

smt9020 commented 10 years ago

So I wrote this out to fill the reviews_with_owner column; please review. Here is what the reviews_with_owner metric represents:

At the time of this review, the participant has reviewed reviews_with_owner of the owner's reviews. NOTE: this means that if they were both reviewers on someone else's review, that doesn't count.

Is that what we were envisioning?

I'm a little unsure about how to do a verify for this, because in the development data there is no overlap (all the numbers are zero). Thoughts?

andymeneely commented 10 years ago

Yes. If you're non-participatory then you basically don't count in my view.

Code LGTM. Thanks!

As for the verify, let's add to the development data with some co-participants with a current review we have. I'll take a look at production and get back to you with some code review ids.

andymeneely commented 10 years ago

Ok, I think I have an example. For the token code review 10854242 (already in our data set), palmer@chromium.org had agl@chromium.org as a participant. In a prior code review, 9415040, palmer had agl as a participant.

So download 9415040 and add a verify based on that. You'll have to update the other counts in other verifies as well.
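The verify could then be as simple as this (a sketch; the issue lookup and failure style are guesses at our verify conventions, not the real harness):

```ruby
# Sketch of the verify, using the palmer/agl example above. Assumes the
# code review id is stored in an "issue" column (a guess at our schema).
review = CodeReview.find_by(issue: 10854242)
agl    = review.participants.find { |p| p.dev_name == 'agl@chromium.org' }

# agl reviewed palmer's earlier review 9415040, so by 10854242 his
# reviews_with_owner should be at least 1.
raise 'reviews_with_owner verify failed' unless agl.reviews_with_owner >= 1
```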

smt9020 commented 10 years ago

To-Dos (for me)