andymeneely / chromium-history

Scripts and data related to Chromium's history

Go over the MSR paper and add metrics that make sense #116

Closed: andymeneely closed this issue 10 years ago

andymeneely commented 10 years ago

Take a look at the given MSR paper and figure out what metrics make sense in the context of our project. What metrics should we use or improve upon?

dani5447 commented 10 years ago

Here are some metrics we could consider incorporating into our project:

Here are some metrics they had that might be more difficult for us to execute:

Additionally, I had another thought for a metric while reading: we could look at the percentage of a file's owners who are included in a code review. I also thought about looking at how much of a file is written by one person. (That might be related to the author ownership metric?)
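
A rough sketch of what that first idea could look like, assuming we can get the OWNERS list for a file and the set of participants on a review (all names and data shapes below are just illustrative):

```python
def owner_participation(file_owners, review_participants):
    """Fraction of a file's owners who took part in a given review.

    file_owners: owner usernames from the file's OWNERS data.
    review_participants: usernames who commented on or approved the review.
    """
    owners = set(file_owners)
    if not owners:
        return 0.0
    return len(owners & set(review_participants)) / len(owners)

# e.g. two of a file's three owners showed up in the review
print(owner_participation(["a", "b", "c"], ["b", "c", "outsider"]))  # ~0.67
```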

andymeneely commented 10 years ago

Be sure to also look at Table 3 of the paper because that shows what metrics they found not to be correlated and removed from analysis.

Code complexity - I published a paper on this years ago showing that complex code was more likely to be vulnerable. It wasn't really the direction of this research, but we can collect it as a control. That will have to happen this summer.

Change entropy - I like this. I don't think it's that hard to collect, and given that it's in another paper, it's worth reproducing.
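
For reference, a minimal sketch of one way to compute this, assuming the usual Shannon-entropy-over-files formulation (the data shape here is an assumption, not our schema):

```python
import math
from collections import Counter

def change_entropy(files_touched):
    """Normalized Shannon entropy of how changes were spread across files
    in a time window. Higher = changes scattered across many files.

    files_touched: list of file paths, one entry per file-change in the window.
    """
    counts = Counter(files_touched)
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Normalize by the maximum entropy possible for this many files.
    return entropy / math.log2(len(counts))

# e.g. three changes to a.cc and one to b.cc in the window
print(change_entropy(["a.cc", "a.cc", "a.cc", "b.cc"]))  # ~0.81
```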

Ownership - sure. The Major/Minor authorship metric is pretty common and easy to collect. Again, it's better as a control, but if we want to publish a paper on novel metrics, this one's not going to be it.
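
A quick sketch of the usual Major/Minor authorship calculation; the 5% threshold and the data shape are assumptions on my part:

```python
from collections import Counter

def ownership_metrics(commit_authors, threshold=0.05):
    """Major/Minor authorship for a single file.

    commit_authors: list of author names, one per commit touching the file.
    Returns (ownership, num_major, num_minor): ownership is the fraction of
    commits made by the top contributor; contributors at or above `threshold`
    of commits are "major", the rest are "minor".
    """
    counts = Counter(commit_authors)
    if not counts:
        return 0.0, 0, 0
    total = sum(counts.values())
    ownership = max(counts.values()) / total
    major = sum(1 for c in counts.values() if c / total >= threshold)
    return ownership, major, len(counts) - major

print(ownership_metrics(["alice"] * 18 + ["bob"] * 2 + ["carol"]))
# (0.857..., 2, 1) -> alice owns the file; carol is a minor contributor
```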

Non-participatory changes - yes, we should definitely do this one. We were already heading in this direction with participants/contributors. This one is much simpler, but it's worth collecting.
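
Something like this would probably cover it, assuming each review record gives us an owner and a participant list (hypothetical shape, not our actual schema):

```python
def non_participation_rate(reviews):
    """Fraction of reviews where nobody but the change author participated.

    reviews: list of dicts like {"owner": str, "participants": [str, ...]}.
    """
    def non_participatory(review):
        others = set(review["participants"]) - {review["owner"]}
        return len(others) == 0

    return sum(non_participatory(r) for r in reviews) / len(reviews)

reviews = [
    {"owner": "alice", "participants": ["alice", "bob"]},
    {"owner": "alice", "participants": ["alice"]},  # nobody else showed up
]
print(non_participation_rate(reviews))  # 0.5
```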

Code review coverage - I'm thinking this won't make much sense for us here. With Chromium, all changes need to be reviewed, so coverage would be effectively 100% across the whole system.

Amount of change between releases - We do have release data, and they release about once per month.

Proportion of accepted patches - This is interesting, and perhaps it's worth collecting the other 50% of the data just for this. If code is being rejected, then we know people are at least paying attention.

Proportion of hastily reviewed changes - I'm thinking we need to do this. I like the way this metric was phrased: "If it's over 200 LOC/hour, then it's hastily reviewed; otherwise it's not."
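
A sketch of that rule as stated; the handling of zero review time is my own assumption:

```python
def is_hastily_reviewed(loc_changed, review_hours, threshold=200):
    """True if the change was reviewed faster than `threshold` LOC per hour
    (200 LOC/hour, per the paper's phrasing)."""
    if review_hours <= 0:
        # Assumption on my part: an instant approval counts as hasty.
        return True
    return loc_changed / review_hours > threshold

print(is_hastily_reviewed(350, 2))  # False: 175 LOC/hour
print(is_hastily_reviewed(500, 1))  # True: 500 LOC/hour
```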

The OWNERS data was too hard to collect earlier this year, which is why it's on ice for now. I'm thinking of revisiting it this summer or next year, depending on how the next few weeks go.

andymeneely commented 10 years ago

I think we've got what we need out of this. It would be good to come back to it at some point, too.