andymeneely / chromium-history

Scripts and data related to Chromium's history

Go over the MSR paper and add metrics that make sense #116

Closed: andymeneely closed this issue 10 years ago

andymeneely commented 10 years ago

Take a look at the given MSR paper and figure out what metrics make sense in the context of our project. What metrics should we use or improve upon?

dani5447 commented 10 years ago

Here are some metrics we could consider incorporating into our project:

Here are some metrics they had that might be more difficult for us to execute:

Additionally, I had another thought for a metric while reading: we could look at the percentage of a file's owners who are included in a code review. I also thought about looking at how much of a file is written by one person. (That might be related to the author ownership metric?)
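
A rough sketch of what that first idea could look like, assuming we can get the OWNERS list for a file and the set of participants on a review (all names and data shapes below are just illustrative):

```python
def owner_participation(file_owners, review_participants):
    """Fraction of a file's owners who took part in a given review.

    file_owners: owner usernames from the file's OWNERS data.
    review_participants: usernames who commented on or approved the review.
    """
    owners = set(file_owners)
    if not owners:
        return 0.0
    return len(owners & set(review_participants)) / len(owners)

# e.g. two of a file's three owners showed up in the review
print(owner_participation(["a", "b", "c"], ["b", "c", "outsider"]))  # ~0.67
```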

andymeneely commented 10 years ago

Be sure to also look at Table 3 of the paper because that shows what metrics they found not to be correlated and removed from analysis.

Code complexity - I published a paper on this years ago showing that complex code was more likely to be vulnerable. It wasn't really the direction of this research, but we can collect it as a control. That will have to happen this summer.

Change entropy - I like this. I don't think it's that hard to collect, and given that it's in another paper, it's worth reproducing.
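
For reference, a minimal sketch of one way to compute this, assuming the usual Shannon-entropy-over-files formulation (the data shape here is an assumption, not our schema):

```python
import math
from collections import Counter

def change_entropy(files_touched):
    """Normalized Shannon entropy of how changes were spread across files
    in a time window. Higher = changes scattered across many files.

    files_touched: list of file paths, one entry per file-change in the window.
    """
    counts = Counter(files_touched)
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Normalize by the maximum entropy possible for this many files.
    return entropy / math.log2(len(counts))

# e.g. three changes to a.cc and one to b.cc in the window
print(change_entropy(["a.cc", "a.cc", "a.cc", "b.cc"]))  # ~0.81
```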

Ownership - sure. The Major/Minor authorship metric is pretty common and easy to collect. Again, it's better as a control, but if we want to publish a paper on novel metrics, this one's not going to be it.
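
A quick sketch of the usual Major/Minor authorship calculation; the 5% threshold and the data shape are assumptions on my part:

```python
from collections import Counter

def ownership_metrics(commit_authors, threshold=0.05):
    """Major/Minor authorship for a single file.

    commit_authors: list of author names, one per commit touching the file.
    Returns (ownership, num_major, num_minor): ownership is the fraction of
    commits made by the top contributor; contributors at or above `threshold`
    of commits are "major", the rest are "minor".
    """
    counts = Counter(commit_authors)
    if not counts:
        return 0.0, 0, 0
    total = sum(counts.values())
    ownership = max(counts.values()) / total
    major = sum(1 for c in counts.values() if c / total >= threshold)
    return ownership, major, len(counts) - major

print(ownership_metrics(["alice"] * 18 + ["bob"] * 2 + ["carol"]))
# (0.857..., 2, 1) -> alice owns the file; carol is a minor contributor
```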

Non-participatory changes - yes, we should definitely do this one. We were already heading in this direction with participants/contributors. This one is much simpler, but it's worth collecting.
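
Something like this would probably cover it, assuming each review record gives us an owner and a participant list (hypothetical shape, not our actual schema):

```python
def non_participation_rate(reviews):
    """Fraction of reviews where nobody but the change author participated.

    reviews: list of dicts like {"owner": str, "participants": [str, ...]}.
    """
    def non_participatory(review):
        others = set(review["participants"]) - {review["owner"]}
        return len(others) == 0

    return sum(non_participatory(r) for r in reviews) / len(reviews)

reviews = [
    {"owner": "alice", "participants": ["alice", "bob"]},
    {"owner": "alice", "participants": ["alice"]},  # nobody else showed up
]
print(non_participation_rate(reviews))  # 0.5
```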

Code review coverage - I'm thinking this won't make much sense for us here. With Chromium, all changes need to be reviewed, so coverage would be effectively 100% across the whole system.

Amount of change between releases - We do have release data, and they release about once per month.

Proportion of accepted patches - This is interesting, and perhaps it's worth collecting the other 50% of the data just for this. If code is being rejected, then we know people are at least paying attention.

Proportion of hastily reviewed changes - I'm thinking we need to do this. I like the way this metric was phrased: "If it's over 200 LOC/hour, then it's hastily reviewed; otherwise it's not."
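
A sketch of that rule as stated; the handling of zero review time is my own assumption:

```python
def is_hastily_reviewed(loc_changed, review_hours, threshold=200):
    """True if the change was reviewed faster than `threshold` LOC per hour
    (200 LOC/hour, per the paper's phrasing)."""
    if review_hours <= 0:
        # Assumption on my part: an instant approval counts as hasty.
        return True
    return loc_changed / review_hours > threshold

print(is_hastily_reviewed(350, 2))  # False: 175 LOC/hour
print(is_hastily_reviewed(500, 1))  # True: 500 LOC/hour
```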

The OWNERS data was too hard to collect earlier this year, which is why it's on ice for now. I'm thinking of revisiting it this summer or next year, depending on how the next few weeks go.

andymeneely commented 10 years ago

I think we've got what we need out of this. It would be good to come back to it at some point, too.