andymeneely / chromium-history

Scripts and data related to Chromium's history

Do manual investigation of non-central sheriffs and central non-sheriffs #244

Closed andymeneely closed 8 years ago

andymeneely commented 8 years ago

Let's try to understand why these are not correlated, so let's do some manual investigation on developers who are non-central and sheriffs, and developers who are central but not sheriffs. What kind of work do they do? What subsystems do they work on? Has their role changed over time? How active are they?

kbaumzie commented 8 years ago

Here are some of my previous findings on an interesting developer who is very central, has owned a lot of issues, has performed code reviews, and actively commits to other developers' code, but was not a sheriff initially.

image

kbaumzie commented 8 years ago

Through my research, I have found little to no correlation between who is a sheriff and how much they contribute to issues. Many developers are very active in owning, committing, and reviewing code that is not their own and yet are not sheriffs. Conversely, many developers show little activity and still hold many sheriff hours. My new questions:

The only explanation I have for these findings is that sheriffs may be less central because they put more of their effort into code reviews or into overseeing other developers' activity to catch bugs. Otherwise, I have found that many developers simply take the initiative to be very active on issues and code and choose not to be sheriffs.

kbaumzie commented 8 years ago

When finding developers who are central and not sheriffs:

select * from developer_snapshots order by closeness desc;

Once a developer catches my attention because they seem central but are not a sheriff (or vice versa), I run these queries:

select * from participants where dev_id = (#);

select distinct on (issue) dev_id, owner_id, issue, renew_date;

tesseradecades commented 8 years ago

When looking for interesting developers: a developer who is not a sheriff, has low betweenness, and has high closeness.

select * from developer_snapshots where (has_sheriff_hrs=0 and betweenness < 0.01 and closeness > .5) order by closeness desc;

However, the betweenness threshold used in this query doesn't appear to be realistically low, so I am working on making the query a function of the mean values of betweenness and closeness.
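As a rough sketch of what a mean-based version of the query could look like, the snippet below replaces the fixed cut-offs with subqueries over avg(betweenness) and avg(closeness). It runs against an in-memory SQLite table purely so that it is self-contained. The rows are made up, only the columns named in this thread are modeled, and taking the mean over all rows of developer_snapshots is an assumption rather than the project's actual approach.

```python
import sqlite3

# Toy stand-in for developer_snapshots; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table developer_snapshots ("
    "dev_id integer, has_sheriff_hrs integer, betweenness real, closeness real)"
)
conn.executemany(
    "insert into developer_snapshots values (?, ?, ?, ?)",
    [
        (1, 0, 0.001, 0.60),  # non-sheriff, low betweenness, high closeness
        (2, 1, 0.020, 0.55),  # sheriff, so excluded
        (3, 0, 0.030, 0.40),  # betweenness above the mean
        (4, 0, 0.002, 0.35),  # closeness below the mean
    ],
)

# Thresholds come from the data itself instead of hard-coded 0.01 and 0.5.
query = """
select dev_id, betweenness, closeness
from developer_snapshots
where has_sheriff_hrs = 0
  and betweenness < (select avg(betweenness) from developer_snapshots)
  and closeness > (select avg(closeness) from developer_snapshots)
order by closeness desc
"""
interesting = conn.execute(query).fetchall()
for dev_id, betweenness, closeness in interesting:
    print(dev_id, betweenness, closeness)
```

The same where clause should carry over to the real database unchanged, since scalar subqueries with avg() are standard SQL. Whether the mean or the median is the better cut-off depends on how skewed the centrality values are.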

Developers 306 and 221 were originally flagged as interesting by the above query. However, further investigation, combined with the discovery that the betweenness threshold was unrealistic, ruled them out.

Then, once an interesting developer has been found, it helps to analyze their commit history:

select c.commit_hash, cf.filepath from commits c inner join commit_filepaths cf on (c.commit_hash=cf.commit_hash) where author_id = (#);

For example, it was interesting that developer 503 had what appeared to be high closeness and betweenness yet no sheriff hours (in spite of the values used in the original query). After examining their past commits, it seemed that they worked on a large number of subsystems starting partway through the project, and it was concluded that developer 503 may have been a somewhat senior developer prior to the project's launch.

kbaumzie commented 8 years ago

Question:

Are our metrics recorded so that if a developer owns a file (i.e. own_count goes up), they also gain participation on that file (i.e. degree)?

select * from developer_snapshots where (own_count < 20 and degree > 100) order by degree desc;

I'm looking at a few developers that have a very high degree and low ownership on files and vice versa. I'm wondering if a developer could own a file but not have any participation correlated with that file. My assumptions are that if a developer owns a file, they have a high degree and if a developer has a high degree, they might/should have the ability to own files. Let me know if there is a flaw in my logic.

kbaumzie commented 8 years ago

Investigation:

I decided to take a closer look at the other attributes we calculate per developer. In particular, I wanted to see how the degree and ownership metrics correlate with closeness, betweenness, and sheriff hours. I looked at developers with a low ownership count and a high degree who appear to be central (i.e. closeness around or above 0.5). The following developers are ones that I found interesting for this metric correlation:

To collect this information, I have been modifying Nathan's query above.

To understand the base metrics of the developer:

SELECT * FROM developer_snapshots WHERE dev_id = (#);

To see which filepaths the developer has contributed to:

SELECT c.commit_hash, cf.filepath FROM commits c INNER JOIN commit_filepaths cf ON (c.commit_hash=cf.commit_hash) WHERE author_id = (#);

To see how many filepaths/subsystems the developer was contributing to (DISTINCT is needed here, otherwise both counts just return the number of joined rows):

SELECT COUNT(DISTINCT c.commit_hash), COUNT(DISTINCT cf.filepath) FROM commits c INNER JOIN commit_filepaths cf ON (c.commit_hash=cf.commit_hash) WHERE author_id = (#);
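A small sketch of how the files and subsystems touched by one developer could be tallied from the commits/commit_filepaths join. It uses in-memory SQLite tables with invented rows so that it is self-contained. Treating the first path component as the "subsystem" is an assumption made only for illustration, and the author id 7 is hypothetical.

```python
import sqlite3

# Toy versions of the two tables used in the thread; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("create table commits (commit_hash text, author_id integer)")
conn.execute("create table commit_filepaths (commit_hash text, filepath text)")
conn.executemany(
    "insert into commits values (?, ?)",
    [("a1", 7), ("b2", 7), ("c3", 9)],
)
conn.executemany(
    "insert into commit_filepaths values (?, ?)",
    [
        ("a1", "net/http/http_cache.cc"),
        ("a1", "net/base/io_buffer.cc"),
        ("b2", "net/http/http_cache.cc"),  # same file touched again
        ("b2", "ui/views/widget.cc"),
        ("c3", "base/logging.cc"),         # different author
    ],
)

rows = conn.execute(
    """
    select cf.filepath
    from commits c
    inner join commit_filepaths cf on (c.commit_hash = cf.commit_hash)
    where c.author_id = ?
    """,
    (7,),
).fetchall()

# One row per (commit, file) pair, so repeated files appear more than once.
touched_files = {path for (path,) in rows}
# Assumption: the top-level directory approximates a subsystem.
subsystems = {path.split("/", 1)[0] for path in touched_files}

print(len(rows), len(touched_files), sorted(subsystems))
```

The gap between the number of joined rows and the number of distinct files is the thing to watch: a developer who edits the same few files repeatedly looks very different from one who touches many subsystems once each.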

kbaumzie commented 8 years ago

Documenting Further Investigation:

To follow up on this investigation, I would like to look into how their centrality is correlated with degree/participation on files and whether their centrality affects their number of vulnerabilities.

Possible further research questions:

andymeneely commented 8 years ago

Are developers with high participation/degree more likely to have missed vulnerabilities?

Good. Keep this and get an answer. Use vuln_misses for this. Take a look at using Spearman's rank correlation coefficient. Figure out what the resulting values mean, and then report them here. Search for "spearman" in our code base to see how we use it.
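For reference, here is a minimal pure-Python sketch of what Spearman's rank correlation computes: each variable is converted to ranks (ties share the average rank), and the Pearson correlation is then taken over those ranks. This is not how the project's code base computes it. The function names and the degree/vuln_misses values below are invented for illustration.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up per-developer values, one entry per developer.
degree = [5, 40, 120, 300]
vuln_misses = [0, 1, 1, 4]
rho = spearman(degree, vuln_misses)
print(round(rho, 3))
```

A rho near +1 would mean that developers with higher degree tend to have more missed vulnerabilities, a rho near -1 the opposite, and a rho near 0 no monotonic relationship. Because it works on ranks, it does not assume the relationship is linear or that the metrics are normally distributed.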

Are the active non-owners more likely to have vulnerabilities in the code they're working on?

Is time a contributing factor for developers contributing to code? Does a shorter time frame yield more missed vulnerabilities, and vice versa?

Metrics to think about: participation "bandwidth", i.e. how many code reviews a developer takes part in per month. Looming deadlines aren't really a thing in this project, but maybe a high number of features? Use the "enhancement" label on the bug_labels table.
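One possible way to express the "bandwidth" idea is a simple count of code reviews per developer per calendar month. The sketch below assumes the review participation data can be exported as (dev_id, review date) pairs. That shape, the function name, and the sample records are all hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical (dev_id, review date) pairs; real data would come from the
# project's review participation records.
reviews = [
    (7, date(2012, 3, 2)),
    (7, date(2012, 3, 19)),
    (7, date(2012, 4, 1)),
    (9, date(2012, 3, 5)),
]


def reviews_per_month(records):
    """Count reviews per (dev_id, year, month)."""
    return Counter((dev_id, day.year, day.month) for dev_id, day in records)


bandwidth = reviews_per_month(reviews)
for (dev_id, year, month), count in sorted(bandwidth.items()):
    print(dev_id, f"{year}-{month:02d}", count)
```

Months with unusually high counts for a developer would be candidates for "busy" periods to compare against missed vulnerabilities.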

"Busy developers", "Time pressures" are what we're capturing. Write up a new issue for this and we'll work on that.

Does a developer who is very central and has a high degree of participation seem to have more or fewer missed vulnerabilities?

I think this is similar to your first question.

andymeneely commented 8 years ago

For this one:

Are the active non-owners more likely to have vulnerabilities in the code they're working on?

Let me think about the statistics on this.

andymeneely commented 8 years ago

How about looking at the correlation between missed vulnerabilities and ownership counts?

kbaumzie commented 8 years ago

In conclusion to this issue:

Throughout my research on this topic, I have found that there is minimal correlation between who is a sheriff and what their contribution is to the issues that they may or may not own. In early portions of my research on this issue, I was able to analyze the correlations between developers with sheriff hours and centrality values. Many developers have been very active with owning, committing, and reviewing code that is not their own and, yet, are not sheriffs. Also, many developers claim low activity levels and still maintain many sheriff hours in total.

In later portions of my research, I was able to draw conclusions about the correlation (if any) between sheriff hours and participation degree, that is, how active a developer is on an issue, etc. By looking further into our data, I was able to understand the concepts that feed into our degree metric. Understanding that degree (participation value) is related to owning files, commits, being on code reviews, etc. made it easier for me to correlate the data with metrics other than sheriff hours. Concluding this portion of my research, I have found that the degree metric in our data is not a valid basis for determining whether a developer is central or holds sheriff hours.

This research has brought me to question a few more things about our developers and their degree. I will be conducting research to answer the question:

Are developers with high participation/degree more likely to have missed vulnerabilities?

I will be working with the vuln_missed metric and analyzing its relationship to participation/degree using Spearman's rank correlation coefficient. You can find more about this research topic here.