epam / OSCI

Open Source Contributor Index
https://opensourceindex.io/
GNU General Public License v3.0

Improve identification of committers' organizations #24

Open irynastr opened 4 years ago

irynastr commented 4 years ago

The current OSCI implementation uses the email domain of the committer to identify their organization. Many developers do not use their company email address on GitHub, or do not make their email address public. However, many of these people do include their organizational information in their GitHub user profiles.

We would like to improve the identification of committers' organizations using the data in their user profiles.

We have already run an experiment to do this, but with minimal success. It is described below. The basic matching algorithm works like this:

1) The domain is selected from the committer's email;
2) Each domain is compared, regardless of case, with the list of company domains (google.com, microsoft.com, etc.);
3) If no match is found, a regular expression analysis is performed to handle third-level and deeper domains.
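The three steps of the basic algorithm can be sketched roughly as follows. This is only an illustration, not the actual OSCI code: the `COMPANY_DOMAINS` sample, the `match_company` name, and the exact regular expression are assumptions made for the example.

```python
import re

# Hypothetical sample of the company domain list; the real list is much larger.
COMPANY_DOMAINS = {"google.com": "Google", "microsoft.com": "Microsoft"}

# Step 3 patterns: accept third-level and deeper domains such as
# "mail.corp.google.com", but not look-alikes such as "notgoogle.com".
SUBDOMAIN_PATTERNS = [
    (re.compile(r"(?:^|\.)" + re.escape(domain) + r"$"), company)
    for domain, company in COMPANY_DOMAINS.items()
]


def match_company(email):
    """Return the company for a commit email, or None when nothing matches."""
    # Step 1: select the domain part of the email address.
    _, separator, domain = email.rpartition("@")
    if not separator or not domain:
        return None
    domain = domain.strip().lower()  # step 2 compares regardless of case

    # Step 2: exact comparison with the list of company domains.
    if domain in COMPANY_DOMAINS:
        return COMPANY_DOMAINS[domain]

    # Step 3: regular-expression fallback for deeper domain levels.
    for pattern, company in SUBDOMAIN_PATTERNS:
        if pattern.search(domain):
            return company
    return None
```

With this sketch, `match_company("Dev@Google.COM")` and `match_company("dev@eu.mail.microsoft.com")` both resolve to a company, while a personal or noreply address returns `None`.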

If the basic algorithm produced no match, an extended algorithm was proposed:

1) The user's GitHub profile information is downloaded;
2) The website field is taken from the profile. If it is empty, go to step 3. Otherwise the basic algorithm is applied to the specified domain; if that yields no match, go to step 3;
3) The company field is taken and compared with the list of companies that we are processing. (Fuzzy matching algorithms were used: Levenshtein distance, Sørensen-Dice coefficient, etc.)
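For illustration, step 3 of the extended algorithm might look something like the sketch below, which combines a plain Levenshtein distance with a Sørensen-Dice coefficient over sets of character bigrams. The function names, the normalization, and the thresholds (`max_distance=1`, `min_dice=0.8`) are assumptions chosen for the example, not the values used in the experiment.

```python
def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def dice_coefficient(a, b):
    """Sørensen-Dice coefficient over sets of character bigrams, in [0, 1]."""
    bigrams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    bigrams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not bigrams_a or not bigrams_b:
        return 1.0 if a == b else 0.0
    return 2 * len(bigrams_a & bigrams_b) / (len(bigrams_a) + len(bigrams_b))


def normalize(name):
    # Profile company fields often look like "@Microsoft " or "microsoft".
    return name.strip().lstrip("@").lower()


def match_company_field(company_field, known_companies,
                        max_distance=1, min_dice=0.8):
    """Return the closest known company, or None if nothing is close enough."""
    candidate = normalize(company_field)
    if not candidate:
        return None
    best, best_score = None, 0.0
    for company in known_companies:
        target = normalize(company)
        score = dice_coefficient(candidate, target)
        close = levenshtein(candidate, target) <= max_distance
        if (close or score >= min_dice) and (best is None or score > best_score):
            best, best_score = company, score
    return best
```

Loosening `max_distance` or `min_dice` corresponds to the milder match rules mentioned in the results, at the cost of more false positives.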

Result of experiment: For only 38% of all the users examined, we managed to match a company from their profile. The remaining profiles did not have a clear match. Loosening the match rules adds only another 5%.

It is also worth noting that for users whose company we managed to match from their profile, that company was in all cases the same as the one derived from the email.

Finally, this method of identifying the company carries a large overhead. Implementing this approach would require downloading information for all users who made push events in 2020. As of June 2020 there are 5M of them, and loading their profiles would take about 42 days under the GitHub API usage limits. We would also have to load new profiles every day, and that download, in turn, may not fit into the daily usage limits.
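The 42-day figure is consistent with a quick back-of-the-envelope check. The calculation below assumes one profile request per user and a limit of 5,000 requests per hour for a single authenticated token; that limit is an assumption here, since the post does not state which limit was used.

```python
USERS = 5_000_000          # users with push events in 2020 (figure from the post)
REQUESTS_PER_HOUR = 5_000  # assumed: authenticated REST API limit, single token

hours = USERS / REQUESTS_PER_HOUR  # one profile request per user
days = hours / 24
print(f"{hours:.0f} hours, about {days:.1f} days")
```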

octogonz commented 2 years ago

Including your real email address in public Git commits is a great way to invite spam. 🙁

Instead, the best practice is to commit using an anonymized address like 4673363+octogonz@users.noreply.github.com. GitHub itself uses anonymized addresses when it generates commits such as a PR merge.
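As a rough illustration of why such addresses defeat domain matching but still identify the account, here is a sketch that pulls the numeric ID and login out of a noreply address. The `parse_noreply` helper and its regular expression are assumptions for this example; it handles the `ID+username` form shown above and the older `username`-only form.

```python
import re

# Matches "ID+username@users.noreply.github.com" and the older
# "username@users.noreply.github.com" form.
NOREPLY = re.compile(
    r"^(?:(?P<user_id>\d+)\+)?(?P<login>[A-Za-z0-9-]+)"
    r"@users\.noreply\.github\.com$",
    re.IGNORECASE,
)


def parse_noreply(email):
    """Return (user_id or None, login) for a noreply address, else None."""
    match = NOREPLY.match(email.strip())
    if match is None:
        return None
    user_id = match.group("user_id")
    return (int(user_id) if user_id else None, match.group("login"))
```

The login recovered this way could then be used to look up the profile, since the email domain itself says nothing about the employer.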

octogonz commented 2 years ago

> Finally, this method of identifying the company carries a large overhead. Implementing this approach would require downloading information for all users who made push events in 2020. As of June 2020 there are 5M of them, and loading their profiles would take about 42 days under the GitHub API usage limits.

This overhead seems worthwhile and a valuable service to the community. 👍👍 Perhaps GitHub would be willing to publish the aggregated data directly, if asked nicely.

> We would also have to load new profiles every day, and that download, in turn, may not fit into the daily usage limits.

Why? An annual ranking would be sufficient for most purposes. Certainly the ranking is not going to change substantially from day to day.

jeffwilcox commented 2 years ago

Regarding the volume of GitHub API calls - we use Conditional Requests and cache responses for so many things on GitHub. As long as someone's profile doesn't change, you aren't charged an API call, for example.
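A minimal sketch of that idea, using only the standard library: keep the `ETag` from the last response and send it back in `If-None-Match`; a `304 Not Modified` answer means the cached copy is still current. The `fetch_cached` helper, the in-memory `cache` dict, and the injectable `opener` parameter are assumptions made for illustration, not anyone's production code.

```python
import urllib.error
import urllib.request


def fetch_cached(url, cache, opener=urllib.request.urlopen, token=None):
    """Fetch url, reusing the cached body when the server answers 304."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    cached = cache.get(url)  # (etag, body) from an earlier call, if any
    if cached:
        headers["If-None-Match"] = cached[0]

    request = urllib.request.Request(url, headers=headers)
    try:
        with opener(request) as response:
            body = response.read()
            etag = response.headers.get("ETag")
    except urllib.error.HTTPError as error:
        # urlopen reports 304 as an HTTPError; treat it as a cache hit.
        if error.code == 304 and cached:
            return cached[1]
        raise

    if etag:
        cache[url] = (etag, body)
    return body
```

For example, repeated calls such as `fetch_cached("https://api.github.com/users/octogonz", cache, token=...)` would only transfer the profile again when it has actually changed.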

Might help make it more of a reality...

At our company, we ask employees to try and maintain a professional profile, and we internally allow them to choose to tell us who they are...

Hard problem to do at scale, for sure. Happy to help brainstorm.

anausa4eva commented 2 years ago

Hey Jeff,

Thanks for your comments! After much analysis, we concluded it was best to use the email address of the commit author to identify the organization to which they belong. Otherwise we lose almost 80% of contributions, because a lot of engineers don't note their companies in their profiles. However, we're really keen to explore your first idea about authenticating to both the corporate system and the GitHub account. Do you mean that employees make a commit and then authorize it in your corporate system?

Sealjay commented 1 year ago

I can confirm that we take the same approach at @Avanade - and whilst I love the OSCI tool, I worry that the contribution data could become inaccurate over time.

Taking a recent update to the companies list as an example - Release v2022.09.0 (#144) · epam/OSCI@cbf6b35 (github.com) - if we assume that James, Mohit, Guilherme and Justin work at Credera, Infosys, Farfetch, and eBay – then only Mohit’s contribution would have been associated with Infosys, as the others are contributing with personal email addresses or users.noreply emails.

We ask employees to use one GitHub account - and then complete the Organization field / be invited to the Avanade GitHub org. Employees log into both GitHub & our own corporate system.

That way, if someone moves to another organization, they can keep their private commit history, which we feel provides them with a good "CV" and we'd want to support people wherever they choose to go in their career.

This is a GitHub native feature too, if you use something like GitHub Enterprise - https://docs.github.com/en/enterprise-server@3.7/admin/user-management/managing-users-in-your-enterprise/viewing-people-in-your-enterprise