chaoss / augur

Python library and web service for Open Source Software Health and Sustainability metrics & data collection. You can find our documentation and new contributor information easily here: https://oss-augur.readthedocs.io/en/main/ and learn more about Augur at our website https://augurlabs.io
MIT License

Numbers in /:owner/:repo/contributors not matching GitHub numbers #154

Closed OrkoHunter closed 6 years ago

OrkoHunter commented 6 years ago

Hi!

I am trying to use this API to get all the contributors of a project as well as the entire Twitter OSS. It is a very valuable metric!

However, the numbers in the results do not currently seem to be accurate. For example:

  1. http://twitter.augurlabs.io/api/unstable/twitter/finagle/contributors tells us that cacoco has 1 commit, but https://github.com/twitter/finagle/graphs/contributors says that cacoco has 76 commits.

  2. http://twitter.augurlabs.io/api/unstable/twitter/finatra/contributors tells us that cacoco has 868 commits, but https://github.com/twitter/finatra/graphs/contributors says that cacoco has 462 commits.

My question is: what does the commits figure in the result represent?
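The comparison above can be sketched as a small check. This is a minimal illustration, not part of Augur: `find_mismatches` is a hypothetical helper, and the counts are the ones quoted in this issue, typed in by hand rather than fetched from either service.

```python
def find_mismatches(augur_counts, github_counts):
    """Return {login: (augur, github)} for every login whose counts differ.

    A login missing from one side is reported with None for that side.
    """
    mismatches = {}
    for login in sorted(set(augur_counts) | set(github_counts)):
        augur = augur_counts.get(login)
        github = github_counts.get(login)
        if augur != github:
            mismatches[login] = (augur, github)
    return mismatches


# Counts quoted in this issue (Augur endpoint vs. GitHub contributors graph).
finagle = find_mismatches({"cacoco": 1}, {"cacoco": 76})
finatra = find_mismatches({"cacoco": 868}, {"cacoco": 462})

print(finagle)
print(finatra)
```

Note that the two examples disagree in opposite directions (lower for finagle, higher for finatra), so a single missing-data explanation may not cover both.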

ccarterlandis commented 6 years ago

I've been looking into this, and I'm not sure exactly what is going on here. I think it has something to do with GHTorrent, as opposed to us.

Looking at the database, for twitter/finagle (id 1372) and user cacoco (id 7875), running the following query

SELECT *
FROM commits
WHERE project_id = 1372
AND committer_id = 7875

should give us all of cacoco's commits to twitter/finagle. According to the GitHub API, there should be 76 of them. Instead, I get the following 2 results:

[Screenshot: query results, 2018-08-28 3:21:55 PM]

When I try to look at both of these commits in the project_commits table (which the /contributors endpoint is using), for the commit with id 500482921 I get data back, but for the commit with id 500482928 I get nothing.
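The check described above (which of a committer's rows in commits have no matching row in project_commits) can be sketched as a single LEFT JOIN. This is a toy reproduction, not the real GHTorrent database: it uses an in-memory SQLite database, only the columns needed for the join, and the two commit ids mentioned in this comment. The project_id stored in the project_commits row is an assumption, and `commits_missing_from_project_commits` is a hypothetical helper name.

```python
import sqlite3


def commits_missing_from_project_commits(conn, project_id, committer_id):
    """Return ids of the committer's commits that have no project_commits row."""
    rows = conn.execute(
        """
        SELECT c.id
        FROM commits AS c
        LEFT JOIN project_commits AS pc ON pc.commit_id = c.id
        WHERE c.project_id = ?
          AND c.committer_id = ?
          AND pc.commit_id IS NULL
        ORDER BY c.id
        """,
        (project_id, committer_id),
    ).fetchall()
    return [row[0] for row in rows]


# Toy schema: only the columns this check needs.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE commits (id INTEGER PRIMARY KEY, project_id INTEGER, committer_id INTEGER);
    CREATE TABLE project_commits (project_id INTEGER, commit_id INTEGER);
    """
)
# The two rows returned by the query above: twitter/finagle (1372), cacoco (7875).
conn.executemany(
    "INSERT INTO commits VALUES (?, ?, ?)",
    [(500482921, 1372, 7875), (500482928, 1372, 7875)],
)
# Only one of the two commits has a project_commits row (project_id assumed).
conn.execute("INSERT INTO project_commits VALUES (?, ?)", (1372, 500482921))

print(commits_missing_from_project_commits(conn, 1372, 7875))  # [500482928]
```

Running the same join against the real database would show how many commits the /contributors endpoint cannot see, separately from the commits GHTorrent never recorded at all.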

Beyond the fact that this table doesn't seem to have all the data it should, GHTorrent seems to be aware of only 2 of cacoco's 76 commits. @sgoggins any ideas?

sgoggins commented 6 years ago

Note that I just restarted the Twitter instance from today's dev branch.

ccarterlandis commented 6 years ago

I'm still seeing this issue where cacoco only has one commit. Could the GHTorrent database need updating?

sgoggins commented 6 years ago

@ccarterlandis : Try a hard refresh ... I am looking at this URL:

http://twitter.augurlabs.io/api/unstable/twitter/finagle/contributors

and I get a ton of data:

[{"name":"mosesn","user":132262,"commits":10.0,"issues":25.0,"commit_comments":7.0,"issue_comments":939.0,"pull_requests":0.0,"pull_request_comments":0.0,"total":981.0},{"name":"ICRILBRT","user":1256242,"commits":683.0,"issues":0.0,"commit_comments":0.0,"issue_comments":0.0,"pull_requests":0.0,"pull_request_comments":0.0,"total":683.0},{"name":"mariusaeriksen","user":64320,"commits":254.0,"issues":6.0,"commit_comments":0.0,"issue_comments":192.0,"pull_requests":0.0,"pull_request_comments":0.0,"total":452.0},{"name":"MOLLDYVS","user":11213753,"commits":283.0,"issues":0.0,"commit_comments":0.0,"issue_comments":0.0,"pull_requests":0.0,"pull_request_comments":0.0,"total":283.0},{"name":"QHPUWUNQ","user":14263062,"commits":216.0,"issues":0.0,"commit_comments":0.0,"issue..... (truncated for readability)
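For checking a single contributor without reading the raw JSON by eye, a lookup along these lines may help. It is a sketch under assumptions: `SAMPLE` contains only the first two complete records from the response above (the rest is truncated and is not reconstructed here), and `find_contributor` and `total_matches_sum` are hypothetical helpers, not Augur functions.

```python
import json

# First two complete records from the response above; the rest is truncated.
SAMPLE = (
    '[{"name":"mosesn","user":132262,"commits":10.0,"issues":25.0,'
    '"commit_comments":7.0,"issue_comments":939.0,"pull_requests":0.0,'
    '"pull_request_comments":0.0,"total":981.0},'
    '{"name":"ICRILBRT","user":1256242,"commits":683.0,"issues":0.0,'
    '"commit_comments":0.0,"issue_comments":0.0,"pull_requests":0.0,'
    '"pull_request_comments":0.0,"total":683.0}]'
)

COUNT_FIELDS = (
    "commits",
    "issues",
    "commit_comments",
    "issue_comments",
    "pull_requests",
    "pull_request_comments",
)


def find_contributor(records, name):
    """Return the first record whose "name" equals `name`, or None."""
    for record in records:
        if record.get("name") == name:
            return record
    return None


def total_matches_sum(record):
    """True if "total" equals the sum of the individual count fields."""
    return sum(record[field] for field in COUNT_FIELDS) == record["total"]


records = json.loads(SAMPLE)
print(find_contributor(records, "mosesn")["commits"])
print(all(total_matches_sum(record) for record in records))
```

In the two records shown, total equals the sum of the six count fields, so the commits value is the one to compare against GitHub's per-contributor commit count.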

ccarterlandis commented 6 years ago

@sgoggins that's the data I'm getting as well; however, the issue persists. As you can see in this screencap, cacoco still appears to have only one commit. Based on the GitHub API, I don't think that's correct: for this repository, they should have at least 76. Is this a limitation / shortcoming of my knowledge about what data lies in GHTorrent?

[Screenshot: 2018-08-28 3:51:21 PM]

sgoggins commented 6 years ago

@ccarterlandis : I suspect the two commit counts are measuring different things. We will work to get to the bottom of this shortly.

howderek commented 6 years ago

Hello @OrkoHunter,

I wanted to follow up on your feedback regarding inconsistent data between data sources. We will be implementing a new architecture in the coming months that will allow users to decide which data source they prefer when more than one can provide a metric. That way, users who care more about historical data (for instance, commits that were overwritten by a git push --force or a rebase) than about parity with the repository can use GHTorrent, while users who want data that matches the repositories on GitHub one-to-one can use the GitHub API.

Thank you again for your feedback!
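The source-preference idea described in this comment could look roughly like the following. This is purely an illustrative sketch, not Augur's planned or actual design: `resolve_metric`, the provider names, and the stub functions are all hypothetical, and the stubs return the finagle numbers quoted earlier in this thread instead of querying anything.

```python
def resolve_metric(providers, preferred_sources, *args):
    """Call the first preferred source that provides the metric.

    Returns (source_name, value) so the caller knows where the value came from.
    """
    for source in preferred_sources:
        provider = providers.get(source)
        if provider is not None:
            return source, provider(*args)
    raise LookupError("none of the preferred sources provides this metric")


def ghtorrent_commit_count(owner, repo, login):
    # Stub: the value the GHTorrent-backed endpoint reported in this thread.
    return 1


def github_api_commit_count(owner, repo, login):
    # Stub: the value shown on GitHub's contributors graph in this thread.
    return 76


# Hypothetical registry: one metric, two sources that can provide it.
COMMIT_COUNT_PROVIDERS = {
    "ghtorrent": ghtorrent_commit_count,
    "github_api": github_api_commit_count,
}

# A user who wants parity with GitHub lists the GitHub API first.
print(resolve_metric(
    COMMIT_COUNT_PROVIDERS, ["github_api", "ghtorrent"],
    "twitter", "finagle", "cacoco",
))
```

Returning the source name alongside the value keeps it visible which data source produced a number, which is the ambiguity this issue ran into.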