chaoss / grimoirelab-perceval

Send Sir Perceval on a quest to retrieve and gather data from software repositories.
http://perceval.readthedocs.io/
GNU General Public License v3.0
290 stars 177 forks

[Question] Is there a perceval API for obtaining number of commits and contributors to a GitHub repo? #767

Closed nhasabni closed 2 years ago

nhasabni commented 2 years ago

GitHub backend for perceval returns issues, repository metadata, and pull requests, but does not have a specific API to get commits and contributors count. Am I expected to use fetch API here? Thanks.

jgbarah commented 2 years ago

For getting metadata from the git repository in a GitHub repo, you can use the Perceval git backend. Or are you asking for something else?

nhasabni commented 2 years ago

I looked into Perceval's git backend. Will it fetch all the commits for a repository? I only need a count of the number of commits. GitHub's REST API for commits looks more efficient for this, in the sense that it fetches only 30 commits per page (the default) along with a link to the last page. That lets me determine the number of commits easily, without fetching all of them.
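A minimal sketch of that idea, assuming the request is made with per_page=1 so that the page number in the rel="last" link equals the number of items. The helper name last_page_from_link is hypothetical; it only parses a Link header string and makes no network call.

```python
import re
from urllib.parse import urlparse, parse_qs

# Matches one entry of a Link header: <url>; rel="name"
LINK_ENTRY = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def last_page_from_link(link_header):
    """Return the page number of the rel="last" link, or None if there is none."""
    for url, rel in LINK_ENTRY.findall(link_header or ""):
        if rel == "last":
            pages = parse_qs(urlparse(url).query).get("page")
            return int(pages[0]) if pages else None
    return None


# Example with a Link header in the format GitHub returns.
header = (
    '<https://api.github.com/repositories/47415120/issues?page=2>; rel="next", '
    '<https://api.github.com/repositories/47415120/issues?page=3>; rel="last"'
)
print(last_page_from_link(header))  # 3
```

When the header has no rel="last" entry the helper returns None, which means everything fit on one page and the count is simply the number of items in that single response.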

jgbarah commented 2 years ago

If you only need the number of commits for a set of repos, the GitHub API is much more efficient, as you say. But in that case, you can also consider other solutions such as https://ghtorrent.org/

If you want (and talking in terms of the MSR hackathon) you could also add a call to that GitHub API to the Perceval github metadata backend to get richer data, or write a new backend, which is not that difficult.

nhasabni commented 2 years ago

Yes, I am using the GitHub APIs now. I'm seeing some odd behavior when I use the GitHubClient from the github backend. Here is the code I am trying:

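from perceval.backends.core.github import GitHubClient

# my_token is a GitHub API token defined elsewhere
github_client = GitHubClient(owner="chaoss", repository="grimoirelab-perceval", tokens=[my_token])
resource_url = "<url_that_I_want>"  # placeholder for the URL being requested
response = github_client.fetch(url=resource_url)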

For slightly different values of resource_url, I am seeing different values in response.headers.

For resource_url = "https://api.github.com/repos/chaoss/grimoirelab-perceval/issues", I see that response.headers contains a Link field that points to the next and last pages. The value is 'Link': '<https://api.github.com/repositories/47415120/issues?page=2>; rel="next", <https://api.github.com/repositories/47415120/issues?page=3>; rel="last"'

But if I change resource_url slightly, for example to https://api.github.com/repos/chaoss/grimoirelab-perceval/issues?accept=application/vnd.github.v3+json&state=open&since=2021-06-04T23:10:45Z, I don't see a Link field in response.headers, although response.status_code is 200.

Can you guide me as to what could be going wrong? All the URLs work correctly in a browser.

The correct response is shown below:

>>> resource_url = "https://api.github.com/repos/chaoss/grimoirelab-perceval/issues"
>>> response = github_client.fetch(url=resource_url)
>>> print(response.headers)
{'Server': 'GitHub.com', 'Date': 'Wed, 01 Dec 2021 23:20:37 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Cache-Control': 'private, max-age=60, s-maxage=60', 'Vary': 'Accept, Authorization, Cookie, X-GitHub-OTP, Accept-Encoding, Accept, X-Requested-With', 'ETag': 'W/"d2adea4cd688d0de59460af12e086c10ba1f43941f5b990b0a9630e298b6af36"', 'X-OAuth-Scopes': 'admin:enterprise, admin:gpg_key, admin:org, admin:org_hook, admin:public_key, admin:repo_hook, delete:packages, delete_repo, gist, notifications, repo, user, workflow, write:discussion, write:packages', 'X-Accepted-OAuth-Scopes': 'repo', 'github-authentication-token-expiration': '2022-01-14 16:12:49 UTC', 'X-GitHub-Media-Type': 'github.v3; param=squirrel-girl-preview', 'Link': '<https://api.github.com/repositories/47415120/issues?page=2>; rel="next", <https://api.github.com/repositories/47415120/issues?page=3>; rel="last"', 'X-RateLimit-Limit': '5000', 'X-RateLimit-Remaining': '4932', 'X-RateLimit-Reset': '1638401725', 'X-RateLimit-Used': '68', 'X-RateLimit-Resource': 'core', 'Access-Control-Expose-Headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset', 'Access-Control-Allow-Origin': '*', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload', 'X-Frame-Options': 'deny', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '0', 'Referrer-Policy': 'origin-when-cross-origin, strict-origin-when-cross-origin', 'Content-Security-Policy': "default-src 'none'", 'Content-Encoding': 'gzip', 'X-GitHub-Request-Id': 'E780:7E1D:12752A6:31B1DE1:61A80345'}

jgbarah commented 2 years ago

Just to understand the context, what exactly are you trying to do? For example, to get issues since a certain date, you can use issues instead of fetch; check the detailed documentation for GitHubClient. There is also fetch_items for retrieving items with pagination. I'm not sure whether those would be useful to you...

zhquan commented 2 years ago

Hi @nhasabni

Your second URL, https://api.github.com/repos/chaoss/grimoirelab-perceval/issues?accept=application/vnd.github.v3+json&state=open&since=2021-06-04T23:10:45Z, returns no Link field because there are only 15 items and the GitHub API returns 30 items per page by default, so everything fits on a single page. If you try again with per_page=10, you will see the Link field in response.headers.
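The arithmetic behind this can be sketched as follows. The function names are hypothetical, and the rule that a Link field only shows up when the results span more than one page is taken from the behavior described in this thread.

```python
import math


def page_count(total_items, per_page=30):
    """Number of pages needed to return total_items at per_page items per page."""
    return max(1, math.ceil(total_items / per_page))


def expects_link_header(total_items, per_page=30):
    """Pagination links are only needed when there is more than one page."""
    return page_count(total_items, per_page) > 1


# 15 matching issues, as in this thread
print(page_count(15), expects_link_header(15))          # 1 False
print(page_count(15, 10), expects_link_header(15, 10))  # 2 True
```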

You can also pass since in the payload, as payload = {"since": "2021-06-04T23:10:45Z"}, and you will get the same result.

>>> resource_url = "https://api.github.com/repos/chaoss/grimoirelab-perceval/issues"
>>> payload = {"since": "2021-06-04T23:10:45Z", "per_page": 10}
>>> response = github_client.fetch(url=resource_url, payload=payload)
>>> print(response.headers)
{'Server': 'GitHub.com', 'Date': 'Wed, 15 Dec 2021 11:31:24 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Cache-Control': 'public, max-age=60, s-maxage=60', 'Vary': 'Accept, Accept-Encoding, Accept, X-Requested-With', 'ETag': 'W/"75b5808cf43d21a03628c8ae2c3d36909dce4defc97d8f5c8a9b3d4c6fd9984d"', 'X-GitHub-Media-Type': 'github.v3; param=squirrel-girl-preview', 'Link': '<https://api.github.com/repositories/47415120/issues?since=2021-06-04T23%3A10%3A45Z&per_page=10&page=2>; rel="next", <https://api.github.com/repositories/47415120/issues?since=2021-06-04T23%3A10%3A45Z&per_page=10&page=2>; rel="last"', 'Access-Control-Expose-Headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset', 'Access-Control-Allow-Origin': '*', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload', 'X-Frame-Options': 'deny', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '0', 'Referrer-Policy': 'origin-when-cross-origin, strict-origin-when-cross-origin', 'Content-Security-Policy': "default-src 'none'", 'Content-Encoding': 'gzip', 'X-RateLimit-Limit': '60', 'X-RateLimit-Remaining': '47', 'X-RateLimit-Reset': '1639569249', 'X-RateLimit-Resource': 'core', 'X-RateLimit-Used': '13', 'Accept-Ranges': 'bytes', 'Transfer-Encoding': 'chunked', 'X-GitHub-Request-Id': 'DAC6:5E16:247AEE:27AFA2:61B9D20B'}

As @jgbarah said, use fetch_items for retrieving items with pagination.
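To illustrate what retrieving items with pagination involves, here is a rough sketch that follows rel="next" links until none is left. It is not Perceval's fetch_items implementation; iter_pages and the fetch callable are hypothetical names, and fetch is assumed to return a (body, link_header) pair for a URL.

```python
import re

# Captures the URL of the rel="next" entry in a Link header
NEXT_LINK = re.compile(r'<([^>]+)>\s*;\s*rel="next"')


def iter_pages(fetch, url):
    """Yield each page body, following rel="next" links until there are no more.

    fetch is an assumed callable: fetch(url) -> (body, link_header).
    """
    while url:
        body, link_header = fetch(url)
        yield body
        match = NEXT_LINK.search(link_header or "")
        url = match.group(1) if match else None


# Stand-in for real HTTP calls: two pages, the second one has no Link header.
fake_pages = {
    "page1": (["issue-a", "issue-b"], '<page2>; rel="next", <page2>; rel="last"'),
    "page2": (["issue-c"], ""),
}
all_items = [item for page in iter_pages(fake_pages.get, "page1") for item in page]
print(all_items)  # ['issue-a', 'issue-b', 'issue-c']
```

With the client from this thread, fetch could presumably wrap github_client.fetch and return the parsed body together with response.headers.get("Link"), but that wiring is an assumption and is not shown here.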

I hope it helps you.

Best, Quan

jgbarah commented 2 years ago

Thanks, @zhquan !!

nhasabni commented 2 years ago

@zhquan @jgbarah Thanks for the response. I see now that the default number of items per page (30) was making the difference!

jgbarah commented 2 years ago

Great! Can we close the issue then?

nhasabni commented 2 years ago

Yes.

vchrombie commented 2 years ago

Thanks for being a part of the discussion, @nhasabni @jgbarah @zhquan. Closing this issue since it is resolved. Feel free to open a new one in case of any doubt.