fastai / ghapi

A delightful and complete interface to GitHub's amazing API
https://ghapi.fast.ai/
Apache License 2.0

Pagination does not work for api.search.code #170

Closed mseeger closed 3 months ago

mseeger commented 5 months ago

When applied to search.code, the paged wrapper never stops requesting further pages, even after all results have been returned. The loop has to be stopped manually, by counting the number of results retrieved.

Here is some code using requests:

from ghapi.all import GhApi
from ghapi.page import paged
import os
import requests

token = os.environ["GITHUB_TOKEN"]

search_term = "SimulatorBackend"
query = f"{search_term} in:file repo:awslabs/syne-tune extension:py language:python"

# Get first page only (which should be the only one)
headers = {"Authorization": f"token {token}"}
for per_page in [30, 4]:
    print(f"\nper_page = {per_page}")
    response = requests.get(
        "https://api.github.com/search/code",
        params={'q': query, 'per_page': per_page},
        headers=headers,
    )
    json_data = response.json()
    print(f"total_count = {json_data['total_count']}, num_items = {len(json_data['items'])}")
    links = response.headers.get('Link')
    if links is None:
        print("'Link' not found in response.headers")
    else:
        print(f"links = {links}")

This gives:

per_page = 30
total_count = 7, num_items = 7
'Link' not found in response.headers

per_page = 4
total_count = 7, num_items = 4
links = <https://api.github.com/search/code?q=SimulatorBackend+in%3Afile+repo%3Aawslabs%2Fsyne-tune+extension%3Apy+language%3Apython&per_page=4&page=2>; rel="next", <https://api.github.com/search/code?q=SimulatorBackend+in%3Afile+repo%3Aawslabs%2Fsyne-tune+extension%3Apy+language%3Apython&per_page=4&page=2>; rel="last"

So pagination works as expected in the raw API: if per_page is smaller than the total number of results, the Link header points to a second page; otherwise there is no Link header at all. I think the paged wrapper should recognize that there are no further pages and stop.
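To make the stopping condition concrete, here is a minimal sketch of how the Link header could be checked for a "next" entry. The helper names `parse_link_header` and `has_next_page` are made up for illustration and are not part of ghapi, and the example URL is shortened. With requests, the already parsed `response.links` should give the same information.

```python
import re

def parse_link_header(link_header):
    """Map each rel value (e.g. 'next', 'last') to its URL.

    Hypothetical helper for illustration, not part of ghapi or requests.
    """
    links = {}
    if not link_header:
        # No Link header at all: everything fit on one page
        return links
    # Each entry looks like: <URL>; rel="name"
    for url, rel in re.findall(r'<([^>]+)>;\s*rel="([^"]+)"', link_header):
        links[rel] = url
    return links

def has_next_page(link_header):
    """True if the header advertises a further page."""
    return "next" in parse_link_header(link_header)

# Shortened example header, modeled on the per_page=4 output above
example = (
    '<https://api.github.com/search/code?q=x&per_page=4&page=2>; rel="next", '
    '<https://api.github.com/search/code?q=x&per_page=4&page=2>; rel="last"'
)
print(has_next_page(example))  # True
print(has_next_page(None))     # False
```

A wrapper that stops as soon as `has_next_page` returns False would make exactly one request in the per_page = 30 case.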

And here is the ghapi code:

api = GhApi(token=token)

# Iterate over pages. This should stop after first page
print("\nRetrieving all pages...")
results = paged(api.search.code, q=query)
for page in results:
    print(f"total_count = {page['total_count']}")
    items = page.get('items')
    if items is None:
        print("No 'items' in page")
    else:
        print(f"num_items = {len(page['items'])}")

This gives:

Retrieving all pages...
total_count = 7
num_items = 7
total_count = 7
num_items = 0
total_count = 7
num_items = 0
total_count = 7
num_items = 0
total_count = 7
num_items = 0
total_count = 7
num_items = 0
total_count = 7
num_items = 0
total_count = 7
num_items = 0
total_count = 7
num_items = 0

The loop keeps requesting empty pages until it fails with a rate limit error:

---------------------------------------------------------------------------
HTTP403ForbiddenError                     Traceback (most recent call last)
Cell In[18], line 6
      4 print("\nRetrieving all pages...")
      5 results = paged(api.search.code, q=query)
----> 6 for page in results:
      7     print(f"total_count = {page['total_count']}")
      8     items = page.get('items')

File ~/venvs/datasci/lib/python3.9/site-packages/ghapi/page.py:16, in paged(oper, per_page, max_pages, *args, **kwargs)
     14 def paged(oper, *args, per_page=30, max_pages=9999, **kwargs):
     15     "Convert operation `oper(*args,**kwargs)` into an iterator"
---> 16     yield from itertools.takewhile(noop, (oper(*args, per_page=per_page, page=i, **kwargs) for i in range(1,max_pages+1)))

File ~/venvs/datasci/lib/python3.9/site-packages/ghapi/page.py:16, in <genexpr>(.0)
     14 def paged(oper, *args, per_page=30, max_pages=9999, **kwargs):
     15     "Convert operation `oper(*args,**kwargs)` into an iterator"
---> 16     yield from itertools.takewhile(noop, (oper(*args, per_page=per_page, page=i, **kwargs) for i in range(1,max_pages+1)))

File ~/venvs/datasci/lib/python3.9/site-packages/ghapi/core.py:62, in _GhVerb.__call__(self, headers, *args, **kwargs)
     59 kwargs = {k:v for k,v in kwargs.items() if v is not None}
     60 route_p,query_p,data_p = [{p:kwargs[p] for p in o if p in kwargs}
     61                          for o in (self.route_ps,self.params,d)]
---> 62 return self.client(self.path, self.verb, headers=headers, route=route_p, query=query_p, data=data_p)

File ~/venvs/datasci/lib/python3.9/site-packages/ghapi/core.py:121, in GhApi.__call__(self, path, verb, headers, route, query, data)
    119 return_json = ('json' in headers['Accept'])
    120 debug = self.debug if self.debug else print_summary if os.getenv('GHAPI_DEBUG') else None
--> 121 res,self.recv_hdrs = urlsend(path, verb, headers=headers or None, debug=debug, return_headers=True,
    122                              route=route or None, query=query or None, data=data or None, return_json=return_json)
    123 if 'X-RateLimit-Remaining' in self.recv_hdrs:
    124     newlim = self.recv_hdrs['X-RateLimit-Remaining']

File ~/venvs/datasci/lib/python3.9/site-packages/fastcore/net.py:218, in urlsend(url, verb, headers, route, query, data, json_data, return_json, return_headers, debug)
    215 if route and route.get('archive_format', None):
    216     return urlread(req, decode=False, return_json=False, return_headers=return_headers)
--> 218 return urlread(req, return_json=return_json, return_headers=return_headers)

File ~/venvs/datasci/lib/python3.9/site-packages/fastcore/net.py:119, in urlread(url, data, headers, decode, return_json, return_headers, timeout, **kwargs)
    117     with urlopen(url, data=data, headers=headers, timeout=timeout, **kwargs) as u: res,hdrs = u.read(),u.headers
    118 except HTTPError as e:
--> 119     if 400 <= e.code < 500: raise ExceptionsHTTP[e.code](e.url, e.hdrs, e.fp, msg=e.msg) from None
    120     else: raise
    122 if decode: res = res.decode()

HTTP403ForbiddenError: HTTP Error 403: Forbidden
====Error Body====
{
  "message": "API rate limit exceeded for user ID 6508962. If you reach out to GitHub Support for help, please include the request ID CF6C:3E975B:3E53186:3F197BF:65AA3948.",
  "documentation_url": "https://docs.github.com/rest/overview/rate-limits-for-the-rest-api"
}
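
Judging from the traceback, paged wraps the calls in itertools.takewhile(noop, ...), so it presumably stops only when a page is falsy. A search response always carries total_count, so it is never empty, even when items is. As a workaround, here is a minimal sketch of a wrapper that stops by counting items. The name `paged_search` is hypothetical and not part of ghapi, and `fake_search` is a stand-in for api.search.code so that the example runs without network access.

```python
def paged_search(oper, *args, per_page=30, max_pages=100, **kwargs):
    """Yield search result pages until all items have been retrieved.

    Hypothetical wrapper, not part of ghapi. Assumes each page is
    dict-like with 'items' and 'total_count' keys, as in the output above.
    """
    retrieved = 0
    for page_num in range(1, max_pages + 1):
        page = oper(*args, per_page=per_page, page=page_num, **kwargs)
        items = page.get("items") or []
        if not items:
            # Empty page: nothing more to fetch
            return
        yield page
        retrieved += len(items)
        total = page.get("total_count")
        if total is not None and retrieved >= total:
            # All results seen: avoid requesting an empty extra page
            return

# Stand-in for api.search.code with 7 results (illustration only)
calls = []

def fake_search(q, per_page=30, page=1):
    calls.append(page)
    data = list(range(7))
    start = (page - 1) * per_page
    return {"total_count": len(data), "items": data[start:start + per_page]}

sizes = [len(p["items"]) for p in paged_search(fake_search, q="x", per_page=4)]
print(sizes)  # [4, 3]
print(calls)  # [1, 2]
```

With the real API this would be called as paged_search(api.search.code, q=query), and it makes only as many requests as there are non-empty pages.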

mseeger commented 5 months ago

BTW: Is this project still maintained? The last release was in 2022, and there are 49 open issues with no comments on them at all.

I really like the autocomplete, etc., but if reported bugs are not fixed, I'd look for alternatives.

jph00 commented 3 months ago

There are some details in the docs about how to get the count and use it. Feel free to send in a PR if you find it does not work as documented there, and be sure to at-mention me if you do.