dogsheep / github-to-sqlite

Save data from GitHub to a SQLite database
https://github-to-sqlite.dogsheep.net/
Apache License 2.0
402 stars 43 forks source link

github-to-sqlite should handle rate limits better #51

Open simonw opened 3 years ago

simonw commented 3 years ago

From #50 - right now it will crash with an error of it hits the rate limit. Since the rate limit information (including reset time) is available in the headers it could automatically sleep and try again instead.

simonw commented 3 years ago

This just caused a failure in deploying the demo: https://github.com/dogsheep/github-to-sqlite/runs/1471304407?check_suite_focus=true

  File "/opt/hostedtoolcache/Python/3.8.6/x64/bin/github-to-sqlite", line 33, in <module>
    sys.exit(load_entry_point('github-to-sqlite', 'console_scripts', 'github-to-sqlite')())
  File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/runner/work/github-to-sqlite/github-to-sqlite/github_to_sqlite/cli.py", line 142, in issue_comments
    for comment in utils.fetch_issue_comments(repo, token, issue):
  File "/home/runner/work/github-to-sqlite/github-to-sqlite/github_to_sqlite/utils.py", line 380, in fetch_issue_comments
    for comments in paginate(url, headers):
  File "/home/runner/work/github-to-sqlite/github-to-sqlite/github_to_sqlite/utils.py", line 472, in paginate
    raise GitHubError.from_response(response)
github_to_sqlite.utils.GitHubError: ('API rate limit exceeded for user ID 9599.', 403)
Error: Process completed with exit code 1.
daniel-butler commented 3 years ago

I don't have much experience with github's rate limiting. In my day job we use the tenacity library to handle http errors we get.

hydrosquall commented 2 years ago

I've been looking into how to to get this data out of Github (especially now there are "secondary rate limits" without an advertised allowance separate from the regular rate limits.

I've had decent success with the Airbyte github extractor (aside from one data quality issue https://github.com/airbytehq/airbyte/pull/15420 ). Airbyte splits data extraction between the GraphQL and REST endpoints depending on the resource type, but they're very comprehensive.

https://github.com/airbytehq/airbyte/blob/306a75ef5370728e0912cf52a1a898a530db0c90/airbyte-integrations/connectors/source-github/source_github/streams.py#L22-L122

Before this, I tried a few solutions in my own custom wrapper mentioned in this thread + its children https://github.com/PyGithub/PyGithub/issues/1989 , but they weren't working as expected.

chapmanjacobd commented 1 year ago

also, it says that authenticated requests have a much higher "rate limit". Unauthenticated requests only get 60 req/hour ?? seems more like a quota than a "rate limit" (although I guess that is semantic equivalence)

You would want to use x-ratelimit-reset

time.sleep(r['x-ratelimit-reset'] + 1 - time.time())

But a more complete solution would bring authenticated requests to the other subcommands. I'm surprised only github-to-sqlite get is using the --auth= CLI flag