dogsheep / github-to-sqlite

Save data from GitHub to a SQLite database
https://github-to-sqlite.dogsheep.net/
Apache License 2.0
402 stars, 43 forks

Command for retrieving dependents for a repo #34

Closed: simonw closed this issue 4 years ago

simonw commented 4 years ago

I really, really want to start grabbing this data: https://github.com/simonw/datasette/network/dependents

simonw commented 4 years ago

Unfortunately it's not available through any GitHub API - I managed to figure out how to get dependencies, but I need dependents. https://github.com/simonw/til/blob/master/github/dependencies-graphql-api.md

simonw commented 4 years ago

It looks like the only option is to scrape them. I'll do that and then replace the scraper with an API call as soon as one becomes available.

simonw commented 4 years ago

Proposed command:

github-to-sqlite scrape-dependents github.db simonw/datasette

I'll pull full details of the scraped repos from the regular API. I'll also record when they were "first seen" by the command.
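
One way the "first seen" idea could work is sketched below, using only the standard library `sqlite3` module. The table name `dependents`, its columns, and the `record_dependents` helper are assumptions made for illustration, not the schema the tool actually uses. The point is that `INSERT OR IGNORE` against a primary key keeps the original timestamp when the same dependent is scraped again.

```python
import sqlite3
from datetime import datetime, timezone


def record_dependents(conn, repo, dependents, now=None):
    """Store each dependent once, keeping the timestamp of its first sighting.

    Hypothetical helper: table and column names are illustrative only.
    """
    now = now or datetime.now(timezone.utc).isoformat()
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS dependents (
            repo TEXT NOT NULL,
            dependent TEXT NOT NULL,
            first_seen_utc TEXT NOT NULL,
            PRIMARY KEY (repo, dependent)
        )
        """
    )
    # Rows that already exist hit the primary key and are skipped,
    # so first_seen_utc is never overwritten by a later run.
    conn.executemany(
        "INSERT OR IGNORE INTO dependents (repo, dependent, first_seen_utc) "
        "VALUES (?, ?, ?)",
        [(repo, dependent, now) for dependent in dependents],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
# First run sees one dependent; the second run sees it again plus a new one.
record_dependents(
    conn, "simonw/datasette", ["alice/project-one"],
    now="2020-05-01T00:00:00+00:00",
)
record_dependents(
    conn, "simonw/datasette", ["alice/project-one", "bob/project-two"],
    now="2020-05-02T00:00:00+00:00",
)
rows = conn.execute(
    "SELECT dependent, first_seen_utc FROM dependents ORDER BY dependent"
).fetchall()
```

After the second run, `alice/project-one` should still carry the timestamp from the first run, while `bob/project-two` gets the later one.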

simonw commented 4 years ago

I think this is the neatest scraping pattern:

[a["href"].lstrip("/") for a in soup.select("a[data-hovercard-type=repository]")]
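
For anyone who wants to try that pattern in isolation, here is a minimal sketch that runs it against a small inline HTML sample. The markup is a made-up stand-in for the dependents page (the real page is more complex and may change), and `extract_dependent_repos` is a hypothetical helper name. It assumes Beautiful Soup 4 is installed and uses the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup standing in for a dependents page.
SAMPLE_HTML = """
<div id="dependents">
  <div class="Box-row">
    <a data-hovercard-type="user" href="/alice">alice</a> /
    <a data-hovercard-type="repository" href="/alice/project-one">project-one</a>
  </div>
  <div class="Box-row">
    <a data-hovercard-type="organization" href="/example-org">example-org</a> /
    <a data-hovercard-type="repository" href="/example-org/project-two">project-two</a>
  </div>
</div>
"""


def extract_dependent_repos(html):
    """Return 'owner/name' strings for every repository link in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # Only links marked as repository hovercards are kept; the owner links
    # (user or organization hovercards) are skipped by the selector.
    return [
        a["href"].lstrip("/")
        for a in soup.select("a[data-hovercard-type=repository]")
    ]


repos = extract_dependent_repos(SAMPLE_HTML)
print(repos)
```

Selecting on the `data-hovercard-type` attribute rather than on CSS class names means the owner links next to each repository are filtered out without extra logic.
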
simonw commented 4 years ago

And to find the "Next" pagination link:

soup.select(".paginate-container")[0].find("a", text="Next")
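
A slightly more defensive version of that lookup is sketched below. It is an illustration under assumptions: the sample markup and the `find_next_url` helper are hypothetical, and it assumes the last page has no `<a>` element whose text is "Next". It uses `string=` in place of `text=`, which is the argument name that newer Beautiful Soup releases prefer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the pagination controls.
SAMPLE_HTML = """
<div class="paginate-container">
  <div class="BtnGroup">
    <a class="btn" href="https://example.com/dependents?before=abc">Previous</a>
    <a class="btn" href="https://example.com/dependents?after=xyz">Next</a>
  </div>
</div>
"""


def find_next_url(html):
    """Return the href of the 'Next' link, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    containers = soup.select(".paginate-container")
    if not containers:
        # No pagination block at all, e.g. a single page of results.
        return None
    next_link = containers[0].find("a", string="Next")
    if next_link is None:
        # Assumption: on the last page "Next" is not rendered as a link.
        return None
    return next_link.get("href")


print(find_next_url(SAMPLE_HTML))
```

A scraper could call this in a loop, fetching each returned URL until the function returns `None`.
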
simonw commented 4 years ago

Documentation: https://github.com/dogsheep/github-to-sqlite/blob/c9f48404481882e8b3af06f35e4801a80ac79ed6/README.md#scraping-dependents-for-a-repository