dogsheep / github-to-sqlite

Save data from GitHub to a SQLite database
https://github-to-sqlite.dogsheep.net/
Apache License 2.0

github-to-sqlite

Demo

https://github-to-sqlite.dogsheep.net/ hosts a Datasette demo of a database created by running this tool against all of the repositories in the Dogsheep GitHub organization, plus the datasette and sqlite-utils repositories.

How to install

$ pip install github-to-sqlite

Authentication

Create a GitHub personal access token: https://github.com/settings/tokens

Run this command and paste in your new token:

$ github-to-sqlite auth

This will create a file called auth.json in your current directory containing the required value. To save the file at a different path or filename, use the --auth=myauth.json option.

As an alternative to using an auth.json file you can add your access token to an environment variable called GITHUB_TOKEN.
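If you want your own scripts to reuse the same credentials, the lookup described above can be sketched as follows. This is a minimal illustration, not the tool's own code: the function name load_github_token, the JSON key github_personal_token, and the file-first lookup order are all assumptions, so check them against your own auth.json before relying on them.

```python
import json
import os
from pathlib import Path


def load_github_token(auth_path="auth.json", environ=None):
    """Return a token from the auth file if it has one, else from GITHUB_TOKEN, else None."""
    environ = os.environ if environ is None else environ
    path = Path(auth_path)
    if path.is_file():
        data = json.loads(path.read_text())
        # Assumed key name; inspect your own auth.json to confirm it.
        token = data.get("github_personal_token")
        if token:
            return token
    # Fall back to the environment variable mentioned above.
    return environ.get("GITHUB_TOKEN")
```

Passing environ explicitly keeps the function easy to test without touching the real environment.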

Fetching issues for a repository

The issues command retrieves all of the issues belonging to a specified repository.

$ github-to-sqlite issues github.db simonw/datasette

If an auth.json file is present, the command uses the token from that file. It also works without authentication for public repositories, but GitHub applies strict IP-based rate limits to unauthenticated requests.

You can point to a different location of auth.json using -a:

$ github-to-sqlite issues github.db simonw/datasette -a /path/to/auth.json

You can use the --issue option one or more times to load specific issues:

$ github-to-sqlite issues github.db simonw/datasette --issue=1

Example: issues table

Fetching pull requests for a repository

While pull requests are a type of issue, you will get more information on them by fetching them separately, for example whether a pull request has been merged and when.

Like the issues command, the pull-requests command retrieves all of the pull requests belonging to a specified repository.

$ github-to-sqlite pull-requests github.db simonw/datasette

You can use the --pull-request option one or more times to load specific pull requests:

$ github-to-sqlite pull-requests github.db simonw/datasette --pull-request=81

Note that the merged_by column on the pull_requests table will only be populated for pull requests that are loaded using the --pull-request option - the GitHub API does not return this field for pull requests that are loaded in bulk.
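One way to fill in merged_by after a bulk load is to look for merged pull requests where the column is still empty and reload just those with --pull-request. The sketch below assumes the pull_requests table has number, merged_at and merged_by columns. The demo schema is a simplified stand-in, and the helper names are hypothetical. It does not filter by repository, so add that condition if your database holds more than one.

```python
import shlex
import sqlite3


def merged_prs_missing_merged_by(conn):
    """Return the numbers of merged pull requests whose merged_by is NULL."""
    rows = conn.execute(
        "SELECT number FROM pull_requests "
        "WHERE merged_at IS NOT NULL AND merged_by IS NULL "
        "ORDER BY number"
    )
    return [row[0] for row in rows]


def refetch_command(db_path, repo, numbers):
    """Build a command line that reloads the given pull requests one by one."""
    args = ["github-to-sqlite", "pull-requests", db_path, repo]
    args += [f"--pull-request={n}" for n in numbers]
    return shlex.join(args)


# Demo against an in-memory table with an assumed, simplified schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pull_requests (number INTEGER, merged_at TEXT, merged_by INTEGER)"
)
conn.executemany(
    "INSERT INTO pull_requests VALUES (?, ?, ?)",
    [
        (81, "2017-11-14T05:00:00Z", None),  # merged, merged_by not yet loaded
        (82, None, None),                    # never merged
        (83, "2017-11-15T05:00:00Z", 9599),  # merged, merged_by already set
    ],
)
print(refetch_command("github.db", "simonw/datasette", merged_prs_missing_merged_by(conn)))
# prints: github-to-sqlite pull-requests github.db simonw/datasette --pull-request=81
```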

You can load only pull requests in a certain state with the --state option:

$ github-to-sqlite pull-requests --state=open github.db simonw/datasette

Pull requests across an entire organization (or more than one) can be loaded with --org:

$ github-to-sqlite pull-requests --state=open --org=psf --org=python github.db

You can use a search query to find pull requests. Note that no more than 1000 will be loaded (this is a GitHub API limitation), and some data will be missing (base and head SHAs). When using searches, other filters are ignored; put all criteria into the search itself:

$ github-to-sqlite pull-requests --search='org:python defaultdict state:closed created:<2023-09-01' github.db
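Because a single search returns at most 1000 results, one possible workaround is to split a broad query into date windows and run the command once per window. The sketch below generates one search string per calendar month using GitHub's created:START..END range qualifier. The function name is hypothetical, and whether monthly windows are narrow enough depends on your data.

```python
from datetime import date, timedelta


def monthly_search_queries(base_query, start, end):
    """Yield one search string per calendar month between start and end, inclusive."""
    current = start
    while current <= end:
        # Work out the first day of the following month.
        if current.month == 12:
            next_month = date(current.year + 1, 1, 1)
        else:
            next_month = date(current.year, current.month + 1, 1)
        # Clamp the window so it never runs past the requested end date.
        window_end = min(next_month - timedelta(days=1), end)
        yield f"{base_query} created:{current.isoformat()}..{window_end.isoformat()}"
        current = next_month


for query in monthly_search_queries(
    "org:python defaultdict state:closed", date(2023, 6, 15), date(2023, 8, 31)
):
    print(query)
```

Each generated string can be passed as the value of --search in a separate run against the same database file.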

Example: pull_requests table

Fetching issue comments for a repository

The issue-comments command retrieves all of the comments on all of the issues in a repository.

It is recommended you run issues first, so that each imported comment can have a foreign key pointing to its issue.

$ github-to-sqlite issues github.db simonw/datasette
$ github-to-sqlite issue-comments github.db simonw/datasette

You can use the --issue option to only load comments for a specific issue within that repository, for example:

$ github-to-sqlite issue-comments github.db simonw/datasette --issue=1

Example: issue_comments table
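The foreign key from each comment to its issue makes joins straightforward once both commands have run. The following sketch counts comments per issue. It assumes issue_comments has an issue column referencing issues.id, and the demo schema is a simplified stand-in rather than the tables the tool actually creates.

```python
import sqlite3


def comment_counts_per_issue(conn):
    """Return (issue number, title, comment count) rows, most-commented first."""
    rows = conn.execute(
        """
        SELECT issues.number, issues.title, COUNT(issue_comments.id) AS comments
        FROM issues
        LEFT JOIN issue_comments ON issue_comments.issue = issues.id
        GROUP BY issues.id
        ORDER BY comments DESC, issues.number
        """
    )
    return rows.fetchall()


# Demo with an in-memory database and an assumed, simplified schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE issues (id INTEGER PRIMARY KEY, number INTEGER, title TEXT);
    CREATE TABLE issue_comments (
        id INTEGER PRIMARY KEY,
        issue INTEGER REFERENCES issues(id),
        body TEXT
    );
    INSERT INTO issues VALUES (1001, 1, 'First issue'), (1002, 2, 'Second issue');
    INSERT INTO issue_comments VALUES (1, 1001, 'a comment'), (2, 1001, 'another');
    """
)
print(comment_counts_per_issue(conn))
```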

Fetching commits for a repository

The commits command retrieves details of all of the commits for one or more repositories. It currently fetches the SHA, commit message and author and committer details; it does not retrieve the full commit body.

$ github-to-sqlite commits github.db simonw/datasette simonw/sqlite-utils

The command accepts one or more repositories.

By default it will stop as soon as it sees a commit that has previously been retrieved. You can force it to retrieve all commits (including those that have been previously inserted) using --all.
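The stopping rule can be pictured with a small sketch. This is an illustration of the behavior described above, not the tool's implementation. It assumes commits arrive newest first as dictionaries with a sha key, and the function and parameter names are hypothetical.

```python
def commits_to_save(commits_newest_first, known_shas, fetch_all=False):
    """Collect commits until one that is already stored is seen, unless fetch_all is set."""
    selected = []
    for commit in commits_newest_first:
        if not fetch_all and commit["sha"] in known_shas:
            # Everything older than this commit is assumed to be stored already.
            break
        selected.append(commit)
    return selected


history = [{"sha": "c3"}, {"sha": "b2"}, {"sha": "a1"}]
print(commits_to_save(history, known_shas={"b2", "a1"}))
print(commits_to_save(history, known_shas={"b2", "a1"}, fetch_all=True))
```

The first call returns only the newest commit, while the second returns all three, mirroring the effect of --all.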

Example: commits table

Fetching releases for a repository

The releases command retrieves the releases for one or more repositories.

$ github-to-sqlite releases github.db simonw/datasette simonw/sqlite-utils

The command accepts one or more repositories.

Example: releases table

Fetching tags for a repository

The tags command retrieves all of the tags for one or more repositories.

$ github-to-sqlite tags github.db simonw/datasette simonw/sqlite-utils

Example: tags table

Fetching contributors to a repository

The contributors command retrieves details of all of the contributors for one or more repositories.

$ github-to-sqlite contributors github.db simonw/datasette simonw/sqlite-utils

The command accepts one or more repositories. It populates a contributors table with foreign keys to repos and users, plus a contributions column recording the number of commits each contributor has made to that repository.
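As a sketch of how this data might be queried, the function below lists the most active contributors to one repository. It assumes each contributors row carries repo_id, user_id and a contributions count, that users has a login column, and that repos has a full_name column. These names and the demo schema are assumptions to verify against your own database.

```python
import sqlite3


def top_contributors(conn, repo_full_name, limit=5):
    """Return (login, contributions) pairs for a repository, highest count first."""
    rows = conn.execute(
        """
        SELECT users.login, contributors.contributions
        FROM contributors
        JOIN users ON users.id = contributors.user_id
        JOIN repos ON repos.id = contributors.repo_id
        WHERE repos.full_name = ?
        ORDER BY contributors.contributions DESC, users.login
        LIMIT ?
        """,
        (repo_full_name, limit),
    )
    return rows.fetchall()


# Demo with an in-memory database, an assumed schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE repos (id INTEGER PRIMARY KEY, full_name TEXT);
    CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT);
    CREATE TABLE contributors (repo_id INTEGER, user_id INTEGER, contributions INTEGER);
    INSERT INTO repos VALUES (1, 'simonw/datasette');
    INSERT INTO users VALUES (10, 'alice'), (11, 'bob');
    INSERT INTO contributors VALUES (1, 10, 5), (1, 11, 12);
    """
)
print(top_contributors(conn, "simonw/datasette"))
```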

Example: contributors table

Fetching repos belonging to a user or organization

The repos command fetches repos belonging to a user or organization.

Without any other arguments, this command will fetch all repos that the currently authenticated user owns, collaborates on or can access via one of their organizations:

$ github-to-sqlite repos github.db

To fetch repos belonging to a specific user or organization, provide their username as an argument:

$ github-to-sqlite repos github.db dogsheep # organization
$ github-to-sqlite repos github.db simonw # user

You can pass more than one username to fetch for multiple users or organizations at once:

$ github-to-sqlite repos github.db simonw dogsheep

Add the --readme option to save the README for the repo in a column called readme. Add --readme-html to save the HTML rendered version of the README into a column called readme_html.

Example: repos table

Fetching specific repositories

You can use -r with the repos command one or more times to fetch just specific repositories.

$ github-to-sqlite repos github.db -r simonw/datasette -r dogsheep/github-to-sqlite

Fetching repos that have been starred by a user

The starred command fetches the repos that have been starred by a user.

$ github-to-sqlite starred github.db simonw

If you are using an auth.json file you can omit the username to retrieve the starred repos for the authenticated user.

Example: stars table

Fetching users that have starred specific repos

The stargazers command fetches the users that have starred the specified repos.

$ github-to-sqlite stargazers github.db simonw/datasette dogsheep/github-to-sqlite

You can specify one or more repositories using owner/repo syntax.

Users fetched using this command will be inserted into the users table. Many-to-many records showing which repository they starred will be added to the stars table.

Fetching GitHub Actions workflows

The workflows command fetches the YAML workflow configurations from each repository's .github/workflows directory and parses them to populate workflows, jobs and steps tables.

$ github-to-sqlite workflows github.db simonw/datasette dogsheep/github-to-sqlite

You can specify one or more repositories using owner/repo syntax.
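To make the workflows, jobs and steps split concrete, here is a rough sketch of how one parsed workflow file could be flattened into rows for three tables. It starts from an already-parsed mapping, so no YAML library is needed. The row fields and the function name are illustrative assumptions, not the columns the tool actually writes.

```python
def flatten_workflow(path, workflow):
    """Split one parsed workflow mapping into a workflow row, job rows and step rows."""
    workflow_row = {"path": path, "name": workflow.get("name")}
    job_rows = []
    step_rows = []
    for job_name, job in workflow.get("jobs", {}).items():
        job_rows.append(
            {"workflow": path, "name": job_name, "runs_on": job.get("runs-on")}
        )
        # Number the steps so their order within the job is preserved.
        for seq, step in enumerate(job.get("steps", []), start=1):
            step_rows.append(
                {
                    "workflow": path,
                    "job": job_name,
                    "seq": seq,
                    "name": step.get("name"),
                    "uses": step.get("uses"),
                    "run": step.get("run"),
                }
            )
    return workflow_row, job_rows, step_rows


# A made-up workflow, shaped like the result of parsing a small YAML file.
parsed = {
    "name": "Test",
    "jobs": {
        "test": {
            "runs-on": "ubuntu-latest",
            "steps": [
                {"uses": "actions/checkout@v4"},
                {"name": "Run tests", "run": "pytest"},
            ],
        }
    },
}
workflow_row, job_rows, step_rows = flatten_workflow(".github/workflows/test.yml", parsed)
print(workflow_row, len(job_rows), len(step_rows))
```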

Example: workflows table, jobs table, steps table

Scraping dependents for a repository

The GitHub dependency graph can show other GitHub projects that depend on a specific repo, for example simonw/datasette/network/dependents.

This data is not yet available through the GitHub API. The scrape-dependents command scrapes those pages and uses the GitHub API to load full versions of the dependent repositories.

$ github-to-sqlite scrape-dependents github.db simonw/datasette

The command accepts one or more repositories.

Add -v for verbose output.

Example: dependents table

Fetching emojis

You can fetch a list of every emoji supported by GitHub using the emojis command:

$ github-to-sqlite emojis github.db

This will create a table called emojis with a primary key name and a url column.

If you add the --fetch (-f) option, the command will also fetch the binary content of the images and place it in an image column:

$ github-to-sqlite emojis emojis.db -f
[########----------------------------]  397/1799   22%  00:03:43

You can then use the datasette-render-images plugin to browse them visually.

Example: emojis table

Making authenticated API calls

The github-to-sqlite get command provides a convenient shortcut for making authenticated calls to the API. Once you have created your auth.json file (or set a GITHUB_TOKEN environment variable) you can use it like this:

$ github-to-sqlite get https://api.github.com/gists

This will make an authenticated call to the URL you provide and pretty-print the resulting JSON to the console.

You can omit the https://api.github.com/ prefix, for example:

$ github-to-sqlite get /gists

Many GitHub APIs are paginated using the HTTP Link header. You can follow this pagination and output a list of all of the resulting items using --paginate:

$ github-to-sqlite get /users/simonw/repos --paginate
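For readers unfamiliar with Link-header pagination, the general technique looks roughly like this. The sketch is independent of the tool's own code: fetch is a hypothetical callable that returns a page of items together with the raw Link header value, which lets the loop run here without any network access.

```python
import re

# Matches entries such as: <https://api.github.com/...?page=2>; rel="next"
_LINK_RE = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def parse_link_header(value):
    """Return a dict mapping each rel name (next, last, ...) to its URL."""
    return {rel: url for url, rel in _LINK_RE.findall(value or "")}


def paginate(fetch, url):
    """Yield items from every page, following rel="next" until it is absent."""
    while url:
        items, link_header = fetch(url)
        yield from items
        url = parse_link_header(link_header).get("next")


# Demo with two fake pages standing in for API responses.
pages = {
    "/repos?page=1": (["a", "b"], '</repos?page=2>; rel="next", </repos?page=2>; rel="last"'),
    "/repos?page=2": (["c"], '</repos?page=1>; rel="prev", </repos?page=1>; rel="first"'),
}
print(list(paginate(lambda u: pages[u], "/repos?page=1")))
# prints: ['a', 'b', 'c']
```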

You can output newline-delimited JSON for each item using --nl. This can be useful for streaming items into another tool.

$ github-to-sqlite get /users/simonw/repos --nl
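A downstream script can consume that stream one line at a time. The sketch below is a minimal reader for newline-delimited JSON. The sample lines are made up for illustration, and the function name is hypothetical.

```python
import json


def iter_ndjson(lines):
    """Yield one parsed JSON value per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up sample standing in for the command's output.
sample = [
    '{"full_name": "simonw/datasette", "stargazers_count": 1}\n',
    "\n",
    '{"full_name": "simonw/sqlite-utils", "stargazers_count": 2}\n',
]
for repo in iter_ndjson(sample):
    print(repo["full_name"])
```

In a real pipeline you would pass sys.stdin to iter_ndjson and pipe the command's output into the script.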