davidtorosyan / wikimit

a wikipedia-to-git converter
MIT License

Design doc #2

Open davidtorosyan opened 3 years ago

davidtorosyan commented 3 years ago

Design doc for wikimit

Intro

Wikimit is an in-progress Wikipedia-to-git converter. This document details the approach.

Goals

  1. Input: a Wikipedia article. Output: a GitHub repo organized under wikimit-hub
  2. Anonymous access (no account needed)
  3. Requesting a conversion should idempotently create or update a repo
  4. Low costs by using as-needed hosting and free storage

Architecture

Overview

(Diagram: Wikimit Design)

Sequence (a):

  1. User chooses a Wikipedia page
  2. User submits the URL on the wikimit site
  3. Site sends an AJAX request
  4. Request handler adds the URL to the queue and returns the expected GitHub URL
  5. Site polls GitHub and shows the link to the user when ready

Sequence (b):

  1. Sync agent pulls URL off queue
  2. Sync agent queries Wikipedia for revision history
  3. Sync agent pushes commits to GitHub
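
Step 2 of sequence (b) could be sketched as follows. This is a minimal sketch that only builds the request URL, assuming the MediaWiki Action API on English Wikipedia is used; the function name `revisions_query` and the default batch size are hypothetical and not taken from the wikimit code.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

# Assumption: English Wikipedia; other language editions use a different host.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def revisions_query(title: str, limit: int = 50,
                    cont: Optional[Dict[str, str]] = None) -> str:
    """Build a URL that asks for one batch of a page's revision history."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|comment|content",
        "rvslots": "main",
        "rvdir": "newer",  # oldest first, so commits can be made in order
        "rvlimit": str(limit),
    }
    if cont:
        # Pass back the "continue" object from the previous response
        # to resume where the last batch stopped.
        params.update(cont)
    return f"{API_ENDPOINT}?{urlencode(params)}"
```

Fetching oldest-first with a small `rvlimit` fits the batch-and-follow-up approach described under "Sync agent": each job requests one batch and stores the continuation values for the next job.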

wikimit.org

The wikimit site is a static webpage, hosted on GitHub Pages with a custom domain (wikimit.org). The page has a textbox and a submit button.

Clicking the submit button sends an AJAX request to the backend handler, which responds with a GitHub URL. The client-side JS then polls that URL and notifies the user of two milestones:

  1. When the repo is created
  2. When the repo is up to date with the latest revision

When (2) is reached, polling stops.
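
The polling decision can be sketched as a small state function. The doc does not say how the client detects that the repo is up to date, so this sketch assumes it can compare the latest revision ID committed to the repo with the latest revision ID on Wikipedia. All names here are hypothetical, and the real client would be JavaScript; Python is used only for illustration.

```python
from enum import Enum
from typing import Optional


class SyncStatus(Enum):
    PENDING = "pending"        # repo does not exist yet
    CREATED = "created"        # milestone 1: repo exists
    UP_TO_DATE = "up_to_date"  # milestone 2: latest revision is committed


def sync_status(repo_exists: bool, repo_rev_id: Optional[int],
                latest_rev_id: int) -> SyncStatus:
    """Classify what the poller has observed so far."""
    if not repo_exists:
        return SyncStatus.PENDING
    if repo_rev_id is not None and repo_rev_id >= latest_rev_id:
        return SyncStatus.UP_TO_DATE
    return SyncStatus.CREATED


def should_keep_polling(status: SyncStatus) -> bool:
    """Polling stops only once the repo is up to date."""
    return status is not SyncStatus.UP_TO_DATE
```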

Request handler

The request handler is implemented as an AWS Lambda function. It validates the incoming Wikipedia URL, converts it to a GitHub repo URL, and places the pair on the job queue.
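
A minimal sketch of the validate-and-convert step is below. The repo naming scheme (language code, a dot, then the sanitized title) is an assumption made for illustration; the doc only says that repos live under wikimit-hub. The function name `to_repo_url` is hypothetical.

```python
import re
from urllib.parse import unquote, urlparse

GITHUB_ORG = "wikimit-hub"
# Assumption: only hosts like en.wikipedia.org are accepted.
_HOST_RE = re.compile(r"^([a-z-]+)\.wikipedia\.org$")


def to_repo_url(wikipedia_url: str) -> str:
    """Validate a Wikipedia article URL and map it to a GitHub repo URL."""
    parsed = urlparse(wikipedia_url)
    host = _HOST_RE.match(parsed.hostname or "")
    if parsed.scheme != "https" or host is None:
        raise ValueError("not a Wikipedia URL")
    prefix = "/wiki/"
    if not parsed.path.startswith(prefix):
        raise ValueError("not an article URL")
    title = unquote(parsed.path[len(prefix):])
    if not title:
        raise ValueError("missing article title")
    # Hypothetical naming scheme: replace anything outside a conservative
    # character set with "-". This is lossy, so distinct titles can collide.
    safe_title = re.sub(r"[^A-Za-z0-9._-]", "-", title)
    return f"https://github.com/{GITHUB_ORG}/{host.group(1)}.{safe_title}"
```

Because the mapping is a pure function of the input URL, the handler can return the expected repo URL before the repo exists, which is what the client needs in order to start polling. A real scheme would need to avoid the collisions noted in the comment.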

The handler will also:

  1. Make sure that the URL isn't already on the queue before putting it on
  2. Fail if the queue is full
  3. Throttle by IP address
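
The three checks above could look roughly like this. It is an in-memory sketch only: the queue hosting is still TBD, so the `Admission` class, its result strings, and the one-request-per-interval throttling rule are all assumptions.

```python
import time
from collections import deque
from typing import Deque, Dict, Optional


class Admission:
    """In-memory stand-in for the handler's checks before enqueueing."""

    def __init__(self, capacity: int, min_interval_s: float) -> None:
        self.capacity = capacity
        self.min_interval_s = min_interval_s
        self.pending: Deque[str] = deque()
        self.last_seen: Dict[str, float] = {}

    def submit(self, url: str, ip: str, now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        # Throttle: allow one request per IP per interval.
        last = self.last_seen.get(ip)
        if last is not None and now - last < self.min_interval_s:
            return "throttled"
        self.last_seen[ip] = now
        # Dedupe: a URL that is already queued is not queued again.
        if url in self.pending:
            return "duplicate"
        # Capacity: fail rather than grow without bound.
        if len(self.pending) >= self.capacity:
            return "full"
        self.pending.append(url)
        return "queued"
```

A "duplicate" result can still be reported to the user as success, since the repo URL is the same either way; that keeps requests idempotent.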

Job queue

The job queue is a short-lived list of URLs that need to be processed.

TBD on the hosting.

Sync agent

The sync agent monitors the job queue and is only active when there's work to be done. Jobs can be processed in parallel, but an agent must take a lock on the queue whenever it modifies it.

To do a job, the agent first creates a GitHub repo if one doesn't already exist; otherwise, it clones the existing repo to its local filesystem. It then fetches a bounded batch of revisions from Wikipedia, commits them, and pushes to GitHub. Finally, it (a) removes the job it just did from the queue, and (b) pushes a follow-up job to the end of the queue (if needed).
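
One way to turn a fetched revision into a commit is sketched below. It only builds the `git commit` arguments and environment rather than running git. The doc does not say how authorship or dates are carried over, so the use of git's author and committer environment variables, the placeholder email, and the `Revision` fields are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Tuple

# Hypothetical placeholder; Wikipedia does not expose editor emails.
PLACEHOLDER_EMAIL = "unknown@wikimit.invalid"


@dataclass(frozen=True)
class Revision:
    rev_id: int
    user: str
    timestamp: str  # UTC, formatted like "2004-05-01T12:00:00Z"
    comment: str


def commit_invocation(rev: Revision) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, extra environment) for committing one revision."""
    when = datetime.strptime(rev.timestamp, "%Y-%m-%dT%H:%M:%SZ")
    # Git's internal date format: "<unix timestamp> <timezone offset>".
    seconds = int(when.replace(tzinfo=timezone.utc).timestamp())
    git_date = f"{seconds} +0000"
    # git refuses empty messages, so fall back to the revision ID.
    message = rev.comment or f"Revision {rev.rev_id}"
    # --allow-empty covers revisions that leave the text unchanged.
    argv = ["git", "commit", "--allow-empty", "--message", message]
    env = {
        "GIT_AUTHOR_NAME": rev.user,
        "GIT_AUTHOR_EMAIL": PLACEHOLDER_EMAIL,
        "GIT_AUTHOR_DATE": git_date,
        "GIT_COMMITTER_DATE": git_date,
    }
    return argv, env
```

The returned environment would be merged into the process environment and passed to something like `subprocess.run`, after the revision text has been written and staged.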

Follow-up jobs are needed if:

  1. There are more revisions to process.
  2. The job failed. In this case, the job is marked as being a retry.
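
The follow-up rule can be written as a small pure function. The `Job` fields and the `follow_up` name are hypothetical; only the two conditions come from the list above.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Job:
    wikipedia_url: str
    repo_url: str
    is_retry: bool = False


def follow_up(job: Job, succeeded: bool, more_revisions: bool) -> Optional[Job]:
    """Return the job to append to the queue, or None if the page is done."""
    if not succeeded:
        # Condition 2: the job failed, so re-queue it marked as a retry.
        return replace(job, is_retry=True)
    if more_revisions:
        # Condition 1: progress was made but revisions remain.
        return replace(job, is_retry=False)
    return None
```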

TBD on the hosting.

Considerations

Security

Only the request handler and sync agent have access to the job queue (TBD on details).

The request handler throttles by IP (TBD on details).

The sync agent has access to a GitHub account with limited permissions, such that it can only push to the wikimit-hub project.

Reliability

Pages with a huge number of revisions don't block other work, because the sync agent does a small amount of work per job and then creates a follow-up job at the end of the queue.
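
A toy simulation illustrates the effect. It assumes a single agent, a fixed batch size, and no failures; `completion_order` is a hypothetical helper, not part of wikimit.

```python
from collections import deque
from typing import Dict, List


def completion_order(revision_counts: Dict[str, int], batch_size: int) -> List[str]:
    """Simulate batch-and-requeue and return pages in the order they finish."""
    queue = deque(revision_counts.items())
    finished: List[str] = []
    while queue:
        page, remaining = queue.popleft()
        remaining -= min(batch_size, remaining)  # process one batch
        if remaining > 0:
            queue.append((page, remaining))  # follow-up job goes to the back
        else:
            finished.append(page)
    return finished
```

With `completion_order({"huge": 1000, "small": 10}, batch_size=100)`, the small page finishes after the huge page's first batch rather than after all ten of them.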

Transient failures are handled by retry jobs; if a failure persists across retries, the job is dropped.

Alerting on:

  1. Queue is full
  2. Dropped jobs
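
A minimal sketch of the retry-then-drop policy with an alert hook follows. The attempt limit of 3 and the list used as an alert sink are assumptions; the doc does not specify either.

```python
from typing import List

MAX_ATTEMPTS = 3  # hypothetical limit; the doc leaves this open


def after_failure(job_id: str, attempts: int, alerts: List[str],
                  max_attempts: int = MAX_ATTEMPTS) -> bool:
    """Return True to retry the job, or False after dropping it with an alert."""
    if attempts < max_attempts:
        return True
    alerts.append(f"dropped job {job_id} after {attempts} attempts")
    return False
```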

Legal

Wikipedia content can be redistributed as long as the redistributed copy is also licensed under CC-BY-SA.

Wikipedia limits API access to two simultaneous requests per IP.
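
Within a single process, a cap on simultaneous API requests can be enforced with a semaphore, as sketched below. The wrapper name `with_api_slot` is hypothetical. Agents running in separate processes behind the same IP would need a shared limiter instead.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")

MAX_CONCURRENT_API_CALLS = 2  # the per-IP limit stated above
_api_slots = threading.BoundedSemaphore(MAX_CONCURRENT_API_CALLS)


def with_api_slot(call: Callable[[], T]) -> T:
    """Run `call` while holding one of the two API slots."""
    with _api_slots:
        return call()
```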

Performance

Initial testing shows that converting a page with 1000 revisions takes about 100 seconds, with about 90% of that time spent running "git commit". This is pretty slow but not catastrophic.

A small article with 1000 revisions takes about 6 MB of disk space. This should be manageable.

Both speed and size need to be stress tested with larger articles. TBD.
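
Until those tests exist, a rough first guess can be made by scaling the two measurements above linearly. Linear scaling is an assumption that the stress tests may well disprove, since larger articles likely have larger revisions; the `estimate` helper is hypothetical.

```python
from typing import Tuple

# Measured baseline from this doc: 1000 revisions, ~100 s, ~6 MB.
BASELINE_REVISIONS = 1000
BASELINE_SECONDS = 100
BASELINE_MEGABYTES = 6


def estimate(revisions: int) -> Tuple[float, float]:
    """Return (seconds, megabytes), assuming cost scales linearly."""
    seconds = revisions * BASELINE_SECONDS / BASELINE_REVISIONS
    megabytes = revisions * BASELINE_MEGABYTES / BASELINE_REVISIONS
    return seconds, megabytes
```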

Costs

Hosting solutions:

  1. GitHub static web page (free)
  2. Request handler ($?)
  3. Job queue ($?)
  4. Sync agent ($?)
  5. Wikipedia API use (free)
  6. GitHub account use (free, but has some limits)
  7. Domain name ($10/yr)

TBD on actual cost estimates.

Open questions

See all the TBD items above.