Extracting content with Selenium proved to be horrifically slow, so this massive
PR reimplements it with other technology. It grew into rather a beast.
Main changes
Use pandoc instead of Selenium to extract plain text from HTML. Selenium would have been better because it uses real web browsers, so it renders the plain text exactly as GOV.UK users are likely to see it. Nokogiri and similar HTML parsers such as BeautifulSoup are fast, but differ from browsers in the way they render tags such as `<br>` and `<h1>`. Pandoc is a compromise: fairly widely used (among technologists), and it renders HTML for reading rather than for strict adherence to HTML semantics.
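For a concrete picture, here's a minimal sketch of the kind of conversion involved, shelling out to pandoc from Ruby. The `html_to_plain_text` helper is hypothetical (it assumes pandoc is on the PATH), not the pipeline's actual code:

```ruby
require "open3"

# Hypothetical helper (not the pipeline's actual code): convert an HTML
# fragment to plain text by piping it through pandoc via stdin/stdout.
def html_to_plain_text(html)
  stdout, status = Open3.capture2(
    "pandoc", "--from", "html", "--to", "plain",
    stdin_data: html
  )
  raise "pandoc failed" unless status.success?
  stdout
end

puts html_to_plain_text("<h1>Title</h1><p>First line<br>second line</p>")
# pandoc renders this for reading, e.g. the <br> becomes a line break
# rather than being dropped the way some parsers would treat it.
```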
Use Nokogiri instead of Selenium to extract various HTML tags such as `<a>` (hyperlinks) and `<abbr>` (abbreviations). Nokogiri does this task perfectly well; we had only been using Selenium for it because we were already parsing HTML to text, so we might as well extract the tags too.
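And a sketch of the Nokogiri side, with the same caveat that the structure is illustrative rather than the real extraction code:

```ruby
require "nokogiri"

html = <<~HTML
  <p>See <a href="https://www.gov.uk/vat-rates">VAT rates</a> and
  <abbr title="Value Added Tax">VAT</abbr> guidance.</p>
HTML

doc = Nokogiri::HTML(html)

# Hyperlinks: pair each anchor's text with its href attribute.
links = doc.css("a").map { |a| { text: a.text, href: a["href"] } }

# Abbreviations: pair the abbreviated text with its expansion.
abbreviations = doc.css("abbr").map { |el| { text: el.text, title: el["title"] } }

p links          # [{:text=>"VAT rates", :href=>"https://www.gov.uk/vat-rates"}]
p abbreviations  # [{:text=>"VAT", :title=>"Value Added Tax"}]
```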
Pipeline
This is implemented as a single query, and populates only two tables:
- `public.content` contains a row for every "page" (where a part of a `guide` or `travel_advice` document is a page in its own right), its govspeak and HTML content, plain text extracted from the HTML, lines of text split from the plain text, and arrays of other tags found in the page such as hyperlinks and abbreviations.
- `public.content_new` is the same as `public.content`, but contains only the rows that were created in the last batch.
An alternative would have been to break the query up, and/or to persist some of the CTEs (common table expressions, i.e. intermediate steps) as tables in their own right. That would have complicated the query scheduling, and created clutter for most users, who aren't expected to need any of the intermediate tables.
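For a sense of the shape (schematic only; the CTE and source-table names below are invented, and the real query does much more):

```sql
-- Schematic only: intermediate steps live in CTEs, so nothing but the
-- public tables is ever materialised. Names here are invented.
CREATE OR REPLACE TABLE public.content_new AS
WITH
  pages AS (
    -- one row per "page", including each part of a guide/travel_advice
    SELECT url, govspeak, html FROM source.editions
  ),
  extracted AS (
    -- plain text from the HTML, lines split at newlines, tag arrays, etc.
    SELECT url, govspeak, html FROM pages
  )
SELECT * FROM extracted;
```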
What is extracted from page content
- govspeak (if present)
- HTML (derived from govspeak if not already present)
- plain text (extracted from HTML)
- individual lines of text (derived from plain text by splitting at newline characters)
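Taken together, the derivation chain is roughly the following sketch, assuming the alphagov govspeak gem and reusing the hypothetical `html_to_plain_text` helper from earlier:

```ruby
require "govspeak"  # alphagov's govspeak gem

govspeak = "## Apply online\n\nYou can apply *online* or by post."

# HTML is derived from the govspeak when it isn't already present.
html = Govspeak::Document.new(govspeak).to_html

# Plain text is extracted from the HTML (via pandoc; see the helper above).
plain_text = html_to_plain_text(html)

# Individual lines are split from the plain text at newline characters.
lines = plain_text.split("\n")
```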