Extracting content with Selenium proved to be horrifically slow, so this massive
PR reimplements it with other technology. It grew into rather a beast.
Main changes
Use pandoc instead of Selenium to extract plain text from HTML. Selenium would have been better because it uses real web browsers, so it renders the plain text exactly as GOV.UK users are likely to see it. Nokogiri and similar HTML parsers such as BeautifulSoup are fast, but differ from browsers in the way they render tags such as `<br>` and `<h1>`. Pandoc is a compromise: fairly widely used (among technologists), and it renders HTML for reading rather than for strict adherence to HTML semantics.
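For a concrete picture, here's a minimal sketch of the kind of conversion involved, shelling out to pandoc from Ruby. The `html_to_plain_text` helper is hypothetical (it assumes pandoc is on the PATH), not the pipeline's actual code:

```ruby
require "open3"

# Hypothetical helper (not the pipeline's actual code): convert an HTML
# fragment to plain text by piping it through pandoc via stdin/stdout.
def html_to_plain_text(html)
  stdout, status = Open3.capture2(
    "pandoc", "--from", "html", "--to", "plain",
    stdin_data: html
  )
  raise "pandoc failed" unless status.success?
  stdout
end

puts html_to_plain_text("<h1>Title</h1><p>First line<br>second line</p>")
# pandoc renders this for reading, e.g. the <br> becomes a line break
# rather than being dropped the way some parsers would treat it.
```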
Use Nokogiri instead of Selenium to extract various HTML tags such as `<a>` (hyperlinks) and `<abbr>` (abbreviations). Nokogiri does this task perfectly well; we had only been using Selenium for it because we were already parsing HTML to text, so we might as well extract the tags too.
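And a sketch of the Nokogiri side, with the same caveat that the structure is illustrative rather than the real extraction code:

```ruby
require "nokogiri"

html = <<~HTML
  <p>See <a href="https://www.gov.uk/vat-rates">VAT rates</a> and
  <abbr title="Value Added Tax">VAT</abbr> guidance.</p>
HTML

doc = Nokogiri::HTML(html)

# Hyperlinks: pair each anchor's text with its href attribute.
links = doc.css("a").map { |a| { text: a.text, href: a["href"] } }

# Abbreviations: pair the abbreviated text with its expansion.
abbreviations = doc.css("abbr").map { |el| { text: el.text, title: el["title"] } }

p links          # [{:text=>"VAT rates", :href=>"https://www.gov.uk/vat-rates"}]
p abbreviations  # [{:text=>"VAT", :title=>"Value Added Tax"}]
```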
Pipeline
This is implemented as a single query, and populates only two tables:
- `public.content` contains a row for every "page" (where a part of a `guide` or `travel_advice` document is a page in its own right), its govspeak and HTML content, plain text extracted from the HTML, lines of text split from the plain text, and arrays of other tags found in the page such as hyperlinks and abbreviations.
- `public.content_new` is the same as `public.content`, but contains only the rows that were created in the last batch.
An alternative would have been to break the query up, and/or to persist some of the CTEs (common table expressions, i.e. intermediate steps) as tables in their own right. That would have complicated the query scheduling, and created clutter for most users, who aren't expected to need any of the intermediate tables.
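For a sense of the shape (schematic only; the CTE and source-table names below are invented, and the real query does much more):

```sql
-- Schematic only: intermediate steps live in CTEs, so nothing but the
-- public tables is ever materialised. Names here are invented.
CREATE OR REPLACE TABLE public.content_new AS
WITH
  pages AS (
    -- one row per "page", including each part of a guide/travel_advice
    SELECT url, govspeak, html FROM source.editions
  ),
  extracted AS (
    -- plain text from the HTML, lines split at newlines, tag arrays, etc.
    SELECT url, govspeak, html FROM pages
  )
SELECT * FROM extracted;
```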
What is extracted from page content
- govspeak (if present)
- HTML (derived from govspeak if not already present)
- plain text (extracted from HTML)
- individual lines of text (derived from plain text by splitting at newline characters)
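Taken together, the derivation chain is roughly the following sketch, assuming the alphagov govspeak gem and reusing the hypothetical `html_to_plain_text` helper from earlier:

```ruby
require "govspeak"  # alphagov's govspeak gem

govspeak = "## Apply online\n\nYou can apply *online* or by post."

# HTML is derived from the govspeak when it isn't already present.
html = Govspeak::Document.new(govspeak).to_html

# Plain text is extracted from the HTML (via pandoc; see the helper above).
plain_text = html_to_plain_text(html)

# Individual lines are split from the plain text at newline characters.
lines = plain_text.split("\n")
```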