alphagov / govuk-knowledge-graph-gcp

GOV.UK content data and cloud infrastructure for the GovSearch app.
https://docs.data-community.publishing.service.gov.uk/tools/govgraph/
MIT License

Rewrite content extraction #641

Closed nacnudus closed 4 months ago

nacnudus commented 4 months ago

Extracting content with Selenium proved to be horrifically slow, so this massive PR reimplements it with other technology. It grew into rather a beast.

Main changes

Pipeline

This is implemented as a single query, and populates only two tables:

An alternative would have been to break the query up, and/or to persist some of the CTEs (common table expressions, i.e. intermediate steps) as tables in their own right. That would have complicated the query scheduling and created clutter for most users, who aren't expected to need any of the intermediate tables.
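To illustrate the trade-off, here is a minimal sketch of the single-query pattern. It is not the PR's actual query: it uses Python's built-in `sqlite3` as a stand-in for the real warehouse, and the table names (`pages`, `content`), the CTE names, and the toy tag-stripping logic are all hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages (url TEXT, html TEXT);      -- hypothetical source table
    INSERT INTO pages VALUES
        ('/a', '<p>Hello</p>'),
        ('/b', '<p>World</p>');
    CREATE TABLE content (url TEXT, text TEXT);    -- hypothetical output table
""")

# One statement: each CTE is an intermediate step that exists only while the
# query runs, so nothing but the output table is left behind afterwards.
conn.execute("""
    WITH stripped_open AS (
        SELECT url, REPLACE(html, '<p>', '') AS html FROM pages
    ),
    stripped_close AS (
        SELECT url, REPLACE(html, '</p>', '') AS text FROM stripped_open
    )
    INSERT INTO content (url, text)
    SELECT url, text FROM stripped_close
""")

rows = conn.execute("SELECT url, text FROM content ORDER BY url").fetchall()
tables = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(rows)
print(tables)
```

Persisting `stripped_open` or `stripped_close` as real tables would mean one scheduled statement per step, run in the right order, plus extra tables in the dataset. The single statement avoids both, at the cost of not being able to inspect the intermediate results afterwards.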

What is extracted from page content