alphagov / govuk-knowledge-graph-gcp

GOV.UK content data and cloud infrastructure for the GovSearch app.
https://docs.data-community.publishing.service.gov.uk/tools/govgraph/
MIT License

GOV.UK Knowledge Graph

Documentation

Most documentation is in README.md files and the docs directory in this repository. There is also GOV.UK Data Community Technical Documentation.

Data pipeline overview

  1. A workflow subscribes to notifications from the GOV.UK S3 Mirror that a new database backup of the Publishing API is available. The workflow creates an instance of a virtual machine.
  2. The virtual machine fetches the database backup file, extracts its data, and uploads that into BigQuery.
  3. Scheduled SQL queries run daily and call other SQL routines, which refresh various tables from the newly uploaded data.

Access and permissions

People are granted access by membership of Google Groups. Other Google Cloud Platform projects are granted access via service accounts. Access is granted by editing each environment's tfvars file, such as terraform-dev/environment.auto.tfvars.

Google Groups

Tests

There are hardly any tests.

SQL

The most likely cause of an error in GovSearch queries is a change to the data and document schemas in the Publishing API.

It is difficult, in general, to test chains of SQL statements. dbt is a popular tool for doing so, but it adds a considerable layer of abstraction and requires Python, which is discouraged in GOV.UK.

A scheduled query runs every hour, and raises an error if any tables have zero rows or have not been updated in the past 25 hours. The error is automatically detected in the logs, and an alert is raised, which sends an email to the govsearch-developers Google Group. Once the problem has been addressed, close the issue.
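The check described above can be sketched outside SQL to make the rule explicit. This is a minimal illustration in Ruby, not the scheduled query itself: the TableStatus shape and its field names are assumptions, because this README does not list the columns that the real check reads.

```ruby
# Hypothetical row shape; the real check's column names are not given here.
TableStatus = Struct.new(:name, :row_count, :last_modified, keyword_init: true)

MAX_AGE_SECONDS = 25 * 60 * 60 # the 25-hour threshold described above

# Returns a list of human-readable problems, empty when every table is healthy.
def table_problems(tables, now: Time.now.utc)
  tables.flat_map do |table|
    problems = []
    problems << "#{table.name} has zero rows" if table.row_count.zero?
    if now - table.last_modified > MAX_AGE_SECONDS
      problems << "#{table.name} has not been updated in the past 25 hours"
    end
    problems
  end
end

# Example with made-up tables and a fixed clock.
now = Time.utc(2024, 1, 2, 12, 0, 0)
tables = [
  TableStatus.new(name: "fresh", row_count: 10, last_modified: now - 3600),
  TableStatus.new(name: "empty", row_count: 0, last_modified: now - 3600),
  TableStatus.new(name: "stale", row_count: 10, last_modified: now - (26 * 3600)),
]
puts table_problems(tables, now: now)
```

Passing the clock in as an argument keeps the rule easy to test, which matters given how few tests the SQL has.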

Ruby

Two of the BigQuery Remote Functions, parse-html and html-to-text, are implemented in Ruby and have unit tests. The other BigQuery Remote Functions are fairly trivial.
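For readers unfamiliar with BigQuery Remote Functions: BigQuery sends an HTTP request whose JSON body contains a calls array (one argument list per row) and expects a JSON response with a replies array of the same length. The sketch below shows that contract in Ruby with the HTTP layer left out. The function names are hypothetical, and the tag stripper is a deliberately naive stand-in, not the repository's html-to-text implementation.

```ruby
require "json"

# Deliberately naive tag stripper, for illustration only. It is NOT the
# repository's html-to-text implementation; real HTML needs a proper parser.
def naive_html_to_text(html)
  html.gsub(/<[^>]*>/, " ").gsub(/\s+/, " ").strip
end

# Takes the JSON request body sent by BigQuery and returns the JSON response
# body. Replies must be in the same order as the calls.
def handle_remote_function(request_body)
  calls = JSON.parse(request_body).fetch("calls")
  replies = calls.map do |args|
    html = args.first
    html.nil? ? nil : naive_html_to_text(html)
  end
  JSON.generate("replies" => replies)
rescue JSON::ParserError, KeyError => e
  # BigQuery reads an "errorMessage" field from failed responses.
  JSON.generate("errorMessage" => e.message)
end

request = JSON.generate("calls" => [["<p>Hello <b>world</b></p>"], [nil]])
puts handle_remote_function(request)
```

Keeping the handler a plain string-to-string function means it can be unit tested without starting a web server.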

Maintainers

This project is maintained by the GOV.UK team, which is part of the Government Digital Service.

Common tasks

Import data from somewhere new

For an example, see https://github.com/alphagov/govuk-knowledge-graph-gcp/pull/594, which derives data from the Publisher app database and puts it into BigQuery.

Troubleshooting

Outdated or empty BigQuery tables

If GovSearch gives unexpected results, the tables in BigQuery might not have been updated correctly. Usually that means a table either hasn't been updated at all within the last 24 hours, or has been updated and is now empty. You can quickly check every table by querying the view test.tables-metadata, with a query such as SELECT * FROM test.tables-metadata;. The view is checked automatically every hour, and if the check finds old or empty tables then an 'incident' is created and an email is sent to govgraph-developers@digital.cabinet-office.gov.uk.

Source data glitch

Check that the database backup files in the govuk-s3-mirror are the expected size (many gigabytes) by looking in the bucket.
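If the size check needs to be repeated often, it could be scripted. The sketch below assumes a bucket listing has already been fetched into a list of hashes. The threshold, the object keys and the function name are all hypothetical, since this README only says that the files should be many gigabytes.

```ruby
# Hypothetical threshold: the README only says "many gigabytes".
MIN_BACKUP_BYTES = 1 * 1024**3

# `objects` is a list of hashes such as { key: "...", size: 123 }, for example
# built from a bucket listing. Returns the keys that look too small.
def suspiciously_small_backups(objects, min_bytes: MIN_BACKUP_BYTES)
  objects.select { |object| object[:size] < min_bytes }.map { |object| object[:key] }
end

# Made-up listing: one plausible backup and one truncated file.
listing = [
  { key: "example-backup-a.gz", size: 12 * 1024**3 },
  { key: "example-backup-b.gz", size: 4_096 },
]
puts suspiciously_small_backups(listing)
```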

Check that the Publishing API hasn't changed its schemas.

Other representations of GOV.UK content

There are several different representations of GOV.UK content, including:

None of these representations met a need for advanced searching and filtering for content designers, or a need for low-level structured data for developing data science applications. Hence the Knowledge Graph was developed.

Technical debt

See Technical debt.

Contributing

You are welcome to:

Licence

Unless stated otherwise, the codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation.

The documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.