alphagov / govuk-knowledge-graph-gcp

GOV.UK content data and cloud infrastructure for the GovSearch app.
https://docs.data-community.publishing.service.gov.uk/tools/govgraph/
MIT License

Follow redirects #538

Open nacnudus opened 1 year ago

nacnudus commented 1 year ago


Suppose /old-page has been unpublished and redirected to /new-page. You want to find pages that link to /new-page, and you would like pages that still link to /old-page to appear in the search results.

This could be done for GOV.UK redirects in a similar way to how we follow taxons up the hierarchy, with a WITH RECURSIVE SQL statement.
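A minimal sketch of the WITH RECURSIVE idea, in case it helps. It uses Python's built-in sqlite3 only so that it runs anywhere. The tables `redirects(from_url, to_url)` and `links(source_url, target_url)` and the function names are hypothetical, not the real schema. The query collects the target plus every URL that redirects to it through any number of hops, then finds the pages linking to any of them.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical redirect and link tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE redirects (from_url TEXT, to_url TEXT);
        CREATE TABLE links (source_url TEXT, target_url TEXT);
        -- /older-page -> /old-page -> /new-page is a two-hop redirect chain.
        INSERT INTO redirects VALUES
            ('/older-page', '/old-page'),
            ('/old-page', '/new-page');
        INSERT INTO links VALUES
            ('/a', '/new-page'),
            ('/b', '/old-page'),
            ('/c', '/older-page'),
            ('/d', '/unrelated');
        """
    )
    return conn


def pages_linking_to(conn, target):
    """Return pages that link to `target` directly or via redirected URLs."""
    query = """
        WITH RECURSIVE aliases(url) AS (
            SELECT ?                      -- start from the page of interest
            UNION                         -- UNION (not UNION ALL) drops repeats,
                                          -- so a redirect loop cannot recurse forever
            SELECT r.from_url
            FROM redirects AS r
            JOIN aliases AS a ON r.to_url = a.url
        )
        SELECT DISTINCT l.source_url
        FROM links AS l
        JOIN aliases AS a ON l.target_url = a.url
        ORDER BY l.source_url
    """
    return [row[0] for row in conn.execute(query, (target,))]


if __name__ == "__main__":
    print(pages_linking_to(build_demo_db(), "/new-page"))
```

The recursion walks the redirects backwards (from destination to source), which is the same shape as walking taxons up the hierarchy.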

For links to external sites, we'd have to visit the links to find out where they redirect to.
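A rough sketch of what visiting the links could look like, using only the Python standard library. The function names `head` and `follow` are made up for illustration, and the hop limit of 10 is an arbitrary assumption. It sends a HEAD request, reads the Location header on a redirect status, and repeats until it reaches a non-redirect response or detects a loop.

```python
from http.client import HTTPConnection, HTTPSConnection
from urllib.parse import urljoin, urlsplit

REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def head(url):
    """Send one HEAD request without following redirects.

    Returns (status_code, location_header_or_None).
    """
    parts = urlsplit(url)
    conn_cls = HTTPSConnection if parts.scheme == "https" else HTTPConnection
    conn = conn_cls(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def follow(url, fetch=head, max_hops=10):
    """Return the chain of URLs visited, starting with `url`.

    `fetch` is injectable so the logic can be exercised without a network.
    """
    chain = [url]
    for _ in range(max_hops):
        status, location = fetch(chain[-1])
        if status not in REDIRECT_STATUSES or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        next_url = urljoin(chain[-1], location)
        if next_url in chain:  # redirect loop, stop here
            break
        chain.append(next_url)
    return chain
```

The last item of the returned chain is where the link ends up. Some servers handle HEAD differently from GET, so a real crawler might need to fall back to GET.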

hwrightson commented 1 year ago

Very poorly structured work on how to do this can be found in my repo here: https://github.com/alphagov/data-insights-sandbox/tree/main/hyperlink_tester

At the moment this pulls the links from the gov.uk-knowledge-graph content embedded_links table and, for each link, returns:

  1. The link
  2. The link status code
  3. If it exists, a list of historic status codes, else null
  4. If it exists, a list of historic links, else null

I will slightly refine this to extract only the final item from the historic status codes and links so that it answers the question raised in the original issue.
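The refinement could look roughly like this. The record layout and the field names (`link`, `status`, `history_statuses`, `history_links`) are assumptions for illustration, not the tester's actual output format. Whether the last historic item is really the final destination depends on how the tester records its history.

```python
def last_hop(record):
    """Reduce one tester record to the final item of each history list.

    When a history list is null or empty, fall back to the link itself
    and its own status code.
    """
    statuses = record.get("history_statuses") or []
    links = record.get("history_links") or []
    return {
        "link": record["link"],
        "final_status": statuses[-1] if statuses else record["status"],
        "final_link": links[-1] if links else record["link"],
    }
```

A link with no history comes back unchanged, so the output can be treated uniformly whether or not a redirect happened.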

nacnudus commented 9 months ago

A user stumbled on this problem.

I need to find all the mainstream pages that link to this page: https://www.gov.uk/guidance/visa-processing-times-applications-outside-the-uk. I know there are at least 2, because I stumbled across them. When I use the links tab in govspeak to search for pages that link there, only 5 whitehall pages come up. No mainstream pages, even though I know these ones do link there: https://www.gov.uk/tier-1-investor/extend-your-visa and https://www.gov.uk/global-talent

Those two mainstream pages link to https://www.gov.uk/guidance/visa-decision-waiting-times-applications-outside-the-uk, which redirects to https://www.gov.uk/guidance/visa-processing-times-applications-outside-the-uk, hence the user's expectation that GovSearch would include the pages in a search for ones that link to https://www.gov.uk/guidance/visa-processing-times-applications-outside-the-uk.