Open nacnudus opened 1 year ago
Very poorly structured work on how to do this can be found in my repo here: https://github.com/alphagov/data-insights-sandbox/tree/main/hyperlink_tester
At the moment this pulls the links from gov.uk-knowledge-graph content embedded_links table and for each link returns:
I will slightly refine this to extract only the final item from the historic status codes and links so that it answers the question raised in the original issue.
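As a rough illustration of that refinement, here is a minimal sketch. It assumes each checked link yields an ordered history of (status code, URL) hops. The `Hop` type, the `final_hop` helper, and the 301/200 status codes in the example are hypothetical and are not taken from the hyperlink_tester code.

```python
from typing import NamedTuple


class Hop(NamedTuple):
    """One step in a link's redirect history (hypothetical shape)."""
    status_code: int
    url: str


def final_hop(history: list[Hop]) -> Hop:
    """Return only the last hop, i.e. where the link finally lands."""
    if not history:
        raise ValueError("history is empty")
    return history[-1]


# Illustrative history for the redirect described in this issue.
# The status codes are assumed, not measured.
history = [
    Hop(301, "https://www.gov.uk/guidance/visa-decision-waiting-times-applications-outside-the-uk"),
    Hop(200, "https://www.gov.uk/guidance/visa-processing-times-applications-outside-the-uk"),
]

last = final_hop(history)
```

Keeping only the final hop gives one destination URL per link, which is what a "which pages link here" search would need to match against.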
A user stumbled on this problem.
I need to find all the mainstream pages that link to this page: https://www.gov.uk/guidance/visa-processing-times-applications-outside-the-uk
I know there are at least 2, because I stumbled across them. When I use the links tab in govspeak to search for pages that link there, only 5 whitehall pages come up. No mainstream pages, even though I know these ones do link there:
https://www.gov.uk/tier-1-investor/extend-your-visa
https://www.gov.uk/global-talent
Those two mainstream pages link to https://www.gov.uk/guidance/visa-decision-waiting-times-applications-outside-the-uk, which redirects to https://www.gov.uk/guidance/visa-processing-times-applications-outside-the-uk. That is why the user expected GovSearch to include them in a search for pages that link to https://www.gov.uk/guidance/visa-processing-times-applications-outside-the-uk.
Suppose /old-page has been unpublished and redirected to /new-page. You want to find pages that link to /new-page, and you would like pages that still link to /old-page to appear in the search results.
This could be done for GOV.UK redirects in a similar way to how we follow taxons up the hierarchy, with a WITH RECURSIVE SQL statement.
For links to external sites, we'd have to visit the links to find out where they redirect to.
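The WITH RECURSIVE idea for GOV.UK redirects could look something like the sketch below. It uses an in-memory SQLite database as a stand-in so that it runs on its own. The `redirects` and `links` tables, their columns, the extra /older-page hop, and the `pages_linking_to` helper are all hypothetical and do not reflect the real knowledge graph schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: one row per redirect, one row per link.
    CREATE TABLE redirects (from_url TEXT, to_url TEXT);
    CREATE TABLE links (source_page TEXT, target_url TEXT);
    INSERT INTO redirects VALUES
        ('/older-page', '/old-page'),
        ('/old-page', '/new-page');
    INSERT INTO links VALUES
        ('/page-a', '/new-page'),
        ('/page-b', '/old-page'),
        ('/page-c', '/older-page'),
        ('/page-d', '/unrelated');
""")

# Start from the target URL and walk the redirects backwards, collecting
# every URL that eventually redirects to it. UNION (rather than UNION ALL)
# discards rows already seen, so a redirect loop cannot recurse forever.
QUERY = """
WITH RECURSIVE aliases(url) AS (
    SELECT :target
    UNION
    SELECT r.from_url
    FROM redirects AS r
    JOIN aliases AS a ON r.to_url = a.url
)
SELECT l.source_page, l.target_url
FROM links AS l
JOIN aliases AS a ON l.target_url = a.url
ORDER BY l.source_page
"""


def pages_linking_to(conn: sqlite3.Connection, target: str) -> list[tuple[str, str]]:
    """Pages linking to `target` directly or via any chain of redirects."""
    return conn.execute(QUERY, {"target": target}).fetchall()


rows = pages_linking_to(conn, "/new-page")
for source_page, target_url in rows:
    print(source_page, "links to", target_url)
```

In this toy data, a search for /new-page also returns the pages that still link to /old-page and /older-page. Links to external sites are not covered, because their redirects would have to be discovered by visiting them.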