alphagov / link-checker-api

Checks links on GOV.UK
https://docs.publishing.service.gov.uk/apps/link-checker-api.html
MIT License

New endpoint to accept a GOV URL and crawl it to find the links? #473

Open guy-roberts opened 3 years ago

guy-roberts commented 3 years ago

For our largely static site, it would be useful to let the API code find the URLs by crawling from our home page.

This must have been considered before, so I bet I am missing something. If I raised a PR to do this, am I likely to meet any showstoppers?

Also, are there any instances running that our DfE project could use rather than hosting it ourselves?

thomasleese commented 3 years ago

:wave: This has been considered before, but the way we use the API on GOV.UK means that it's never been a requirement. Our publishing tools use this API to check the links of an individual document before it's published on GOV.UK. This means the API can't see the new page, so the publishing tool needs to extract the links itself and send them to the Link Checker API.

For your needs, it does sound like it would be useful to get Link Checker API to do the crawling. Working out how it would fit into the app could be a little tricky, as the expectation is that the API receives a set of links. However, I don't see any major technical blockers to getting the API to build a Batch itself by crawling an initial link.
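As a rough illustration of "building a Batch by crawling an initial link", here is a minimal sketch. It is written in Python with only the standard library rather than in the app's own Ruby, and it is not code from Link Checker API. The names `extract_links`, `crawl_for_links`, `fetch` and `max_pages` are all hypothetical. It assumes the crawl stays on the starting host and that page fetching is supplied by the caller, so the sketch itself makes no network requests.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, List, Optional
from urllib.parse import urldefrag, urljoin, urlsplit


class _LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html: str, base_url: str) -> List[str]:
    """Return absolute http(s) links found in html, with fragments removed."""
    parser = _LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlsplit(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links


def crawl_for_links(
    start_url: str,
    fetch: Callable[[str], Optional[str]],  # hypothetical: returns HTML or None
    max_pages: int = 50,  # assumed safety limit, not from the thread
) -> List[str]:
    """Breadth-first crawl from start_url; return every distinct link seen.

    Only pages on the same host as start_url are fetched, but links to
    other hosts are still collected so that they can be checked.
    """
    start_host = urlsplit(start_url).netloc
    queue = deque([start_url])
    visited = set()
    seen_links = set()
    found: List[str] = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        if page in visited:
            continue
        visited.add(page)
        html = fetch(page)
        if html is None:
            continue
        for link in extract_links(html, page):
            if link not in seen_links:
                seen_links.add(link)
                found.append(link)
            if urlsplit(link).netloc == start_host and link not in visited:
                queue.append(link)
    return found
```

The returned list is the kind of "set of links" the API already expects to receive, so in principle it could be handed to the existing batch-checking code unchanged. Passing `fetch` in as a parameter keeps the crawl logic testable against a fixed set of pages.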

I think the biggest problem is that it would be a feature that we wouldn't use on GOV.UK, so there would be an added maintenance cost to us which I'm not sure we'd be able to support. I don't see any reason not to raise a PR though; you could always run a fork of our API containing the feature you need.

In terms of a live API, unfortunately there isn't one available at the moment, as we run it privately within our infrastructure. It shouldn't be too difficult to run yourself as it's a standard Rails app; the difficult part might be getting it to work without our API key authentication.

guy-roberts commented 3 years ago

Thanks for your quick response. We might well do a PR then, because we need such a thing. It could be a new API endpoint that accepts a URL, checks that it's a GOV URL, then crawls to find all of the links under it. From then on it would just use the existing code.
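The two checks described above could look something like the following. This is only a sketch in Python using the standard library, not code from the app, and the helper names `is_gov_url` and `is_under` are hypothetical. It assumes that "gov" means a host of `gov.uk` or ending in `.gov.uk`, and that "under it" means the same host with a path at or below the starting URL's path. Both assumptions would need confirming for a real endpoint.

```python
from urllib.parse import urlsplit

# Assumption: a "GOV URL" is one whose host is gov.uk or a subdomain of it.
GOV_SUFFIX = ".gov.uk"


def is_gov_url(url: str) -> bool:
    """True if url is http(s) and its host is gov.uk or ends with .gov.uk."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    return host == "gov.uk" or host.endswith(GOV_SUFFIX)


def is_under(url: str, root: str) -> bool:
    """True if url is on root's host and its path is at or below root's path."""
    u, r = urlsplit(url), urlsplit(root)
    if (u.hostname or "").lower() != (r.hostname or "").lower():
        return False
    # Compare against the root path with a trailing slash so that
    # /guides does not match /guides-old.
    root_path = r.path if r.path.endswith("/") else r.path + "/"
    url_path = u.path or "/"
    return url_path == r.path or url_path.startswith(root_path)
```

Matching on the parsed hostname rather than on the raw string avoids accepting look-alike hosts such as `gov.uk.evil.example`.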

Our project is for the DfE, thanks.