GSA / data.gov

Main repository for the data.gov service
https://data.gov

Replace Broken Link Checker #4277

Open btylerburton opened 1 year ago

btylerburton commented 1 year ago

User Story

To ensure the quality of our sites, the datagov team would like a reliable report on broken links.

Acceptance Criteria

Background

The datagov team currently uses a broken link checker for our static sites, but it's unreliable and consistently fails on false positives. Ideally, the new link checker should be configurable to ignore certain status codes or a list of pages, and should produce a report that can be "made green" in the near term, so that a failing report can then be made to fail the build. As it stands, the report is always failing, and not for valid reasons, so no triggers can be configured around its status.
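To make the requirement concrete, here is a minimal sketch of the behavior described above: results on ignored pages or with ignored status codes are dropped, and anything left over produces a non-zero exit code that a build could fail on. The names `LinkResult`, `CheckerConfig`, `failures`, and `exit_code` are hypothetical and do not come from any existing tool; treating every remaining status of 400 or above as a failure is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LinkResult:
    page: str    # page the link was found on
    url: str     # link target that was requested
    status: int  # HTTP status code returned for the target


@dataclass
class CheckerConfig:
    # Hypothetical settings; any values passed in are illustrative only.
    ignored_statuses: set = field(default_factory=set)
    ignored_pages: set = field(default_factory=set)


def failures(results, config):
    """Return the results that should count against the build."""
    failed = []
    for result in results:
        if result.page in config.ignored_pages:
            continue  # the whole page is excluded from the report
        if result.status in config.ignored_statuses:
            continue  # this status code is considered acceptable
        if result.status >= 400:
            failed.append(result)
    return failed


def exit_code(results, config):
    """0 when the report is green, 1 when it should fail the build."""
    return 1 if failures(results, config) else 0
```

With a report like this, the ignore lists can be used to get to green first and then tightened over time, which is what would let triggers be configured around the status.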

Security Considerations (required)

Fixing old links will improve the quality of the site and the user experience, but will likely not address any security concerns related to any domains that have come into the possession of bad actors.

Sketch

Also related:

hkdctol commented 1 year ago

@btylerburton so might this one also be addressed by #4476 ?

btylerburton commented 1 year ago

Yes ideally @hkdctol

rshewitt commented 10 months ago

added new relic link crawler here

rshewitt commented 10 months ago

clicking a point in a location graph navigates to the list of links tested. the set of links tested differs between htmlproofer and new relic. htmlproofer may be traversing more than we need?

rshewitt commented 10 months ago

notes on new relic link checker:

htmlproofer currently checks:

update: looks like the new relic link checker can identify a variety of types

btylerburton commented 10 months ago

Can the link checker alert us to 404's? Can it post to Slack?
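For reference, a rough sketch of what "alert on 404s and post to Slack" could look like if it had to be scripted rather than configured in the monitoring tool. It assumes a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names and the webhook URL passed to `post_to_slack` are hypothetical, and this is not how the new relic checker itself does alerting.

```python
import json
import urllib.request


def only_404(results):
    """Keep only the (url, status) pairs whose status is 404."""
    return [(url, status) for url, status in results if status == 404]


def build_slack_payload(site, broken):
    """Build the JSON body for a Slack incoming webhook message."""
    lines = [f"{len(broken)} broken link(s) found on {site}:"]
    lines += [f"- {url} (HTTP {status})" for url, status in broken]
    return {"text": "\n".join(lines)}


def post_to_slack(webhook_url, payload):
    """POST the payload to the webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,  # hypothetical: supplied by whoever owns the channel
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A caller would run `only_404` over the checker output, skip posting when the list is empty, and otherwise pass the result of `build_slack_payload` to `post_to_slack`.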

I just checked resources.data.gov and it shows no 404's but I know that's not the case as there's a few I confirmed from this run...

https://github.com/GSA/resources.data.gov/actions/runs/7006861012/job/19059663585

ex.

rshewitt commented 10 months ago

after upgrading htmlproofer from 3.x to 5.x to potentially address some issues, the resources site produces 284 failures. this includes checks on links, images, scripts, and html validation. this is a considerable number of failures, and switching to another utility (see the examples in the link of alternatives in the sketch) won't fix them. some examples of failures worth mentioning:

rshewitt commented 10 months ago

summary of failures using htmlproofer with the following flags: --ignore-status-codes "301,302,401,403,429" --checks='Links,Images,Scripts,Html' --no-check-external-hash --no-check-internal-hash --no-enforce-https

rshewitt commented 10 months ago

pausing work on this until group discussion on how we want to proceed.

btylerburton commented 10 months ago

let's chat about this at sync. looks like you found some good flags to use. however, i do believe we should be tracking 4xx series as errors since that means they're not publicly accessible.

rshewitt commented 10 months ago

htmlproofer offers a --only-4xx flag
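To show how the two approaches discussed here differ, the sketch below compares a simplified "ignore these status codes" policy with a simplified "only report 4xx" policy. This is a model of the idea, not htmlproofer's implementation: the function names are hypothetical, and treating any non-ignored status of 400 or above as a failure in the first policy is an assumption.

```python
# Status codes ignored in the earlier htmlproofer run quoted in this thread.
IGNORED_IN_EARLIER_RUN = {301, 302, 401, 403, 429}


def reported_with_ignore_list(status, ignored=IGNORED_IN_EARLIER_RUN):
    """Simplified model: report error statuses unless explicitly ignored."""
    return status >= 400 and status not in ignored


def reported_with_only_4xx(status):
    """Simplified model: report a status only if it is in the 4xx range."""
    return 400 <= status <= 499


def compare(statuses):
    """Return (reported under ignore list, reported under only-4xx)."""
    by_ignore_list = [s for s in statuses if reported_with_ignore_list(s)]
    by_only_4xx = [s for s in statuses if reported_with_only_4xx(s)]
    return by_ignore_list, by_only_4xx
```

Under this model, 401, 403, and 429 are hidden by the ignore list but surface with the 4xx-only policy, while a 500 moves the other way. That lines up with the point above about tracking the 4xx series as errors.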

rshewitt commented 10 months ago

here's the errors for the 4 static sites. this is the raw data from the terminal so if it's best i format them let me know. I used these flags for the runs --checks='Links,Images,Scripts,Html' --only-4xx --no-enforce-https --allow-missing-href --ignore-urls '/localhost./'. the error count for these will differ slightly from what i reported before because i'm using different flags. I think the ones i've chosen this time make sense but i'm okay with changing them to whatever we want.
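As a side note on the `--ignore-urls '/localhost./'` value, here is a small sketch of how a regex-style ignore pattern filters URLs, assuming the slashes mark a regular expression and the pattern is searched anywhere in the URL. `is_ignored` and `IGNORE_PATTERNS` are hypothetical names, and Python's `re` module is only standing in for whatever matching the tool does internally.

```python
import re

# Pattern taken from the --ignore-urls value above, without the surrounding
# slashes. The trailing "." matches any single character, so at least one
# character has to follow "localhost" for the pattern to match.
IGNORE_PATTERNS = [re.compile(r"localhost.")]


def is_ignored(url, patterns=IGNORE_PATTERNS):
    """Return True when any ignore pattern is found anywhere in the URL."""
    return any(pattern.search(url) for pattern in patterns)


def filter_urls(urls, patterns=IGNORE_PATTERNS):
    """Drop URLs that match an ignore pattern, keeping the original order."""
    return [url for url in urls if not is_ignored(url, patterns)]
```

Because the match is unanchored, any URL that merely contains "localhost" followed by another character is skipped too, which may or may not be the intent.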

david-waltermire commented 6 months ago

In my prior role at NIST, we had great success using lychee to check links against a generated version of the site in CI. This workflow builds the site and runs link checking on the generated sources. The workflow is set up to work with Hugo, but other static site generators can be easily configured.

I am considering setting up something like this for fedramp.gov and marketplace.fedramp.gov.
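To illustrate the idea of checking links against the generated site rather than the live one, here is a minimal standard-library sketch that walks a build output directory and reports internal links whose target file does not exist. It is not lychee and does not reflect how lychee works internally. The function names are hypothetical, and it assumes directory-style URLs resolve to `index.html`.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urlparse


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def resolve_target(site_root, page, href):
    """Map an internal href to a file path; None for external or anchor-only links."""
    parsed = urlparse(href)
    if parsed.scheme or parsed.netloc or not parsed.path:
        return None
    if parsed.path.startswith("/"):
        target = site_root / parsed.path.lstrip("/")
    else:
        target = page.parent / parsed.path
    if parsed.path.endswith("/") or target.is_dir():
        target = target / "index.html"  # assumption: directory-style URLs
    return target


def broken_internal_links(site_root):
    """Return (page, href) pairs whose internal target file is missing."""
    site_root = Path(site_root)
    broken = []
    for page in sorted(site_root.rglob("*.html")):
        collector = HrefCollector()
        collector.feed(page.read_text(encoding="utf-8"))
        for href in collector.hrefs:
            target = resolve_target(site_root, page, href)
            if target is not None and not target.exists():
                broken.append((str(page.relative_to(site_root)), href))
    return broken
```

Running a check like this after the build step means internal links are validated without any network access; external links would still need a real HTTP checker.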

btylerburton commented 6 months ago

Thanks for the recommendation @david-waltermire! I also found that lychee has a GitHub Action as well, so it's even easier to road test than before: https://github.com/lycheeverse/lychee-action

btylerburton commented 3 months ago

This looks promising: https://github.com/marketplace/actions/check-links-with-linkcheck