Replace Broken Link Checker

btylerburton commented 1 year ago

User Story

In order to ensure the quality of our sites, the Datagov team would like a reliable report on broken links.

Acceptance Criteria

[ ] GIVEN a list of pages
[ ] (optional) GIVEN a sitemap
[ ] (optional) GIVEN a list of sitemaps (ex. catalog)
WHEN I run a scan
THEN a list of dead URLs is reported
[ ] (optional) THEN a new issue is created in the datagov repo
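
The acceptance criteria above can be read as a small pipeline: take page URLs from a sitemap, check each one, and report the dead ones. Below is a minimal Python sketch of that flow, not a proposed implementation. The function names (`urls_from_sitemap`, `find_dead_urls`) are hypothetical, "dead" is assumed to mean a failed request or a status of 400 or higher, and the status lookup is passed in as a callable so the logic can be tried without network access.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterable, Optional

# Namespace used by the sitemaps.org protocol for <urlset> and <sitemapindex>.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> value found in a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def find_dead_urls(
    urls: Iterable[str],
    fetch_status: Callable[[str], Optional[int]],
) -> list[tuple[str, Optional[int]]]:
    """Return (url, status) pairs for URLs considered dead.

    Assumption: a URL is dead when the request failed (status is None)
    or the status code is 400 or higher.
    """
    dead = []
    for url in urls:
        status = fetch_status(url)
        if status is None or status >= 400:
            dead.append((url, status))
    return dead
```

In a real run, `fetch_status` would wrap an HTTP client. For a sitemap index (the "list of sitemaps" case), the same `urls_from_sitemap` call returns the child sitemap URLs, which would then need to be fetched and parsed in turn.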

Background

The Datagov team currently uses a broken link checker for our static sites, but it is unreliable and consistently fails with false positives. Ideally, the new link checker should be configurable to ignore certain status codes or a list of pages, and should produce a report that can be "made green" in the near term, so that a failing report can then be used to fail the build. As it stands, the report always fails, and not for valid reasons, so no triggers can be configured around its status.
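
To make the "configurable to ignore certain status codes, or a list of pages" requirement concrete, here is a rough Python sketch of how ignore rules could decide whether a report fails the build. Everything in it is an assumption for illustration: the `IgnoreRules` class, the `build_should_fail` function, and the choice to treat any non-ignored status of 400 or higher as a failure are not taken from any existing tool.

```python
import re
from dataclasses import dataclass, field


@dataclass
class IgnoreRules:
    """Hypothetical configuration: which results should not fail the build."""

    status_codes: set[int] = field(default_factory=set)
    url_patterns: list[str] = field(default_factory=list)  # regular expressions

    def is_ignored(self, url: str, status: int) -> bool:
        """True when the status code or the URL matches an ignore rule."""
        if status in self.status_codes:
            return True
        return any(re.search(pattern, url) for pattern in self.url_patterns)


def build_should_fail(results: dict[str, int], rules: IgnoreRules) -> bool:
    """Return True when at least one non-ignored URL has a status of 400 or higher."""
    return any(
        status >= 400 and not rules.is_ignored(url, status)
        for url, status in results.items()
    )
```

With rules like `IgnoreRules(status_codes={403, 429}, url_patterns=[r"localhost"])`, a report containing only ignored results would pass, which is the "made green" state the build trigger depends on.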

Security Considerations (required)

Fixing old links will improve the quality of the site and the user experience, but will likely not address any security concerns related to any domains that have come into the possession of bad actors.

Sketch

[ ] Spike on the available options for link checkers
- There are a number of pages like this: https://medevel.com/os-broken-link-checkers-to-improve-your-seo/
[ ] Test reliability, configurability, activity of the repo
[ ] Implement link checker in Static Site QA Template
[ ] Implement dependency / config upgrades in static sites:
- [ ] https://github.com/GSA/datagov-11ty
- [ ] https://github.com/GSA/resources.data.gov
- [ ] https://github.com/GSA/data-strategy
- [ ] https://github.com/GSA/us-data-federation
- [ ] (optionally if supported) https://github.com/GSA/catalog.data.gov

Also related:

[ ] (optionally) Create an issue when broken links are reported https://github.com/GSA/data.gov/issues/2922

hkdctol commented 1 year ago

@btylerburton so might this one also be addressed by #4476?

btylerburton commented 1 year ago

Yes ideally @hkdctol

rshewitt commented 11 months ago

added new relic link crawler here

rshewitt commented 11 months ago

clicking a point in a location graph navigates to the list of links tested. the set of links tested differs between htmlproofer and new relic. htmlproofer may be traversing more than we need?

rshewitt commented 11 months ago

notes on new relic link checker:

- so far it seems like there's no way to filter/ignore status codes
  - apparently a variety of status codes are monitored and should be visible in the resources page source & context
- the link checker appears to focus entirely on anchor elements
  - images aren't checked
  - scripts aren't checked
- the least frequent check interval is 1 day and the most frequent is 5 minutes. currently, our link checker generally runs every 1-2 weeks.

htmlproofer currently checks:

- links
- images
- scripts
- html validation errors

update: looks like the new relic link checker can identify a variety of types

btylerburton commented 11 months ago

Can the link checker alert us to 404's? Can it post to Slack?

I just checked resources.data.gov and it shows no 404's but I know that's not the case as there's a few I confirmed from this run...

https://github.com/GSA/resources.data.gov/actions/runs/7006861012/job/19059663585

ex.

rshewitt commented 11 months ago

I assume it would be able to alert us to 404's, but surprisingly none have occurred for resources.data.gov yet in any of the 6 locations in the monitor.
looks like there's a hook to slack
i've noticed a discrepancy between what htmlproofer and new relic check. so far neither of those links appears to be checked in new relic.

rshewitt commented 11 months ago

after upgrading htmlproofer from 3.x to 5.x to potentially address some issues, the resources site produces 284 failures. this includes checks on links, images, scripts, and html validation. this is a considerable number of failures, and switching to another utility (see the list of alternatives linked in the sketch) won't fix them. some examples of failures worth mentioning:

- localhost links
  - http://localhost:4000/resources/data-gov-open-data-howto/
- insecure connections (e.g. ERR_SSL_VERSION_OR_CIPHER_MISMATCH)
  - https://viewer.nationalmap.gov/advanced-viewer/
- anchor elements not containing a hyperlink reference (html validation error)
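
Two of these failure types (localhost links and anchors with no href) come from the site's own content, so they can be found in the built HTML without making any requests. A minimal Python sketch using only the standard library is below. The `LinkAudit` class and `audit_html` function are hypothetical names, and the sketch is not how htmlproofer itself implements these checks.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkAudit(HTMLParser):
    """Collects localhost links and href-less anchors from one HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.localhost_links: list[str] = []
        self.anchors_without_href = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:
            # <a> with no href attribute, or an href with no value
            self.anchors_without_href += 1
        elif urlparse(href).hostname in {"localhost", "127.0.0.1"}:
            self.localhost_links.append(href)


def audit_html(html_text: str) -> LinkAudit:
    """Parse one page and return the populated audit object."""
    audit = LinkAudit()
    audit.feed(html_text)
    audit.close()
    return audit
```

Running `audit_html` over each generated page would give a list of source-side fixes that stay the same regardless of which link checker is chosen.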

rshewitt commented 11 months ago

summary of failures using htmlproofer with the following flags: --ignore-status-codes "301,302,401,403,429" --checks='Links,Images,Scripts,Html' --no-check-external-hash --no-check-internal-hash --no-enforce-https

- datagov-11ty
  - 216 failures
- resources.data.gov
  - 284 failures
- data-strategy
  - 550 failures
- us-data-federation
  - 5 failures

rshewitt commented 11 months ago

pausing work on this until group discussion on how we want to proceed.

btylerburton commented 11 months ago

let's chat about this at sync. looks like you found some good flags to use. however, i do believe we should be tracking 4xx series as errors since that means they're not publicly accessible.

rshewitt commented 11 months ago

htmlproofer offers a --only-4xx flag

rshewitt commented 11 months ago

here are the errors for the 4 static sites. this is the raw data from the terminal, so let me know if it would be better for me to format it. I used these flags for the runs: --checks='Links,Images,Scripts,Html' --only-4xx --no-enforce-https --allow-missing-href --ignore-urls '/localhost./'. the error counts will differ slightly from what i reported before because i'm using different flags. I think the ones i've chosen this time make sense, but i'm okay with changing them to whatever we want.
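
For anyone wanting to repeat these runs across the four sites, the flag set can be kept in one place and turned into a command. The Python sketch below only assembles the argument list. The flags are copied as written in the comment above and have not been re-verified against a particular htmlproofer version; the `./_site` build directory and the `htmlproofer_command` helper name are assumptions.

```python
import shlex

# Flags quoted in the comment above. The "./_site" build directory used as the
# default below is an assumption and will differ per site generator.
HTMLPROOFER_FLAGS = [
    "--checks=Links,Images,Scripts,Html",
    "--only-4xx",
    "--no-enforce-https",
    "--allow-missing-href",
    "--ignore-urls", "/localhost./",
]


def htmlproofer_command(site_dir: str = "./_site") -> list[str]:
    """Return the argument vector for one htmlproofer run over a built site."""
    return ["htmlproofer", site_dir, *HTMLPROOFER_FLAGS]


if __name__ == "__main__":
    # Print a copy-pasteable shell command; the same list could instead be
    # handed to subprocess.run to execute the check.
    print(shlex.join(htmlproofer_command()))
```

Keeping the flags in a single list makes it easier to change them in one place once the group settles on which ones to use.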

david-waltermire commented 7 months ago

In my prior role at NIST, we had great success using lychee to check links against a generated version of the site in CI. This workflow builds the site and runs link checking on the generated sources. The workflow is set up to work with Hugo, but other static site generators can be easily configured.

I am considering setting up something like this for fedramp.gov and marketplace.fedramp.gov.

btylerburton commented 7 months ago

Thanks for the recommendation @david-waltermire! I also found that lychee has a GitHub Action as well, so it's even easier to road test than before: https://github.com/lycheeverse/lychee-action

btylerburton commented 4 months ago

This looks promising: https://github.com/marketplace/actions/check-links-with-linkcheck
