Crawl vs List - Githubissues

Thanks for your interest, @mgifford. I don't think there's a manual to read, so you're forgiven. 😄

Each website builds its own list of pages to crawl using a buildConfig file, located in the sites folder. Here's the one for SaultSteMarie.ca.

import { writeConfig } from "../../utils.js";

(async () => {
  await writeConfig(
    // First list: pages that may not appear when crawling the website
    [
      "https://saultstemarie.ca/",
      "https://saultstemarie.ca/Search.aspx?searchtext=parks",
      "https://saultstemarie.ca/webapps/meetingMinutes.asp?type=council",
      "https://saultstemarie.ca/webapps/corporateCalendar.asp?e=true",
      "https://saultstemarie.ca/webapps/parabusCalendar.asp",
      "https://saultstemarie.ca/webapps/parksAndPlaygrounds.asp"
    ],
    // Second list: pages that should be crawled
    [
      "https://saultstemarie.ca/"
    ],
    "saultstemarie"
  );
})();

So there are two lists of URLs. The first list contains pages that may not appear when crawling the website. The second list contains pages that should be crawled.

The depth of the crawl is defined in the global config file.

In the end, the list of crawled URLs is combined with the list in the build file, and a random selection of URLs is picked from the result, up to the limit set in the config file. This ensures that the GitHub Action can complete before the time limit. For example, there are hundreds of pages on the City website, and it takes too long to scan them all.
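
To make the selection step concrete, here is a minimal sketch of the idea. It is not the code from the repo: pickUrls and its parameters are made-up names, the example.com URLs are placeholders, and dropping duplicates before sampling is an assumption.

```javascript
// Hypothetical helper; names are illustrative and not taken from lighthouse-scans.
function pickUrls(listedUrls, crawledUrls, limit, random = Math.random) {
  // Combine both sources and drop duplicates (assumed behaviour).
  const pool = [...new Set([...listedUrls, ...crawledUrls])];

  // Fisher-Yates shuffle so every URL has the same chance of being picked.
  for (let i = pool.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }

  // Keep at most `limit` URLs so the scan stays within the time budget.
  return pool.slice(0, limit);
}

const sample = pickUrls(
  ["https://example.com/", "https://example.com/search"],
  ["https://example.com/", "https://example.com/about", "https://example.com/contact"],
  3
);
console.log(sample.length); // 3 (four unique URLs, capped at the limit)
```

With a limit larger than the combined list, every unique URL would simply be kept.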

So to scan SaultSteMarie.ca, after installing the project, you run two scripts.

npm run build:website:saultstemarie
npm run test:website:saultstemarie

The build script uses the config files to build a fresh lighthouserc.json file. The test script runs Lighthouse tests on that lighthouserc.json file using lighthouse-ci.
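
For a rough picture of what the generated file could contain, here is a sketch under some assumptions. The ci.collect.url and ci.collect.numberOfRuns settings are standard Lighthouse CI options, but buildLighthouseConfig is a made-up name, the example.com URLs are placeholders, and the real build script may well set other options too.

```javascript
// Hypothetical helper; builds the object that would be saved as lighthouserc.json.
function buildLighthouseConfig(urls, numberOfRuns = 1) {
  return {
    ci: {
      collect: {
        // The pages Lighthouse CI should test.
        url: urls,
        // How many times each page is tested (1 here is an assumption).
        numberOfRuns
      }
    }
  };
}

const config = buildLighthouseConfig([
  "https://example.com/",
  "https://example.com/about"
]);

// The build step would write this string to lighthouserc.json.
console.log(JSON.stringify(config, null, 2));
```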

A GitHub Action runs daily for each website. It runs the build script, then the test script.

If any page in the lighthouserc.json file doesn't meet the thresholds, the GitHub Action is marked as failed. I use badges from shields.io to show the results of the last run.
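
The pass/fail rule can be illustrated with a small sketch. This only shows the idea and does not reproduce how lighthouse-ci checks results: findFailures, the data shapes, the example.com URLs, and the 0.9 minimum scores are all assumptions.

```javascript
// Hypothetical check; data shapes and threshold values are illustrative only.
function findFailures(pageScores, minScores) {
  const failures = [];
  for (const [url, scores] of Object.entries(pageScores)) {
    for (const [category, minScore] of Object.entries(minScores)) {
      // A missing score also counts as a failure.
      if (!(scores[category] >= minScore)) {
        failures.push({ url, category, score: scores[category], minScore });
      }
    }
  }
  return failures;
}

const failures = findFailures(
  {
    "https://example.com/": { accessibility: 0.95, "best-practices": 0.92 },
    "https://example.com/about": { accessibility: 0.81, "best-practices": 0.93 }
  },
  { accessibility: 0.9, "best-practices": 0.9 }
);

// One page below one threshold is enough to fail the whole run.
const runFailed = failures.length > 0;
console.log(runFailed, failures.length); // true 1
```

In a GitHub Action, a step is marked as failed when its command exits with a non-zero code, so a check like this would end with a non-zero exit whenever runFailed is true.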

Does all that make sense?

cityssm / lighthouse-scans

Crawl vs List #240