This action checks all links on a website and detects broken ones, i.e. links that return an HTTP error code such as 403 or 404. Broken links cause this action to fail; check the logs to see which links are broken. In addition to the broken URL itself, the navigation path to that location is also displayed.
Based on this work: https://github.com/healeycodes/Broken-Link-Crawler
Demo: https://github.com/ScholliYT/devportfolio/actions?query=workflow%3A%22Site+Reliability%22
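As a quick sketch, a minimal workflow step only needs the required website_url input (the URL below is a placeholder; complete workflow examples appear further down):

- name: Check website for broken links
  uses: ScholliYT/Broken-Links-Crawler-Action@v3
  with:
    website_url: 'https://example.com'

The available inputs are described below.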
website_url
Required The URL of the website to check. You can provide a comma separated list of URLs if you would like to crawl multiple websites.
include_url_prefix
Optional Comma separated list of URL prefixes to include. You may only want to crawl URLs that use "https://".
exclude_url_prefix
Optional Comma separated list of URL prefixes to exclude (default mailto:,tel:). Some sites do not respond properly to bots, and you might want to exclude those known sites to prevent a failed build.
include_url_suffix
Optional Comma separated list of URL suffixes to include. You may only want to crawl URLs that end with ".html".
exclude_url_suffix
Optional Comma separated list of URL suffixes to exclude. You may want to skip images by using ".jpg,.gif,.png".
include_url_contained
Optional Comma separated list of URL substrings to include. You may only want to crawl URLs that are hosted on your primary domain, so you could use "mydomain.com", which would include all of your subdomains, regardless of the protocol you use.
exclude_url_contained
Optional Comma separated list of URL substrings to exclude. You may want to skip specific domains for external links that you do not want to fetch from, such as "knownadprovider.com".
web_agent_string
Optional The user agent string to use when crawling pages.
The default is:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36
verbose
Optional Set logging verbosity level: true/false/yes/no/on/off/debug/info/warning/error/critical (default false).
max_retry_time
Optional Maximum seconds for request retries (default 30).
max_retries
Optional Maximum request retry count (default 4).
max_depth
Optional Maximum levels deep to check, -1 = unlimited depth (default -1).
always_get_onsite
Optional By default, the crawler uses a HEAD request first, then follows up with a GET request if the response is HTML and the URL represents a page on the same site as the root URL. Setting this value to 'true' makes the crawler always use a GET request for any onsite URL. (default false)
connect_limit_per_host
Optional By default, the crawler will open a maximum of 10 connections per host. This can be useful when crawling a site that has rate limits. Setting this value to zero allows an unlimited number of connections per host, but this could inadvertently cause timeout errors if the target server gets overwhelmed with connections. (default 10)
search_attrs
Optional The names of HTML element attributes to extract links from. This can be useful if you are crawling a site that uses a library like lazyload to lazy-load images; in that case, set search_attrs to 'href,src,data-src'. (default 'href,src')
resolve_before_filtering
Optional By default, the crawler applies the include/exclude filtering criteria to links as they appear in the HTML source. For example, if a link is relative in the HTML source, the includes/excludes are applied to the link in its relative form. Setting this value to true makes the crawler fully resolve each link to its absolute form before applying the include/exclude filtering criteria. If you only want to crawl links that are prefixed with your site ('http://mysite.com/'), set resolve_before_filtering to 'true' and include_url_prefix to 'http://mysite.com/'. (default false)
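As an illustration (the values are placeholders taken from the descriptions above, not recommendations), several of the filtering inputs can be combined in a single step's with block:

with:
  website_url: 'https://mydomain.com'
  include_url_contained: 'mydomain.com'
  exclude_url_suffix: '.jpg,.gif,.png'
  search_attrs: 'href,src,data-src'

Example usage in a complete workflow: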
name: Check for broken links
on:
  workflow_dispatch: # allow manual trigger
  push:
    branches:
      - main
  schedule:
    # run daily at 4 am
    # * is a special character in YAML so you have to quote this string
    - cron: '0 4 * * *'

jobs:
  crawl_for_broken_links:
    runs-on: ubuntu-latest
    name: Broken-Links-Crawler
    steps:
    - name: Checking links
      uses: ScholliYT/Broken-Links-Crawler-Action@v3
      with:
        website_url: 'https://github.com/ScholliYT/Broken-Links-Crawler-Action'
        exclude_url_prefix: 'mailto:,https://www.linkedin.com,https://linkedin.com'
        verbose: 'true'
        max_retry_time: 30
        max_retries: 5
        max_depth: 1
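The next example restricts the crawl to links on the action's own repository by combining include_url_prefix with resolve_before_filtering: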
# jobs, steps... omitted in this example
      uses: ScholliYT/Broken-Links-Crawler-Action@v3
      with:
        website_url: 'https://github.com/ScholliYT/Broken-Links-Crawler-Action'
        include_url_prefix: 'https://github.com/ScholliYT/Broken-Links-Crawler-Action'
        resolve_before_filtering: 'true'
        verbose: 'true'
        max_retry_time: 30
        max_retries: 5
        max_depth: 1
The easiest way to run this action locally is to use Docker. Just build a new image and pass the correct environment variables to it.
docker build --tag broken-links-crawler-action:latest .
docker run \
  -e INPUT_WEBSITE_URL="https://github.com/ScholliYT/Broken-Links-Crawler-Action" \
  -e INPUT_VERBOSE="true" \
  -e INPUT_MAX_RETRY_TIME=30 \
  -e INPUT_MAX_RETRIES=5 \
  -e INPUT_MAX_DEPTH=1 \
  -e INPUT_INCLUDE_URL_CONTAINED='ScholliYT/Broken-Links-Crawler-Action' \
  -e INPUT_EXCLUDE_URL_CONTAINED='/tree,/code_menu_contents,/hovercards/citation' \
  -e INPUT_EXCLUDE_URL_PREFIX="mailto:,https://www.linkedin.com,https://linkedin.com" \
  broken-links-crawler-action:latest
Several utilities are used for testing the code. We use Poetry (version 1.7.1) to manage dependencies. To install all the required dependencies, use the following commands:
poetry env use python3.11 # in case your system python differs from 3.8, 3.9, 3.10 or 3.11
poetry install # use '--without dev' to install only required dependencies
To execute quality checks automatically during a commit, make sure that the git hook script is set up by running the following command:
poetry run pre-commit install
You can run the full suite of automated tests on your local machine by using the "act" tool to simulate the GitHub workflow that executes during commit. More information here:
https://github.com/nektos/act
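For example (a sketch only; this assumes act and Docker are installed, and the available jobs depend on the repository's workflow files), you could run:

act push          # simulate the workflows triggered by a push event
act -j <job_id>   # or run a single job by its id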
You can run linting and code formatting checks using the flake8 command:
poetry run flake8 . --count --show-source --max-complexity=10 --statistics
You can run type linting using the mypy command:
poetry run mypy
To run the unit tests, use the pytest command like this:
poetry run pytest -m "not integrationtest" --cov=deadseeker --cov-fail-under=95 --cov-branch --cov-report=term-missing
To generate an HTML report that visually displays uncovered lines, use this version:
poetry run pytest -m "not integrationtest" --cov=deadseeker --cov-fail-under=95 --cov-branch --cov-report=term-missing --cov-report=html
To run the integration tests, use the pytest command like this:
poetry run pytest -m "integrationtest"
Mutation testing changes the real code (creating a 'mutant') and runs all the tests to make sure that at least one test fails. This ensures that your tests are actually effective at testing the code, or it can also reveal unnecessary implementation code that should be refactored. To run mutation testing, use this command:
poetry run mutmut run
For details on missed mutants, generate an HTML report of the missed mutants with this command:
poetry run mutmut html