Open anarcat opened 4 years ago
Thanks @anarcat, would you like to open a PR to fix the links?
i'm kind of busy right now, and this was more a meta-issue than about those two specific links... i would suggest creating a CI step to check the links on push, so that this doesn't occur again. while I could fix those links with a rather small PR, this will happen again unless such a process is set up, and that part i'm not familiar enough with to fix.
@rhatdan PR in #175 but the broader issue will need more work.
@TomSweeneyRedHat asked in #175 which tools can be used to automate such checks... since this is a static website, what you want is a link checker. i happen to have inherited the maintenance of such a tool, called exactly that, linkchecker. it's kind of clunky and old, but it generally works. by default, it spiders the whole site but doesn't check external URLs, so it doesn't find the broken links I reported in #175 (because they are external).
anarcat@curie:~(master)$ LANG=C.UTF-8 linkchecker https://podman.io/
INFO linkcheck.cmdline 2020-01-06 15:49:01,728 MainThread Checking intern URLs only; use --check-extern to check extern URLs.
LinkChecker 9.4.0 Copyright (C) 2000-2014 Bastian Kleineidam
LinkChecker comes with ABSOLUTELY NO WARRANTY!
This is free software, and you are welcome to redistribute it
under certain conditions. Look at the file `LICENSE' within this
distribution.
Get the newest version at http://wummel.github.io/linkchecker/
Write comments and bugs to https://github.com/wummel/linkchecker/issues
Support this project at http://wummel.github.io/linkchecker/donations.html
Start checking at 2020-01-06 15:49:01-004
10 threads active, 40 links queued, 50 links in 100 URLs checked, runtime 1 seconds
10 threads active, 102 links queued, 125 links in 242 URLs checked, runtime 6 seconds
10 threads active, 91 links queued, 137 links in 243 URLs checked, runtime 11 seconds
10 threads active, 74 links queued, 172 links in 261 URLs checked, runtime 16 seconds
10 threads active, 63 links queued, 214 links in 292 URLs checked, runtime 21 seconds
10 threads active, 47 links queued, 230 links in 292 URLs checked, runtime 26 seconds
10 threads active, 34 links queued, 243 links in 292 URLs checked, runtime 31 seconds
10 threads active, 21 links queued, 288 links in 324 URLs checked, runtime 36 seconds
10 threads active, 8 links queued, 318 links in 341 URLs checked, runtime 41 seconds
3 threads active, 0 links queued, 359 links in 367 URLs checked, runtime 46 seconds
Statistics:
Downloaded: 963.18KB.
Content types: 5 image, 121 text, 0 video, 0 audio, 5 application, 2 mail and 229 other.
URL lengths: min=15, max=215, avg=52.
That's it. 362 links in 367 URLs checked. 0 warnings found. 0 errors found.
Stopped checking at 2020-01-06 15:49:48-004 (47 seconds)
anarcat@curie:~(master)$
it's possible to tell linkchecker to crawl external links, but then it becomes a web crawler and can potentially crawl the entire universe.
the way I use it for my site is that I run this, for every $URL modified:
linkchecker --check-extern --no-robots --recursion-level 1 --quiet --no-status $URL
so, for example, in the case of the affected page:
anarcat@curie:~(master)$ LANG=C.UTF-8 linkchecker --check-extern --no-robots --recursion-level 1 https://podman.io/whatis.html
LinkChecker 9.4.0 Copyright (C) 2000-2014 Bastian Kleineidam
LinkChecker comes with ABSOLUTELY NO WARRANTY!
This is free software, and you are welcome to redistribute it
under certain conditions. Look at the file `LICENSE' within this
distribution.
Get the newest version at http://wummel.github.io/linkchecker/
Write comments and bugs to https://github.com/wummel/linkchecker/issues
Support this project at http://wummel.github.io/linkchecker/donations.html
Start checking at 2020-01-06 15:50:33-004
4 threads active, 0 links queued, 4 links in 8 URLs checked, runtime 1 seconds
URL `https://github.com/containers/libpod/blob/master/docs/podman-generate-kube.1.md'
Name `podman-generate-kube'
Parent URL https://podman.io/whatis.html, line 56, col 3
Real URL https://github.com/containers/libpod/blob/master/docs/podman-generate-kube.1.md
Check time 1.398 seconds
Result Error: 404 Not Found
URL `https://github.com/containers/libpod/blob/master/docs/podman-play-kube.1.md'
Name `podman-play-kube'
Parent URL https://podman.io/whatis.html, line 54, col 3
Real URL https://github.com/containers/libpod/blob/master/docs/podman-play-kube.1.md
Check time 1.858 seconds
Result Error: 404 Not Found
Statistics:
Downloaded: 3KB.
Content types: 2 image, 6 text, 0 video, 0 audio, 0 application, 0 mail and 0 other.
URL lengths: min=29, max=79, avg=44.
That's it. 8 links in 9 URLs checked. 0 warnings found. 2 errors found.
Stopped checking at 2020-01-06 15:50:35-004 (2 seconds)
anarcat@curie:~(master)$
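the per-URL invocation above could be wired into a repository with a loop along these lines (a sketch only: the HEAD~1 diff range and the source-file-to-URL mapping are assumptions about this repository's layout, not a tested setup):

```shell
#!/bin/bash
# Sketch: check external links only on pages touched by the last commit.
# Assumes jekyll-style sources, where e.g. whatis.md is published as
# https://podman.io/whatis.html (this mapping is an assumption).
set -e
for f in $(git diff --name-only HEAD~1 -- '*.md' '*.html'); do
    page="${f%.*}.html"    # map source file to its published page
    linkchecker --check-extern --no-robots --recursion-level 1 \
        --quiet --no-status "https://podman.io/$page"
done
```

with --recursion-level 1, only the modified pages themselves are fetched and their outgoing links checked, so the crawl stays bounded.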
the w3c also maintains their own crawler, called w3c-linkchecker, although I have less experience with it. i started using linkchecker because:
- w3c-linkchecker respects robots.txt, and I wanted to bypass that: even if I'm a bot, I should be able to check if a resource exists at all
- w3c-linkchecker was very unlikely to accept a patch to change that, for obvious reasons
anyways, long story short: use a linkchecker, any linkchecker. :)
Another solution would be to use a tool like textlint for markdown files. This can also cover many other use cases like line length, trailing whitespace, wording, spell checks, etc.
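for reference, textlint is configured through a .textlintrc file; a minimal sketch for the link-checking case might look like this (assuming the separately-installed textlint-rule-no-dead-link plugin):

```json
{
  "rules": {
    "no-dead-link": true
  }
}
```

each rule is an npm package installed next to textlint itself, so the dead-link check and any style rules can be enabled independently.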
If you want to stick with ruby (because of jekyll), there is a tool called htmlproofer, which can be used right after the jekyll build to check the generated html for validity.
I don't think we are against any of these tools. If we get contributors who want to add PRs to verify the content, then we would definitely consider it.
@anarcat @daniel-wtd Did you guys ever work on this?
Nope, I have started to work on another issue. If you want, I can have a look at some basic testing afterwards.
Well any help you can give is appreciated. I don't have priority of one over the other.
Phew, that's a toooon of links. Is there currently any automation process to run checks on pull requests, like travis-ci or similar?
@cevich @edsantiago are either of you aware of automatic link checks we could use per @daniel-wtd question above?
the link check can be provided by me. My interest is more like:
"what kind of automation options do we have for checks, based on pull requests" ;)
There is a ton of stuff out there like travis-ci, circle-ci, cirrus, etc. Depending on your preferences, we can use one of them, or I can provide some simple tests to be run manually.
> There is a ton of stuff out there like travis-ci, circle-ci, cirrus, etc.
My preference would be to use Cirrus-CI since it's already in such wide-spread use. Running tasks in containers doesn't require any special setup. In fact, before I go on PTO, I'll add this repo to the GitHub permissions list...
...it's done. All you need is a .cirrus.yml file.
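for completeness, a minimal .cirrus.yml could look something like this (a sketch only: the container image, script names, and html-proofer invocation are assumptions, not a tested config):

```yaml
# Hypothetical Cirrus CI task: build the site in a container, then check links.
task:
  name: linkcheck
  container:
    image: jekyll/jekyll:pages
  build_script: jekyll build
  check_script: gem install html-proofer && htmlproofer ./_site
```

Cirrus runs each `*_script` step inside the declared container, so no extra runner setup should be needed.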
@cevich thank you a ton :) I will start this weekend with some initial markdown / link checks.
@cevich @daniel-wtd AWESOME! Both of you get a Gold Star today! :1st_place_medal:
Just a short note for me/for anybody who may be interested. I am working on implementing some basic checks as described here:
This code worked well for me:
~/code/podman.io#master$ cat test_site_links.sh
#!/bin/bash
set -ex
# build the site with the same jekyll image GitHub Pages uses
docker run --rm \
    --volume="$PWD:/srv/jekyll" \
    -it jekyll/jekyll:pages \
    jekyll build
# check the generated html with html-proofer
export HTML_PROOFER_VERSION=3.18.5
docker run --rm \
    --volume="$PWD/_site:/srv/podman.io" \
    "parkr/html-proofer:$HTML_PROOFER_VERSION" \
    /srv/podman.io
Running on macOS.
This could be fairly easily converted into a GitHub Actions workflow if that is an acceptable platform to use for the Containers org.
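as a rough illustration, the script above could translate into a workflow along these lines (a sketch only: action versions, job layout, and file path are assumptions; note the interactive -it flag is dropped since CI has no TTY):

```yaml
# Hypothetical .github/workflows/linkcheck.yml mirroring test_site_links.sh
name: linkcheck
on: [push, pull_request]
jobs:
  htmlproofer:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Build the site
        run: docker run --rm --volume="$PWD:/srv/jekyll" jekyll/jekyll:pages jekyll build
      - name: Check the generated HTML
        run: docker run --rm --volume="$PWD/_site:/srv/podman.io" parkr/html-proofer:3.18.5 /srv/podman.io
```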
Oh, I totally forgot about this in the meantime. Thanks for the reminder. I will plan it for January.
I am open to ideas on how to keep these blogs working. So whatever the community decides is fine with me.
it seems dead links are not checked when changes are pushed to this repository, or at least there isn't a job doing that regularly, because I can easily find some. ;) I don't remember the ones I found the last time (and unfortunately did not report), but today I found two in the whatis page:
There might be other dead links on the site worth fixing.