alan-turing-institute / REG-handbook

A way of working guide for the Research Engineering Group at The Alan Turing Institute
https://alan-turing-institute.github.io/REG-handbook/
Other
6 stars 0 forks source link

Migrate `html-proofer` -> `lychee`, only check internal links #108

Open yongrenjie opened 1 year ago

yongrenjie commented 1 year ago

Problem

Checking the validity of external links is fragile, and many of the external links require authentication anyway, meaning that it is not possible to meaningfully check their validity.

lychee also seems like an improvement over html-proofer, which we currently use in CI.

Proposed steps to solve

  1. Replace html-proofer with lychee in CI. (There is an action available here)
  2. Configure it to only check internal links
  3. Set up a pre-commit hook so that it matches CI. This needs some extra work as lychee doesn't provide one out of the box. I think the most sensible approach would actually be to make a PR to lychee to include one, as the developers seem quite open to this sort of enhancement.
    • pre-commit says that it will bootstrap rust if it's not already installed, so people editing the handbook should not need to separately install rust. (a plus!)
    • Note that html-proofer doesn't have a pre-commit hook either, so we are not losing anything by switching.

@JimMadge I think I have some extra time these couple of weeks, so could start looking into it. Edits etc welcome.

JimMadge commented 1 year ago

If this can work it would be good, more alignment between local tests and CI.

Somethings I'm not sure about,

yongrenjie commented 1 year ago

Do we need to do this at all? If we always use the ref and relref short codes I think Hugo should raise an error if the link is broken

🤨 That is true.

It doesn't check for missing anchors (e.g. {{< ref "/content/docs/contributing/contributing_changes.md#doesnt-exist" >}} is still ok), but I don't know if any tool can do that really.

In which case, the task reduces to making sure that all internal links are specified using either ref or relref. (Do you know if one of these is better than the other?)

In any case, that should be easy enough to do as it'd just be a simple bash script grepping the markdown source code for https://alan-turing-institute.github.io/REG-handbook/*.

JimMadge commented 1 year ago

It doesn't check for missing anchors (e.g. {{< ref "/content/docs/contributing/contributing_changes.md#doesnt-exist" >}} is still ok), but I don't know if any tool can do that really.

Oh, that is interesting. I assumed that would be checked. Not sure if it can be checked when it is translated to html.

In which case, the task reduces to making sure that all internal links are specified using either ref or relref. (Do you know if one of these is better than the other?)

In any case, that should be easy enough to do as it'd just be a simple bash script grepping the markdown source code for https://alan-turing-institute.github.io/REG-handbook/*.

That would be good. Might need to check for Markdown style links [something](../../index.md) too? A little more effort on contributors to save potential headaches. Especially if it can be integrated into pre-commit with some friendly output.

If you do something like that, it would also be appreciated in The Turing Way.

yongrenjie commented 1 year ago

In principle you should be able to check anchors. (for example, I think Sphinx does this)

There are a couple of ways I guess — one would be checking the generated HTML, to ensure that a HTML element on the given page with the given id exists. However, making this robust might be a bit difficult, as just grepping for id="the-anchor-name" might lead to false positives. To do it properly would require parsing the HTML.

I suspect it might be easier to check anchors on the Markdown itself. This would rely on knowledge of which HTML anchors are generated from Markdown (and how), i.e. ## A heading is translated into <h2 id="a-heading">A heading</h2>. The downside is that such a tool would only work with a particular build system, but I think it'd be far easier to write, since Markdown syntax is much more restricted.