Some thoughts before I implement this:
I was hoping to simply duplicate what we do in closed source, which starts up a server, uses a custom web crawler to find all links, and then checks if they're valid.
This won't work well because starting up a server is non-trivial in this repository, since the infrastructure does not live here. Instead, we start up the server via Docker, which can be intimidating for docs contributors and also can be very slow. We want to keep the check easy and fairly fast to run locally.
Instead of starting a web server and crawling it, statically analyze the MDX and Jupyter notebook files to extract out their links. For internal links, simply check that the file exists. For external links, make a request.
Our Jupyter notebooks use Markdown syntax, so we can use the same Markdown parser between MDX and Jupyter. We'd use a Markdown parsing library for this.
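As a rough illustration of how the notebook side could feed into the same parser, here's a sketch (the helper name and the idea of concatenating the notebook's markdown cells are my assumptions, not necessarily what the real script does):

```ts
import { readFile } from "fs/promises";

// Hypothetical helper: pull the Markdown out of a .ipynb so the same
// Markdown tooling can be applied to both MDX files and notebooks.
async function extractNotebookMarkdown(notebookPath: string): Promise<string> {
  const raw = await readFile(notebookPath, "utf-8");
  const notebook = JSON.parse(raw);
  return notebook.cells
    .filter((cell: { cell_type: string }) => cell.cell_type === "markdown")
    .map((cell: { source: string | string[] }) =>
      Array.isArray(cell.source) ? cell.source.join("") : cell.source,
    )
    .join("\n\n");
}
```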
For internal links, simply checking that the file exists is advantageous because it's much faster and avoids the flakiness of network connections.
I wanted to use https://github.com/tcort/markdown-link-check, a popular tool for checking links in Markdown, but it has two issues that make it a poor fit.
So, instead, we'd use a custom script, similar to what we do in closed source, but with a focus on maintainability and on reusing libraries for things like parsing Markdown.
We want to make sure that anchor links work correctly to take you to the right part of the page. That is a big benefit of Sphinx and was one of our concerns when migrating from Sphinx to MDX.
You can't determine that an anchor is valid via a normal HTTP request. But that's okay, because we only prioritize checking that internal links are using anchors correctly.
So, we will first statically map every file to its anchors, and then check that mapping for internal links.
I talked it over with the original author of the link checker in closed source. We think the static analysis approach is reasonable given the context of this open source project. We'll still have the full link checker from closed source.
I recommend we split the implementation up into stages:
The priority is internal links. It's fine to do external links in a later follow-up.
Frank is implementing this first part of internal link validation in https://github.com/Qiskit/documentation/pull/173. That leaves the two follow-up improvements:
Some thoughts on this.
Checking external links can be slow since we have to make an HTTP request. Also, the majority of our links are internal, so it's lower priority to check external links every time.
We don't want to check external links by default because it's important that `npm run check` is fast when run locally, so that our content writers are encouraged to run it frequently. Instead, we should have a command line argument `--external` to say to also check external links, like `npm run check:links -- --external`. We'll want to enable that in CI. We use `yargs` for CLI args - look in the file for examples.
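A minimal sketch of what that flag could look like with `yargs` (the option name comes from the comment above; everything else, including where this lives, is an assumption rather than how the script is actually wired up):

```ts
import yargs from "yargs/yargs";
import { hideBin } from "yargs/helpers";

const args = yargs(hideBin(process.argv))
  .option("external", {
    type: "boolean",
    default: false,
    description: "Also check external links (slower; intended for CI)",
  })
  .parseSync();

if (args.external) {
  // Only make HTTP requests for external links when the flag is passed.
}
```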
For the actual checking, you can use `fetch`. It's important that you de-duplicate all the URLs beforehand to reduce the number of requests. Here's a link checker that I've written before: https://github.com/ParkingReformNetwork/reform-map/blob/f6823ea05c65ff54e6f60153a4581448001c21e7/scripts/brokenLinks.js. Note that you have to set a user agent in the header, and that I used a for loop intentionally to reduce the risk of rate limiting.
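Here's a rough sketch of what that external check could look like (assumes Node 18+ so that `fetch` is global; the function name and failure reporting are illustrative, not how the closed-source checker does it):

```ts
// Sketch: check a list of external URLs and return the ones that failed.
// De-duplicate up front to reduce the number of requests, and loop
// sequentially rather than in parallel to reduce the risk of rate limiting.
async function checkExternalLinks(urls: string[]): Promise<string[]> {
  const failures: string[] = [];
  for (const url of new Set(urls)) {
    try {
      const response = await fetch(url, {
        headers: {
          // Some sites reject requests without a browser-like user agent.
          "User-Agent":
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        },
      });
      if (!response.ok) {
        failures.push(`${url} (${response.status})`);
      }
    } catch {
      failures.push(`${url} (request failed)`);
    }
  }
  return failures;
}
```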
https://github.com/Qiskit/documentation/pull/173 already has a set of all the valid files. We want to improve this mapping to include anchor information. Note that anchors are only for `docs/`, not for `public/images`.
I'm not sure of the best data structure to store this. It could be a `Set<string>` where each entry is either a full anchor path like `my_file#anchor` or `my_image.png`, or something more complex and nested, like keeping track of `my_file` having 5 anchors. I imagine the former is simpler.
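For example, with the flat `Set<string>` option, validation could look roughly like this (the entries and function name are illustrative):

```ts
// Every valid internal destination: a file, a file plus anchor, or an image.
// Anchors would only be recorded for files under docs/, not public/images.
const validDestinations = new Set<string>([
  "my_file",
  "my_file#anchor",
  "my_image.png",
]);

function isValidInternalLink(link: string): boolean {
  return validDestinations.has(link);
}
```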
The hard part here is needing to extract out the valid anchors from our MDX and Jupyter files. I believe the packages we have in `package.json` dependencies can help with this, like `rehype` or `remark`, but I haven't used them closely.
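In case it helps, here's one possible shape for that extraction using `unified`/`remark-parse` plus `github-slugger`. This is only a sketch: these packages may or may not already be in our dependencies, and the slugging here may not exactly match how our renderer generates anchors.

```ts
import { unified } from "unified";
import remarkParse from "remark-parse";
import { visit } from "unist-util-visit";
import { toString } from "mdast-util-to-string";
import GithubSlugger from "github-slugger";

// Turn every Markdown heading into an anchor slug,
// e.g. "## Getting started" -> "getting-started".
function extractAnchors(markdown: string): string[] {
  const tree = unified().use(remarkParse).parse(markdown);
  const slugger = new GithubSlugger();
  const anchors: string[] = [];
  visit(tree, "heading", (node) => {
    anchors.push(slugger.slug(toString(node)));
  });
  return anchors;
}
```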
If it isn't too hard, it would be really neat to add "Did you mean?" functionality. In the past, I've seen people use this algorithm: https://en.wikipedia.org/wiki/Levenshtein_distance.
Set a time box for this, along with a code complexity limit. This feature is "nice to have" but not essential. It's not worth spending a ton of time or adding extremely complex code to add it.
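If we do take it on, the distance itself is a small dynamic-programming routine, and the suggestion is just "closest valid destination". A sketch (names are illustrative):

```ts
// Classic dynamic-programming Levenshtein (edit) distance between two strings.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0,
    ),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// "Did you mean?": suggest the known destination closest to the broken link.
function suggest(badLink: string, validDestinations: string[]): string | undefined {
  return [...validDestinations].sort(
    (x, y) => levenshtein(badLink, x) - levenshtein(badLink, y),
  )[0];
}
```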
> The hard part here is needing to extract out the valid anchors from our MDX and Jupyter files.
Actually, `markdown-link-extractor` also extracts all the anchors from the markdown files, so you should be able to do `markdownLinkExtractor(source).anchors`.
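For example, collecting a file's anchors could then be as simple as this sketch (it assumes the version of `markdown-link-extractor` whose return value exposes `anchors`, as described above; older versions return only an array of links):

```ts
import { readFile } from "fs/promises";
import markdownLinkExtractor from "markdown-link-extractor";

// Sketch: collect the anchors defined in one Markdown/MDX source file.
async function anchorsForFile(path: string): Promise<string[]> {
  const source = await readFile(path, "utf-8");
  return markdownLinkExtractor(source).anchors;
}
```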
Maybe we should replace the list of filepath strings with a list of objects:

```ts
interface File {
  path: string;
  links: Link[];
  anchors: string[];
}
```
Then you could do something like:

```ts
const linkedFile = files.find((file) => file.path === link.value);
if (linkedFile?.anchors.includes(link.anchor)) {
  // ...
}
```
I split this out into the more granular https://github.com/Qiskit/documentation/issues/305 and https://github.com/Qiskit/documentation/issues/306.
The core link checker has been added and works great thanks to @arnaucasau and @frankharkins. Thanks!
We should make sure that internal links are valid for the state of the docs in the PR, i.e. HEAD. That is, if we reorganize HTML pages, that's fine as long as we update the right links.
External links should be valid no matter what.