Some thoughts before I implement this:
I was hoping to simply duplicate what we do in closed source, which starts up a server, uses a custom web crawler to find all links, and then checks if they're valid.
This won't work well because starting up a server is non-trivial in this repository, since the infrastructure does not live here. Instead, we start up the server via Docker, which can be intimidating for docs contributors and also can be very slow. We want to keep the check easy and fairly fast to run locally.
Instead of starting a web server and crawling it, statically analyze the MDX and Jupyter notebook files to extract out their links. For internal links, simply check that the file exists. For external links, make a request.
Our Jupyter notebooks use Markdown syntax, so we can use the same Markdown parser between MDX and Jupyter. We'd use a Markdown parsing library for this.
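As a rough illustration of how the notebook side could feed into the same parser, here's a sketch (the helper name and the idea of concatenating the notebook's markdown cells are my assumptions, not necessarily what the real script does):

```ts
import { readFile } from "fs/promises";

// Hypothetical helper: pull the Markdown out of a .ipynb so the same
// Markdown tooling can be applied to both MDX files and notebooks.
async function extractNotebookMarkdown(notebookPath: string): Promise<string> {
  const raw = await readFile(notebookPath, "utf-8");
  const notebook = JSON.parse(raw);
  return notebook.cells
    .filter((cell: { cell_type: string }) => cell.cell_type === "markdown")
    .map((cell: { source: string | string[] }) =>
      Array.isArray(cell.source) ? cell.source.join("") : cell.source,
    )
    .join("\n\n");
}
```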
For internal links, simply checking that the file exists is advantageous because it's much faster and avoids the flakiness of network connections.
I wanted to use https://github.com/tcort/markdown-link-check, a popular tool for checking links in Markdown, but it has two issues that make it a poor fit.
So, instead, we'd use a custom script, similar to what we do in closed source, but with a focus on maintainability and on reusing libraries for things like parsing Markdown.
We want to make sure that anchor links work correctly to take you to the right part of the page. That is a big benefit of Sphinx and was one of our concerns when migrating from Sphinx to MDX.
You can't determine that an anchor is valid via a normal HTTP request. But that's okay, because we only prioritize checking that internal links are using anchors correctly.
So, we will first statically map every file to its anchors, and then check that mapping for internal links.
I talked it over with the original author of the link checker in closed source. We think the static analysis approach is reasonable given the context of this open source project. We'll still have the full link checker from closed source.
I recommend we split the implementation up into stages:
The priority is internal links. It's fine to do external links in a later follow-up.
Frank is implementing this first part of internal link validation in https://github.com/Qiskit/documentation/pull/173. That leaves the two follow-up improvements:
Some thoughts on this.
Checking external links can be slow since we have to make an HTTP request. Also, the majority of our links are internal, so it's lower priority to check external links every time.
We don't want to check external links by default because it's important that `npm run check` is fast when run locally, so that our content writers are encouraged to run it frequently. Instead, we should have a command line argument `--external` to say to also check external links, like `npm run check:links -- --external`. We'll want to enable that in CI. We use `yargs` for CLI args - look in the file for examples.
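A minimal sketch of what that flag could look like with `yargs` (the option name comes from the comment above; everything else, including where this lives, is an assumption rather than how the script is actually wired up):

```ts
import yargs from "yargs/yargs";
import { hideBin } from "yargs/helpers";

const args = yargs(hideBin(process.argv))
  .option("external", {
    type: "boolean",
    default: false,
    description: "Also check external links (slower; intended for CI)",
  })
  .parseSync();

if (args.external) {
  // Only make HTTP requests for external links when the flag is passed.
}
```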
For the actual checking, you can use `fetch`. It's important that you de-duplicate all the URLs beforehand to reduce the number of requests. Here's a link checker that I've written before: https://github.com/ParkingReformNetwork/reform-map/blob/f6823ea05c65ff54e6f60153a4581448001c21e7/scripts/brokenLinks.js. Note that you have to set a user agent in the header, and that I used a for loop intentionally to reduce the risk of rate limiting.
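Here's a rough sketch of what that external check could look like (assumes Node 18+ so that `fetch` is global; the function name and failure reporting are illustrative, not how the closed-source checker does it):

```ts
// Sketch: check a list of external URLs and return the ones that failed.
// De-duplicate up front to reduce the number of requests, and loop
// sequentially rather than in parallel to reduce the risk of rate limiting.
async function checkExternalLinks(urls: string[]): Promise<string[]> {
  const failures: string[] = [];
  for (const url of new Set(urls)) {
    try {
      const response = await fetch(url, {
        headers: {
          // Some sites reject requests without a browser-like user agent.
          "User-Agent":
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        },
      });
      if (!response.ok) {
        failures.push(`${url} (${response.status})`);
      }
    } catch {
      failures.push(`${url} (request failed)`);
    }
  }
  return failures;
}
```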
https://github.com/Qiskit/documentation/pull/173 already has a set of all the valid files. We want to improve this mapping to include anchor information. Note that anchors are only for `docs/`, not for `public/images`.
I'm not sure of the best data structure to store this. It could be a `Set<string>` where each entry is either a full anchor path like `my_file#anchor` or `my_image.png`, or something more complex and nested, like keeping track of `my_file` having 5 anchors. I imagine the former is simpler.
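For example, with the flat `Set<string>` option, validation could look roughly like this (the entries and function name are illustrative):

```ts
// Every valid internal destination: a file, a file plus anchor, or an image.
// Anchors would only be recorded for files under docs/, not public/images.
const validDestinations = new Set<string>([
  "my_file",
  "my_file#anchor",
  "my_image.png",
]);

function isValidInternalLink(link: string): boolean {
  return validDestinations.has(link);
}
```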
The hard part here is needing to extract out the valid anchors from our MDX and Jupyter files. I believe the packages we have in `package.json` dependencies can help with this, like `rehype` or `remark`, but I haven't used them closely.
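In case it helps, here's one possible shape for that extraction using `unified`/`remark-parse` plus `github-slugger`. This is only a sketch: these packages may or may not already be in our dependencies, and the slugging here may not exactly match how our renderer generates anchors.

```ts
import { unified } from "unified";
import remarkParse from "remark-parse";
import { visit } from "unist-util-visit";
import { toString } from "mdast-util-to-string";
import GithubSlugger from "github-slugger";

// Turn every Markdown heading into an anchor slug,
// e.g. "## Getting started" -> "getting-started".
function extractAnchors(markdown: string): string[] {
  const tree = unified().use(remarkParse).parse(markdown);
  const slugger = new GithubSlugger();
  const anchors: string[] = [];
  visit(tree, "heading", (node) => {
    anchors.push(slugger.slug(toString(node)));
  });
  return anchors;
}
```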
If it isn't too hard, it would be really neat to add "Did you mean?" functionality. In the past, I've seen people use this algorithm: https://en.wikipedia.org/wiki/Levenshtein_distance.
Set a time box for this, along with a code complexity limit. This feature is "nice to have" but not essential. It's not worth spending a ton of time or adding extremely complex code to add it.
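If we do take it on, the distance itself is a small dynamic-programming routine, and the suggestion is just "closest valid destination". A sketch (names are illustrative):

```ts
// Classic dynamic-programming Levenshtein (edit) distance between two strings.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0,
    ),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// "Did you mean?": suggest the known destination closest to the broken link.
function suggest(badLink: string, validDestinations: string[]): string | undefined {
  return [...validDestinations].sort(
    (x, y) => levenshtein(badLink, x) - levenshtein(badLink, y),
  )[0];
}
```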
> The hard part here is needing to extract out the valid anchors from our MDX and Jupyter files.
Actually, `markdown-link-extractor` also extracts all the anchors from the markdown files, so you should be able to do `markdownLinkExtractor(source).anchors`.
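For example, collecting a file's anchors could then be as simple as this sketch (it assumes the version of `markdown-link-extractor` whose return value exposes `anchors`, as described above; older versions return only an array of links):

```ts
import { readFile } from "fs/promises";
import markdownLinkExtractor from "markdown-link-extractor";

// Sketch: collect the anchors defined in one Markdown/MDX source file.
async function anchorsForFile(path: string): Promise<string[]> {
  const source = await readFile(path, "utf-8");
  return markdownLinkExtractor(source).anchors;
}
```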
Maybe we should replace the list of filepath strings with a list of objects:

```ts
interface File {
  path: string;
  links: Link[];
  anchors: string[];
}
```
Then you could do something like:

```ts
const linkedFile = files.find((file) => file.path === link.value);
if (linkedFile?.anchors.includes(link.anchor)) {
  // ...
}
```
I split this out into the more granular https://github.com/Qiskit/documentation/issues/305 and https://github.com/Qiskit/documentation/issues/306.
The core link checker has been added and works great thanks to @arnaucasau and @frankharkins. Thanks!
We should make sure that internal links are valid for the state of the docs in the PR, i.e. HEAD. That is, if we reorganize HTML pages, that's fine as long as we update the right links.
External links should be valid no matter what.