Linkchek has problems verifying manpage.ubuntu.com

dviererbe commented 2 months ago

The linkchecks for manpage.ubuntu.com urls recently started failing for the Ubuntu Packaging Guide (see an example log)despite working fine on manual inspection in firefox and chrome.

@s-makin also mentioned that she had a similar experience with the Ubuntu Pro documentation. Thats why I escalate this to a general canonical sphinx documentation issue.

I propose that we configure linkcheck to ignore manpage.ubuntu.com urls and run a script that greps all :manpage:value occurrences to check for working manpages using curl.

ru-fu commented 2 months ago

I agree that it makes sense to handle that in the starter pack.

I'm a bit doubtful though if we can do anything except ignoring manpage.ubuntu.com. I just ran the check locally, and it didn't report the link broken. So either this is a problem that doesn't occur consistently, or it's something that only occurs on GitHub and not locally - either way, why would curl work better?

dviererbe commented 2 months ago

either way, why would curl work better?

To be honest, I don't know. I just tested it and it worked.

ru-fu commented 2 months ago

To be honest, I don't know. I just tested it and it worked.

On GitHub? And consistently? I'm not opposed to this solution, but it would be good to at least have a suspicion on why one works and the other doesn't.

dviererbe commented 2 months ago

note: I found out why curl works. For some reason the 404 does not throw an error.

$ curl https://manpages.ubuntu.com/manpages/en/man1/diff.1.html
<script language="JavaScript">
  var q = location.href;
  if (q.search(/\/manpages\/.*\/man[0-9]\/.*[^0-9]\.html$/) >= 0) {
    // Current location matches a legacy link, with just a plain .html filename
    // Try to redirect to the new location, which has a .[0-9].html filename
    location.replace(q.replace(/\/man([0-9])\/(.*)\.html/, "/man$1/$2.$1.html"));
  } else {
    // Try redirecting to a page of search results
    q = q.replace(/.*\//, "");
    q = q.replace(/\.html$/, "");
    location.replace('/cgi-bin/search.py?cx=003883529982892832976%3A5zl6o8w6f0s&cof=FORID%3A9&ie=UTF-8&titles=404&lr=lang_en&q=' + q);
  }
</script>

<!doctype html>
<html lang="en" dir="ltr">
  <head>
    [...]
    <title>Ubuntu Manpage:

Not Found
    </title>
    [...]
  </head>
  <body>
    [...]
  </body>
</html>

dviererbe commented 2 months ago

The manpages are a client site redirect by JavaScript. The tests started failing, because we recently switched to

https://manpages.ubuntu.com/manpages/en/man{section}/{page}.{section}.html

from something like

https://manpages.ubuntu.com/manpages/{dist}/en/man{section}/{page}.{section}.html

ru-fu commented 2 months ago

That sounds like the clean solution would be to update the links?

dviererbe commented 2 months ago

That sounds like the clean solution would be to update the links?

suggestion: Yes, we could do something like this in custom_conf.py:

manpages_distribution = subprocess.check_output("distro-info --stable", 
                                                text=True, shell=True).strip()
manpages_url = ("https://manpages.ubuntu.com/manpages/"
               f"{manpages_distribution}/en/"
               "man{section}/{page}.{section}.html")

dviererbe commented 2 months ago

If there ever should be a need to parse the :manpage: directives. I re-wrote the bash script as a python script, before I found out why curl was working :'(

import os
import sys
import re
import requests

sys.path.append('./')
from custom_conf import *

class ManPage:
    def __init__(self, name: str, section: str) -> None:
        self.Name = name
        self.Section = section
        self.FullName=f"{name}({section})"

    def __str__(self) -> str:
        return self.FullName

    def __hash__(self) -> int:
        return hash(self.FullName)

    def __eq__(self, value: object) -> bool:
        return isinstance(value, ManPage) and self.FullName == value.FullName

    def __ne__(self, value: object) -> bool:
        return not isinstance(value, ManPage) or self.Name != value.Name

def FindReStructuredTextFiles(root: str) -> list[str]:
    rstFiles = []
    directories = [ root ]

    while len(directories) > 0:
        directory = directories.pop()
        entries = os.listdir(directory)

        for entry in entries:
            # ignore dotfiles (e.g. .sphinx)
            if entry.startswith("."):
                continue

            fullPath = os.path.join(directory, entry)

            if os.path.isfile(fullPath):
                if entry.endswith('.rst'):
                    rstFiles.append(fullPath)
            else:
                directories.append(fullPath)

    return rstFiles

def ParseUsedManpages(filePaths: list[str]) -> set[ManPage]:
    manpages = set()
    pattern = re.compile(
        pattern=r':manpage:`(?P<page>[a-z0-9-.]+)\((?P<section>\w+)\)`', 
        flags=re.DOTALL | re.IGNORECASE)

    for filePath in filePaths:
        with open(filePath, 'r') as file:
            for line in file:
                matches = pattern.findall(line)

                for match in matches:
                    manpage=ManPage(name=match[0], section=match[1])

                    manpages.add(manpage)

    return manpages

def IsAccessible(url) -> tuple[bool, Exception | None]:
    try: 
        response = requests.get(url, allow_redirects=True, timeout=5)
        response.raise_for_status()
        return (True, None)
    except Exception as error:
        return (False, error)

def VerifyThatAllManPagesAreAccessible() -> None:
    docsFolder = os.path.join(os.getcwd(), "docs")

    files = FindReStructuredTextFiles(docsFolder)
    usedManpages = ParseUsedManpages(files)

    allPagesAccessible = True

    for manpage in usedManpages:
        manpageUrl = manpages_url.format(page=manpage.Name,
                                         section=manpage.Section)

        isAccessible, error = IsAccessible(manpageUrl)

        if isAccessible:
            print(f"INFO: Fetching {manpage} (URL='{manpageUrl}') succeeded.")
        else:
            allPagesAccessible=False
            print(f"ERROR: Fetching {manpage} (URL='{manpageUrl}') failed!")
            print("DEBUG: ", error)

    if allPagesAccessible:
        print("All man pages are accessible :)")
    else:
        print("One ore more man pages are NOT accessible :'(")
        sys.exit(1)

if __name__ == "__main__":
    VerifyThatAllManPagesAreAccessible()

ru-fu commented 1 month ago

Fixed through https://github.com/canonical/sphinx-docs-starter-pack/pull/226

canonical / sphinx-docs-starter-pack

Linkchek has problems verifying manpage.ubuntu.com #224