Debian / debiman

debiman generates a static manpage HTML repository out of a Debian archive
Apache License 2.0
188 stars 46 forks source link

dman optimization: redirect to any suite or locale if missing #79

Closed anarcat closed 7 years ago

anarcat commented 7 years ago

the latest dman code uses this loop to fetch raw manpages:

BASE_URL="https://manpages.debian.org"
# [...]
for URL in "$BASE_URL/$DISTRIB_CODENAME/$PAGE$LOCDOT.gz" \
               "$BASE_URL/$DISTRIB_CODENAME/$PAGE.gz" \
               "$BASE_URL/$PAGE$LOCDOT.gz" \
               "$BASE_URL/$PAGE.gz" \

do
    if wget -O "$man" "$URL" 2>/dev/null; then
            man $MAN_ARGS -l "$man" || true
        exit 0
    else
        echo "$0: not found: $URL" 1>&1
    fi
done

ie. it will do 4 hits on the webserver for a missing page. This means that "dman dman" takes 5 seconds to execute here:

$ time ./dman dman  > /dev/null
0.31user 0.05system 0:05.13elapsed 7%CPU (0avgtext+0avgdata 11696maxresident)k
0inputs+8outputs (0major+10652minor)pagefaults 0swaps

switching the base_url to dyn.manpages.org brings this down to about 4 seconds.

but it seems to me this logic could be more efficiently implemented in the dyn.m.o redirector. the logic would be:

would those be welcome improvements?

stapelberg commented 7 years ago

Yeah, these improvements make sense to me.

Originally, I was a bit torn, because when users (or tools, in this case) specify some qualifier (e.g. the suite), we should respect that. But then I thought about it from the other side: users following a now outdated (or malformed to begin with) link will most likely be happier seeing a working page than seeing an error message.

If it turns out that people have legitimate use-cases for exact matching (i.e. without a fallback to the next best thing), we can still add an additional endpoint for that.

Let me know if you need any pointers with regards to implementing this change, and thanks in advance!

anarcat commented 7 years ago

pointers would sure be useful. i looked into fixing language redirection very quickly, and it looks like it should already falling back to the default locale ("en"?):

https://github.com/Debian/debiman/blob/master/internal/redirect/redirect.go#L41

there's a comment there that says:

    // ensure that en comes first, so that language.Matcher treats it as default

or maybe there should be a change in the Narrow function? there a bunch of comparisons there, which seem to be doing what we need...

stapelberg commented 7 years ago

Note that bestLanguageMatch is only called if t.Language == "" :)

Yes, your changes definitely will need to touch the Narrow function.

anarcat commented 7 years ago

would that make sense as a (failing) unit test?

diff --git a/internal/redirect/redirect_test.go b/internal/redirect/redirect_test.go
index 88137e4..ce4d175 100644
--- a/internal/redirect/redirect_test.go
+++ b/internal/redirect/redirect_test.go
@@ -350,6 +350,9 @@ func TestUnderspecified(t *testing.T) {
        {Case: 16, URL: "jessie/i3-wm/i3.1.fr", want: "jessie/i3-wm/i3.1.fr.html"},                             // default suite
        {Case: 16, URL: "testing/i3-wm/i3.1.fr", want: "testing/i3-wm/i3.1.fr.html"},                           // non-default suite
        {Case: 16, URL: "jessie/libedit-dev/editline.3.en", want: "jessie/libedit-dev/editline.3edit.en.html"}, // section with subsection
+
+       {Case: 17, URL: "jessie/i3.1.dne", want: "jessie/i3-wm/i3.1.en.html"}, // non-existent locale
+       {Case: 17, URL: "dne/i3.1.en", want: "jessie/i3-wm/i3.1.en.html"},     // non-existent suite
    }
    for _, entry := range table {
        entry := entry // capture
anarcat commented 7 years ago

okay well, I tried to figure out the "narrow" function, and frankly, I can't. It seems to me it would require a massive refactoring to do what we need here.

i only managed to generate a failling test suite so far:

diff --git a/internal/redirect/redirect_test.go b/internal/redirect/redirect_test.go
index 88137e4..1486db5 100644
--- a/internal/redirect/redirect_test.go
+++ b/internal/redirect/redirect_test.go
@@ -10,6 +10,7 @@ var testIdx = Index{
        Langs: map[string]bool{
                "en": true,
                "fr": true,
+               "es": true,
        },

        Sections: map[string]bool{
@@ -350,6 +351,9 @@ func TestUnderspecified(t *testing.T) {
                {Case: 16, URL: "jessie/i3-wm/i3.1.fr", want: "jessie/i3-wm/i3.1.fr.html"},                             // default suite
                {Case: 16, URL: "testing/i3-wm/i3.1.fr", want: "testing/i3-wm/i3.1.fr.html"},                           // non-default suite
                {Case: 16, URL: "jessie/libedit-dev/editline.3.en", want: "jessie/libedit-dev/editline.3edit.en.html"}, // section with subsection
+
+               {Case: 17, URL: "jessie/i3.1.es", want: "jessie/i3-wm/i3.1.en.html"}, // non-existent locale
+               {Case: 17, URL: "potato/i3.1.en", want: "jessie/i3-wm/i3.1.en.html"}, // non-existent suite
        }
        for _, entry := range table {
                entry := entry // capture

one of the problems I'm having with the narrow function is that there's a huge disconnect between the "filter" and "qualified" functions: the "qualified" function may find the perfect fit in the filtered list, but then when the filtered list is returned, only the first element is taken. this means we rely a lot on sorting the elements, as opposed to just picking the right one from the start, which makes it really hard to introduce multiple ambiguity criterion.

Right now, it seems to be that only the "section" has such sort-based fallback, and I couldn't figure out how to introduce other fallbacks without breaking that one.

stapelberg commented 7 years ago

As discussed via IRC: the tests you posted look fine, but one detail needed to be changed: potato/i3.1.en was parsed as binarypkg=potato.

anarcat commented 7 years ago

i have pushed changes to dman to make use of the new changes, which optimize the "no such manpage" use case by reducing the 4 HTTP calls to a single one. :)

thanks!