Debian / debiman

debiman generates a static manpage HTML repository out of a Debian archive
Apache License 2.0

Explore generating apache RewriteMaps #29

Closed stapelberg closed 7 years ago

stapelberg commented 7 years ago

See https://httpd.apache.org/docs/2.4/rewrite/rewritemap.html for details.

This would result in an entirely static page, enabling us to distribute it across different machines more easily.

stapelberg commented 7 years ago

I have a proof of concept. The resulting RewriteMap is 419 MB in text form and 650 MB in dbm form.

Since Apache does not let us hook into its content negotiation process, we would need to parse Accept-Language ourselves, which could look like this:

    # Replace e.g. pt-BR with pt_BR. edit* replaces every "-" in the header,
    # whereas plain edit would only replace the first one.
    RequestHeader edit* Accept-Language "-" "_" early

    # Only SetEnvIf gets called _before_ RewriteRule.
    SetEnvIf Accept-Language "^(.*)$" ACCLANG=$1
    SetEnvIf Accept-Language "^$" ACCLANG=en

    RewriteEngine on
    RewriteMap all dbm:/srv/man/rwmap-all.dbm

        # chomp off the first language tag for use in the following rules
        RewriteCond "%{env:ACCLANG}" "^([^,;]+)"
        RewriteRule .* - [E=ACCTOK:%1]

        RewriteCond "${all:$1.%{env:ACCTOK}}" "^(.+)$"
        RewriteRule ^(.+)$ /%1 [redirect=307,last]

    # while ACCLANG is non-empty, repeat
    RewriteCond "%{env:ACCLANG}" "^(?:[^,]+),(.+)"
    RewriteRule .* - [E=ACCLANG:%1,N]

    # fallback: maybe the language is already included?
    RewriteCond "${all:$1}" "^(.+)$"
    RewriteRule ^(.+)$ /%1 [redirect=307,last]
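
To make the control flow of the rules above easier to follow, here is a minimal Python sketch of the same lookup loop. It is only an illustration: the function name `resolve`, the plain dict standing in for the dbm map, and the example keys are assumptions, not part of debiman or Apache.

```python
import re


def resolve(path, accept_language, rwmap):
    """Return the redirect target for path, or None if the map has no entry.

    rwmap is a plain dict standing in for the dbm RewriteMap (an assumption
    made for this sketch).
    """
    # Mirror of the header rewrite: pt-BR becomes pt_BR; empty means "en".
    acclang = accept_language.replace("-", "_") or "en"
    while True:
        # Chomp off the first language tag, dropping any ";q=..." parameter.
        token = re.match(r"[^,;]+", acclang)
        if token:
            target = rwmap.get(f"{path}.{token.group(0)}")
            if target:
                return "/" + target
        # While the list has another element, drop the first one and repeat.
        rest = re.match(r"[^,]+,(.+)", acclang)
        if not rest:
            break
        acclang = rest.group(1)
    # Fallback: maybe the language is already included in the path.
    target = rwmap.get(path)
    return "/" + target if target else None
```

Like the rules, this sketch tries the languages in header order, ignores q-values, and does not strip whitespace after commas.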

stapelberg commented 7 years ago

The (sorted) rewrite map in text format transfers fairly efficiently with rsync -z when an older version already exists on the receiving side:

    $ rsync -v --no-whole-file -z --stats rwmap-all-sorted.txt ../old/rwmap-all-sorted.txt
    rwmap-all-sorted.txt

    Number of files: 1 (reg: 1)
    Number of created files: 0
    Number of deleted files: 0
    Number of regular files transferred: 1
    Total file size: 440,812,135 bytes
    Total transferred file size: 440,812,135 bytes
    Literal data: 120,307,171 bytes
    Matched data: 320,504,964 bytes
    File list size: 0
    File list generation time: 0.001 seconds
    File list transfer time: 0.000 seconds
    Total bytes sent: 13,046,699
    Total bytes received: 146,721

    sent 13,046,699 bytes  received 146,721 bytes  1,759,122.67 bytes/sec
    total size is 440,812,135  speedup is 33.41
    rsync -v --no-whole-file -z --stats rwmap-all-sorted.txt   8,01s user 0,28s system 115% cpu 7,192 total
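
As a quick sanity check of these statistics, the following sketch recomputes the reported speedup from the byte counts above. The variable names are made up for illustration; the ratio of literal data to bytes sent is only a rough indication of what -z contributes, since the sent bytes also include protocol overhead.

```python
# Byte counts copied from the rsync --stats output above.
total_size = 440_812_135
literal = 120_307_171   # data not found in the old file, sent (compressed)
matched = 320_504_964   # data reused from the old file on the receiver
sent = 13_046_699
received = 146_721

# rsync reports speedup as total size divided by bytes sent plus received.
speedup = total_size / (sent + received)
matched_share = matched / total_size
literal_ratio = literal / sent

print(f"speedup: {speedup:.2f}")
print(f"share of data matched by the delta algorithm: {matched_share:.1%}")
print(f"literal data vs. bytes sent: roughly {literal_ratio:.1f}x")
```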

stapelberg commented 7 years ago

Capturing the discussion results on #debian-admin:

Two features cannot be covered by a pure RewriteMap-based solution:

  1. Context-based redirects (see issue #25)
  2. 404 pages with suggestions

As for ①, we could change the mechanism so that /jump requests are prefixed with the current suite, and the rewritemap will contain entries for all known suites, falling back where necessary.

As for ②, we could redirect the user to manpages-$MASTER.debian.org or similar.

stapelberg commented 7 years ago

Out of curiosity, I benchmarked both setups by sending each of them 175370 requests (actual user traffic extracted from the Apache access.log):

    nginx+auxserver:   69,59s user 31,37s system 517% cpu 19,519 total (9230 queries/s)
    apache+rewritemap: 110,25s user 24,72s system 240% cpu 56,184 total (3131 queries/s)

Both numbers are more than sufficient for the current load, but it’s nice to see my intuition about the performance characteristics confirmed.

stapelberg commented 7 years ago

Also, the complete (as opposed to proof-of-concept) rewrite map contains 19107257 entries and clocks in at 1.6GB.

stapelberg commented 7 years ago

Latest status update: generating the RewriteMap takes less than 5 minutes on my workstation, but literally hours on hosts where random disk I/O is costly.

stapelberg commented 7 years ago

We should try reducing the map size by moving a few simple parts into rewrite rules:

We might also be able to split the map into one part per suite to reduce the size of each map further.
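
A rough sketch of what a per-suite split could look like, operating on the text form of the map (one "key value" pair per line). The key layout is an assumption made for this illustration: keys are taken to start with the suite name as their first path component, and keys without a suite go into a part named "". The function `split_by_suite` is hypothetical and not part of debiman.

```python
from collections import defaultdict


def split_by_suite(lines):
    """Group text RewriteMap lines by the suite prefix of their key.

    Assumes (hypothetically) that keys look like "<suite>/<rest>"; keys
    without a "/" are collected under the empty suite name "".
    """
    parts = defaultdict(list)
    for line in lines:
        # A text RewriteMap line is "key value", separated by whitespace.
        key = line.split(None, 1)[0]
        suite = key.split("/", 1)[0] if "/" in key else ""
        parts[suite].append(line)
    return dict(parts)
```

Each resulting part could then be written to its own file and converted separately.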

stapelberg commented 7 years ago

The legacy URL schema has been moved out of the map with commit https://github.com/Debian/debiman/commit/cc38aee4d74ce2bc500de2515b973b9affe97a3f

Further, I realized that using DB_BTREE as the database format performs much better than DB_HASH. We now use the following script to convert the index to a dbm RewriteMap:

    # Convert the index in a temporary directory that is removed on exit.
    TMPDIR=$(mktemp -d -p /srv/manpages.debian.org/www rwmap-tmpXXXXXX)
    function cleanup {
      rm -rf "$TMPDIR"
    }
    trap cleanup EXIT

    /srv/manpages.debian.org/debiman/gopath/bin/debiman-idx2rwmap -index=/srv/manpages.debian.org/www/auxserver.idx -output_dir="$TMPDIR"
    # Sort bytewise, independent of the system locale.
    LC_ALL=C sort "${TMPDIR}"/output.* > "${TMPDIR}/rwmap.txt"
    # Create an empty DB_BTREE berkeley db file: httxt2dbm does not offer changing
    # the file format, but will respect the file format of an already existing
    # output file.
    echo -n H4sICEEvilgAA2VtcHR5My5kYm0A7dexCYBADIXhF+GK624DLVxAbJzBMVzBNRzO0tpJPDnlRLEW4f8gJCRkgCdJpmRonPw+hFg+7c7b1dguqmPv+nmdyrwvjl69/AEAAAAAgO+YHnk9mMu3O/I/AAAAAAD/sgEYIbKQACAAAA== | base64 -d | gunzip -c > "${TMPDIR}/rwmap.dbm"
    httxt2dbm -f DB -i "${TMPDIR}/rwmap.txt" -o "${TMPDIR}/rwmap.dbm"
    # Move the finished files into place.
    mv "${TMPDIR}/rwmap.txt" /srv/manpages.debian.org/www/rwmap.txt
    mv "${TMPDIR}/rwmap.dbm" /srv/manpages.debian.org/www/rwmap.dbm

stapelberg commented 7 years ago

weasel remarked on IRC that we could, as a further optimization, move the “context-aware redirect” feature (i.e. when browsing unstable, jumping to another manpage should result in the unstable version, not stable) into the RewriteMap processing by introducing a /attempt-<suite>/ URL schema which would first try <suite>, then fall back to regular processing.
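
The idea can be sketched in a few lines of Python. This only illustrates the "try the suite first, then fall back" order; the function `resolve_attempt`, the dict standing in for the map, and the key layout ("<suite>/<name>" and plain "<name>") are assumptions for this sketch, not the actual debiman schema.

```python
import re


def resolve_attempt(url_path, rwmap):
    """Resolve url_path, preferring the suite named in /attempt-<suite>/.

    rwmap is a plain dict standing in for the RewriteMap; its key layout
    is hypothetical.
    """
    m = re.match(r"/attempt-([^/]+)/(.+)", url_path)
    if m:
        suite, rest = m.groups()
        # First try the manpage within the requested suite.
        target = rwmap.get(f"{suite}/{rest}")
        if target:
            return target
        # Not available in that suite: fall back to regular processing.
        url_path = "/" + rest
    return rwmap.get(url_path.lstrip("/"))
```

With a map that contains i3.1 for both unstable and stable but ls.1 only for stable, /attempt-unstable/i3.1 would resolve to the unstable page, while /attempt-unstable/ls.1 would fall back to the stable one.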

stapelberg commented 7 years ago

The RewriteMap based setup is now deployed on the static mirroring infrastructure \o/.