Closed by stapelberg 7 years ago
I have a proof-of-concept. The resulting RewriteMap is 419MB in textual form and 650MB in dbm form.
Since Apache doesn’t allow integrating with the content negotiation process, we would need to do our own Accept-Language parsing, which could look like this:
# Replace e.g. pt-BR with pt_BR (edit* rewrites every "-", not just the first).
RequestHeader edit* Accept-Language "-" "_" early
# Only SetEnvIf gets called _before_ RewriteRule.
SetEnvIf Accept-Language "^(.*)$" ACCLANG=$1
SetEnvIf Accept-Language "^$" ACCLANG=en
RewriteEngine on
RewriteMap all dbm:/srv/man/rwmap-all.dbm
# chomp off the first language tag for use in the following rules
RewriteCond "%{env:ACCLANG}" "^([^,;]+)"
RewriteRule .* - [E=ACCTOK:%1]
RewriteCond "${all:$1.%{env:ACCTOK}}" "^(.+)$"
RewriteRule ^(.+)$ /%1 [redirect=307,last]
# while ACCLANG is non-empty, repeat
RewriteCond "%{env:ACCLANG}" "^(?:[^,]+),(.+)"
RewriteRule .* - [E=ACCLANG:%1,N]
# fallback: maybe the language is already included?
RewriteCond "${all:$1}" "^(.+)$"
RewriteRule ^(.+)$ /%1 [redirect=307,last]
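To make the control flow of these rules easier to follow, here is a minimal Python sketch of the same loop: take the first language tag, look up `<path>.<tag>` in the map, drop the tag and repeat, and finally fall back to the bare path. The function name `resolve`, the key format, and the example map entries are assumptions made for illustration; the sketch also assumes every "-" in the header is replaced by "_".

```python
import re


def resolve(path, accept_language, rewrite_map):
    """Return the redirect target for path, or None if the map has no entry."""
    # Assumption: all hyphens become underscores; an empty header means "en".
    acclang = accept_language.replace("-", "_") or "en"
    while True:
        # Chomp off the first language tag (everything before "," or ";").
        tok = re.match(r"([^,;]+)", acclang)
        if tok:
            target = rewrite_map.get(f"{path}.{tok.group(1)}")
            if target:
                return "/" + target
        # While more tags remain, drop the first one and repeat.
        rest = re.match(r"[^,]+,(.+)", acclang)
        if not rest:
            break
        acclang = rest.group(1)
    # Fallback: maybe the language is already included in the path.
    target = rewrite_map.get(path)
    return "/" + target if target else None


# Hypothetical map entries, for illustration only.
example_map = {
    "ls.1.de": "stretch/coreutils/ls.1.de.html",
    "ls.1": "stretch/coreutils/ls.1.en.html",
}
print(resolve("ls.1", "fr,de;q=0.8", example_map))
```

Like the rules, the sketch honours only the order of the tags and ignores the q-values, so it is a simplification of full Accept-Language handling.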
The (sorted) rewrite map in text format is actually fairly well compressible using rsync -z:
$ rsync -v --no-whole-file -z --stats rwmap-all-sorted.txt ../old/rwmap-all-sorted.txt
rwmap-all-sorted.txt
Number of files: 1 (reg: 1)
Number of created files: 0
Number of deleted files: 0
Number of regular files transferred: 1
Total file size: 440,812,135 bytes
Total transferred file size: 440,812,135 bytes
Literal data: 120,307,171 bytes
Matched data: 320,504,964 bytes
File list size: 0
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 13,046,699
Total bytes received: 146,721
sent 13,046,699 bytes received 146,721 bytes 1,759,122.67 bytes/sec
total size is 440,812,135 speedup is 33.41
rsync -v --no-whole-file -z --stats rwmap-all-sorted.txt 8,01s user 0,28s system 115% cpu 7,192 total
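As a quick cross-check of these statistics, the following Python sketch recomputes the reported speedup from the byte counts above. It assumes that rsync's speedup is the total file size divided by the bytes sent plus received, which appears consistent with the output; the variable names are mine.

```python
# Byte counts copied from the rsync --stats output above.
total_size = 440_812_135
literal = 120_307_171   # data that had to be sent as-is
matched = 320_504_964   # data reused from the old copy on the receiver
sent = 13_046_699
received = 146_721

# Assumption: speedup = total size / (bytes sent + bytes received).
speedup = total_size / (sent + received)
# Share of the file that the delta algorithm found in the old copy.
matched_share = matched / total_size

print(f"speedup: {speedup:.2f}")
print(f"matched share: {matched_share:.1%}")
```

Literal and matched data add up to the total file size. The literal 120MB shrinks to roughly 13MB on the wire, which is where -z helps.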
Capturing the discussion results on #debian-admin:
Two features cannot be covered by a pure RewriteMap-based solution:
As for ①, we could change the mechanism so that /jump requests are prefixed with the current suite, and the rewritemap will contain entries for all known suites, falling back where necessary.
As for ②, we could redirect the user to manpages-$MASTER.debian.org or similar.
Out of curiosity, I benchmarked both setups by sending each setup 175370 requests (actual user traffic extracted from the Apache access.log):
nginx+auxserver: 69,59s user 31,37s system 517% cpu 19,519 total (9230 queries/s)
apache+rewritemap: 110,25s user 24,72s system 240% cpu 56,184 total (3131 queries/s)
Both numbers are more than sufficient for the current load, but it’s nice to see my intuition about the performance characteristics confirmed.
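For reference, the throughput figures can be re-derived from the request count and the wall-clock totals. This is a small sketch of that arithmetic, not part of the benchmark itself; the helper name `queries_per_second` is mine.

```python
REQUESTS = 175_370  # requests replayed against each setup

# Wall-clock seconds, taken from the "total" column of the timings above.
wall_seconds = {"nginx+auxserver": 19.519, "apache+rewritemap": 56.184}


def queries_per_second(requests, seconds):
    """Average throughput over the whole run."""
    return requests / seconds


rates = {name: queries_per_second(REQUESTS, s) for name, s in wall_seconds.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.0f} queries/s")

ratio = rates["nginx+auxserver"] / rates["apache+rewritemap"]
print(f"ratio: {ratio:.2f}x")
```

With the fractional totals this gives roughly 8985 and 3121 queries/s. Dividing by the whole-second totals (19 and 56) instead reproduces the quoted 9230 and 3131. Either way, the nginx+auxserver setup comes out a little under three times as fast.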
Also, the complete (as opposed to proof-of-concept) rewrite map contains 19107257 entries and clocks in at 1.6GB.
Latest status update: generating the rewritemap takes less than 5 minutes on my workstation, but hours on hosts where random disk I/O is costly.
We should try reducing the map size by moving a few simple parts into rewrite rules:
Further, we might be able to split the map into one part per suite, to further reduce its size.
The legacy URL schema has been moved out of the map with commit https://github.com/Debian/debiman/commit/cc38aee4d74ce2bc500de2515b973b9affe97a3f
Further, I realized that using DB_BTREE as database format is much more performant than DB_HASH. We now use the following script to convert the index to a dbm RewriteMap:
# convert the index
TMPDIR=$(mktemp -d -p /srv/manpages.debian.org/www rwmap-tmpXXXXXX)
function cleanup {
rm -rf "$TMPDIR"
}
trap cleanup EXIT
/srv/manpages.debian.org/debiman/gopath/bin/debiman-idx2rwmap -index=/srv/manpages.debian.org/www/auxserver.idx -output_dir="$TMPDIR"
LC_ALL=C sort "${TMPDIR}"/output.* > "${TMPDIR}/rwmap.txt"
# Create an empty DB_BTREE berkeley db file: httxt2dbm does not offer changing
# the file format, but will respect the file format of an already existing
# output file.
echo -n H4sICEEvilgAA2VtcHR5My5kYm0A7dexCYBADIXhF+GK624DLVxAbJzBMVzBNRzO0tpJPDnlRLEW4f8gJCRkgCdJpmRonPw+hFg+7c7b1dguqmPv+nmdyrwvjl69/AEAAAAAgO+YHnk9mMu3O/I/AAAAAAD/sgEYIbKQACAAAA== | base64 -d | gunzip -c > ${TMPDIR}/rwmap.dbm
httxt2dbm -f DB -i "${TMPDIR}/rwmap.txt" -o "${TMPDIR}/rwmap.dbm"
mv "${TMPDIR}/rwmap.txt" /srv/manpages.debian.org/www/rwmap.txt
mv "${TMPDIR}/rwmap.dbm" /srv/manpages.debian.org/www/rwmap.dbm
weasel remarked on IRC that we could, as a further optimization, move the “context-aware redirect” feature (i.e. when browsing unstable, jumping to another manpage should result in the unstable version, not stable) into the rewritemap processing by introducing a /attempt-<suite>/ URL schema which would first try <suite>, then fall back to regular processing.
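A rough Python sketch of how such an /attempt-<suite>/ lookup could behave is shown below. The key format (`<suite>/<name>` for suite-specific entries, bare `<name>` for regular ones), the function name `lookup`, and the example entries are all assumptions for illustration; the actual map layout may differ.

```python
import re


def lookup(url_path, rewrite_map):
    """Try the suite named in /attempt-<suite>/ first, then regular processing."""
    m = re.match(r"/attempt-([^/]+)/(.+)", url_path)
    if m:
        suite, rest = m.groups()
        # First attempt: the same manpage within the suite being browsed.
        target = rewrite_map.get(f"{suite}/{rest}")
        if target:
            return target
        # Not available in that suite: continue with the plain path.
        url_path = "/" + rest
    # Regular processing: look up the path without any suite hint.
    return rewrite_map.get(url_path.lstrip("/"))


# Hypothetical map entries, for illustration only.
example_map = {
    "unstable/ls.1": "unstable/coreutils/ls.1.en.html",
    "ls.1": "stretch/coreutils/ls.1.en.html",
    "i3.1": "stretch/i3-wm/i3.1.en.html",
}
print(lookup("/attempt-unstable/ls.1", example_map))
print(lookup("/attempt-unstable/i3.1", example_map))
```

The first call stays within unstable; the second falls back to the regular entry because the example map has no unstable entry for that page.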
The RewriteMap based setup is now deployed on the static mirroring infrastructure \o/.
See https://httpd.apache.org/docs/2.4/rewrite/rewritemap.html for details.
This would result in an entirely static page, enabling us to distribute it across different machines more easily.