Contents of this repo

Resources and documentation for creating a static, archived version of the KB Research blog, to be hosted at the location of the current 'live' blog.

Documentation

Step 1: scrape one blog post + all resources

logFile=wget.log
wget --page-requisites --span-hosts --convert-links --adjust-extension -w 5 --random-wait http://blog.kbresearch.nl/2015/07/07/why-pdfa-validation-matters-even-if-you-dont-have-pdfa/ >>"$logFile" 2>&1

Result: the blog post, CSS, images and comments all render correctly with the network disabled. The download directory contains the following domain-specific subdirectories:

0.gravatar.com
1.gravatar.com
blog.kbresearch.nl
fonts.googleapis.com
fonts.gstatic.com
pixel.wp.com
platform.twitter.com
researchkb.files.wordpress.com
r-login.wordpress.com
s0.wp.com
s1.wp.com
s2.wp.com
stats.wp.com
widgets.wp.com
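
With --span-hosts, wget writes each host's files to its own top-level subdirectory, so the list above can be read straight from the download directory. A minimal sketch, assuming the download was run in the current directory; the function name list_hosts is hypothetical and not part of this repo:

```shell
# list_hosts DIR: print the name of each top-level subdirectory of DIR,
# one per line, sorted. DIR defaults to the current directory.
list_hosts() {
    dir=${1:-.}
    for d in "$dir"/*/; do
        [ -d "$d" ] || continue    # skip the literal pattern if nothing matched
        d=${d%/}                   # strip the trailing slash
        printf '%s\n' "${d##*/}"   # strip the leading path
    done | sort
}

list_hosts .
```

Plain files such as the wget log are skipped, because the pattern only matches directories.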

Step 2: scrape the whole blog

Use the --domains option, and set its value to the list of domains we got from the previous step. Exceptions:

Ignore the domain gravatar.com (including it results in over 60 subdomains being scraped, and it is not that important).
Ignore the domain twitter.com.
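
The two exceptions can also be applied mechanically to a list of host names. The following is only a sketch: the function name build_domains is hypothetical, and it keeps every remaining host as-is rather than shortening subdomains to their parent domain.

```shell
# build_domains: read host names on stdin (one per line), drop every host
# under gravatar.com or twitter.com, and print the rest as one
# comma-separated line suitable for wget's --domains option.
build_domains() {
    grep -v -E '(^|\.)(gravatar|twitter)\.com$' | paste -s -d, -
}

printf '%s\n' 0.gravatar.com blog.kbresearch.nl platform.twitter.com s0.wp.com | build_domains
# prints: blog.kbresearch.nl,s0.wp.com
```

Because --domains matches on domain suffixes, a single entry such as wp.com also covers s0.wp.com, s1.wp.com and so on, which keeps a hand-written list short.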

So we get the following shell script:

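# Start URL, domains that wget may follow links into, and log file
url=http://blog.kbresearch.nl
domains=blog.kbresearch.nl,wp.com,researchkb.files.wordpress.com,googleapis.com,gstatic.com
logFile=wget.log
# Mirror the whole blog, pausing roughly 5 seconds (randomised) between requests
wget --mirror --page-requisites --span-hosts --convert-links --adjust-extension -w 5 --random-wait --domains="$domains" "$url" >>"$logFile" 2>&1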

This results in 153 MB of data. Index and individual blog posts load correctly without a network connection, but there are some issues.
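
One way to start looking into a crawl like this is to check the size of the download and count the failed requests in the log. This is a sketch under assumptions: it expects to be run from the download directory with wget.log alongside, the function name count_log_errors is hypothetical, and matching on the string "ERROR" assumes wget's usual English log messages (for example "ERROR 404: Not Found.").

```shell
# count_log_errors LOGFILE: print the number of lines in LOGFILE that
# contain "ERROR". grep -c exits non-zero when the count is 0, so the
# "|| true" keeps the function's exit status at 0 in that case.
count_log_errors() {
    grep -c 'ERROR' "$1" || true
}

# Total size of everything below the current directory, human-readable
du -sh .

# Number of error lines in the wget log, if the log is present
if [ -f wget.log ]; then
    count_log_errors wget.log
fi
```

To see the affected URLs rather than a count, grep -B 3 'ERROR' wget.log prints a few lines of context before each match.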
