Various resources and documentation on the creation of a static archived version of the KB Research blog, to be hosted at the location of the current 'live' blog.
logFile=wget.log
wget --page-requisites --span-hosts --convert-links --adjust-extension -w 5 --random-wait http://blog.kbresearch.nl/2015/07/07/why-pdfa-validation-matters-even-if-you-dont-have-pdfa/ >>"$logFile" 2>&1
Result: Blog post + CSS + images + comments render OK with network disabled! The download directory contains the following domain-specific subdirectories:
0.gravatar.com
1.gravatar.com
blog.kbresearch.nl
fonts.googleapis.com
fonts.gstatic.com
pixel.wp.com
platform.twitter.com
researchkb.files.wordpress.com
r-login.wordpress.com
s0.wp.com
s1.wp.com
s2.wp.com
stats.wp.com
widgets.wp.com
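A listing like the one above can be produced from the download directory itself. The following is a minimal sketch, assuming wget was run in the directory being inspected and that every subdirectory there corresponds to a host; the function names `list_hosts` and `hosts_csv` are made up for illustration and are not part of the original notes.

```shell
#!/bin/sh
# Sketch: list the per-host directories that wget created.
# list_hosts and hosts_csv are hypothetical helper names.

list_hosts() {
    # $1: download directory (defaults to the current directory)
    for d in "${1:-.}"/*/; do
        # Skip the literal pattern if there are no subdirectories
        [ -d "$d" ] || continue
        basename "$d"
    done
}

hosts_csv() {
    # Join the host names into one comma-separated string
    list_hosts "$1" | paste -sd, -
}
```

Running `hosts_csv .` in the download directory prints the host names as one comma-separated string, which is the form wget expects for a domain list. The result still needs manual pruning before it is reused.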
Use the --domains option, and set its value to the list of domains we got from the previous step. Exceptions:
So we get the following shell script:
url=http://blog.kbresearch.nl
domains=blog.kbresearch.nl,wp.com,researchkb.files.wordpress.com,googleapis.com,gstatic.com
logFile=wget.log
wget --mirror --page-requisites --span-hosts --convert-links --adjust-extension -w 5 --random-wait --domains="$domains" "$url" >>"$logFile" 2>&1
This results in 153 MB of data. Index and individual blog posts load correctly without a network connection, but there are some issues.