A tool to create a static version of a website for hosting on S3.
One of our clients needed a reliable emergency backup for a website. If the website went down, this backup would be available with reduced functionality.
S3 and Route 53 provide a great way to host a static emergency backup for a website - see this article: http://aws.typepad.com/aws/2013/02/create-a-backup-website-using-route-53-dns-failover-and-s3-website-hosting.html . In our experience it works well and is incredibly cheap: our average-sized website with a few hundred pages and assets costs less than US$1 a month to host.
We tried existing tools (httrack, wget) to crawl the site and create a static version to upload to S3, but found that they did not work well with S3 hosting. We wanted the copy uploaded to S3 to respond to the exact same URLs (where possible) as the existing site, so that when the site goes down, incoming links from Google search results and elsewhere still work.
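To make that constraint concrete: with S3 static website hosting, a request for a path such as /about/ is served from the key about/index.html (given an index document of index.html), so the keys written to the bucket have to line up with the site's original URLs rather than with filenames a mirroring tool invents. The sketch below only illustrates that kind of mapping; the s3_key_for helper is hypothetical and is not part of Staticizer:

require 'uri'

# Illustration only - not Staticizer's actual implementation.
def s3_key_for(url)
  path = URI.parse(url).path
  # "http://squaremill.com" and "http://squaremill.com/" both map to the index document
  return "index.html" if path.empty? || path == "/"
  key = path.sub(%r{\A/}, "")               # S3 keys have no leading slash
  key += "index.html" if key.end_with?("/") # directory-style URLs map to their index document
  key
end

s3_key_for("http://squaremill.com/")             # => "index.html"
s3_key_for("http://squaremill.com/about/")       # => "about/index.html"
s3_key_for("http://squaremill.com/css/site.css") # => "css/site.css"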
Add this line to your application's Gemfile:
gem 'staticizer'
And then execute:
$ bundle
Or install it yourself as:
$ gem install staticizer
Staticizer can be used via the command line tool or by requiring the library.

Crawl a website and write the static version to a local directory:

staticizer http://squaremill.com --output-dir=/tmp/crawl

Crawl a website and upload it directly to an S3 bucket:

staticizer http://squaremill.com --aws-s3-bucket=squaremill.com --aws-access-key=HJFJS5gSJHMDZDFFSSDQQ --aws-secret-key=HIA7T189234aADfFAdf322Vs12duRhOHy+23mc1+s

Crawl a website, also treating the listed domains as part of the site:

staticizer http://squaremill.com --valid-domains=squaremill.com,www.squaremill.com,img.squaremill.com
For all these examples you must first:
require 'staticizer'
Crawl a website and upload it to an AWS S3 bucket. This will only crawl URLs in the domain squaremill.com:
s = Staticizer::Crawler.new("http://squaremill.com",
  :aws => {
    :region => "us-west-1",
    :endpoint => "http://s3.amazonaws.com",
    :bucket_name => "www.squaremill.com",
    :secret_access_key => "HIA7T189234aADfFAdf322Vs12duRhOHy+23mc1+s",
    :access_key_id => "HJFJS5gSJHMDZDFFSSDQQ"
  }
)
s.crawl
Crawl a website and write it to a local directory:

s = Staticizer::Crawler.new("http://squaremill.com", :output_dir => "/tmp/crawl")
s.crawl
Crawl a website, write it to a local directory, and modify each page before it is saved - here any existing robots meta tag is stripped and a noindex tag is added so search engines do not index the backup copy:

s = Staticizer::Crawler.new("http://squaremill.com",
  :output_dir => "/tmp/crawl",
  :process_body => lambda {|body, uri, opts|
    # Not the best regex, but it will do for our use:
    # remove any existing robots meta tag, then insert a noindex tag after <head>
    body = body.gsub(/<meta\s+name=['"]robots[^>]+>/i, '')
    body = body.gsub(/<head>/i, "<head>\n<meta name='robots' content='noindex'>")
    body
  }
)
s.crawl
Crawl a website, rewrite all non-www URLs to the www domain, and upload the result to S3:

s = Staticizer::Crawler.new("http://squaremill.com",
  :aws => {
    :region => "us-west-1",
    :endpoint => "http://s3.amazonaws.com",
    :bucket_name => "www.squaremill.com",
    :secret_access_key => "HIA7T189234aADfFAdf322Vs12duRhOHy+23mc1+s",
    :access_key_id => "HJFJS5gSJHMDZDFFSSDQQ"
  },
  :filter_url => lambda do |url, info|
    # Only crawl the URL if it matches squaremill.com or www.squaremill.com
    if url =~ %r{https?://(www\.)?squaremill\.com}
      # Rewrite non-www URLs to www
      return url.gsub(%r{https?://(www\.)?squaremill\.com}, "http://www.squaremill.com")
    end
    # Returning nil here prevents the URL from being crawled
  end
)
s.crawl
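The :filter_url hook can also be used to skip parts of a site entirely, since returning nil prevents a URL from being crawled. A minimal sketch along those lines - the /admin and /cart paths are placeholders for this example, not anything Staticizer defines:

s = Staticizer::Crawler.new("http://squaremill.com",
  :output_dir => "/tmp/crawl",
  :filter_url => lambda do |url, info|
    # Skip private or dynamic sections that make no sense in a static backup
    return nil if url =~ %r{https?://(www\.)?squaremill\.com/(admin|cart)}
    # Crawl everything else on the squaremill.com domain unchanged
    return url if url =~ %r{https?://(www\.)?squaremill\.com}
    # Anything else (other domains) falls through and is not crawled
  end
)
s.crawl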
To contribute:

1. Create your feature branch (git checkout -b my-new-feature)
2. Commit your changes (git commit -am 'Add some feature')
3. Push to the branch (git push origin my-new-feature)
4. Create a new Pull Request