ArchiveBot
More info
ArchiveBot has two major backend components: the control node, which runs the IRC interface and bookkeeping programs, and the crawlers, which do all the Web crawling. ArchiveBot users communicate with ArchiveBot by issuing commands in an IRC channel.
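The split described above can be pictured with a small toy model. This is only a sketch under stated assumptions, not ArchiveBot's actual code: the class names (`ControlNode`, `Crawler`, `Job`), the in-memory queue, and the `!archive <url>` message format used for parsing are all hypothetical stand-ins chosen for illustration. The user's guide is the place to look for the real command set.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    url: str
    requested_by: str


class ControlNode:
    """Toy stand-in for the control node: parses chat commands and keeps the books."""

    def __init__(self):
        self.pending = deque()  # jobs waiting for a crawler (hypothetical in-memory queue)
        self.finished = []      # bookkeeping: jobs that a crawler reported as done

    def handle_message(self, nick, text):
        # "!archive <url>" is an illustrative command format, assumed for this sketch.
        parts = text.split()
        if len(parts) == 2 and parts[0] == "!archive":
            job = Job(url=parts[1], requested_by=nick)
            self.pending.append(job)
            return f"{nick}: queued {job.url}"
        return None  # not a command; ignore ordinary chat

    def next_job(self):
        # Hand out the oldest pending job, or None if the queue is empty.
        return self.pending.popleft() if self.pending else None

    def report_done(self, job):
        self.finished.append(job)


class Crawler:
    """Toy stand-in for a crawler: takes one job, 'crawls' it, and reports back."""

    def __init__(self, control):
        self.control = control

    def run_once(self):
        job = self.control.next_job()
        if job is None:
            return False
        # A real crawler would fetch and record pages here; this sketch does no I/O.
        self.control.report_done(job)
        return True


control = ControlNode()
reply = control.handle_message("alice", "!archive http://example.com/")
crawler = Crawler(control)
did_work = crawler.run_once()
```

The point of the sketch is the division of labour: the control node only accepts commands and tracks job state, while any number of crawlers can pull work from it independently.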
User's guide: http://archivebot.readthedocs.org/en/latest/
Control node installation guide: INSTALL.backend
Crawler installation guide: INSTALL.pipeline
ArchiveBot was originally written as a set of separate programs for deployment on a server, which makes it awkward to package and install for individual use. However, Ivan Kozik (@ivan) has taken the ArchiveBot pipeline, dashboard, ignores, and control system and packaged them for personal use. You can find the package at https://github.com/ArchiveTeam/grab-site.
Copyright 2013 David Yip; made available under the MIT license. See LICENSE for details.
Thanks to Alard (@alard), who added WARC generation and Lua scripting to GNU Wget. Wget+lua was the first web crawler used by ArchiveBot.
Thanks to Christopher Foo (@chfoo) for wpull, ArchiveBot's current web crawler.
Thanks to Ivan Kozik (@ivan) for maintaining ignore patterns and tracking down performance problems at scale.
Other thanks go to the following projects:
Dragonette, Barnaby Bright, Vienna Teng, NONONO.
The memory hole of the Web has gone too far. Don't look down, never look away; ArchiveBot's like the wind.
vim:ts=2:sw=2:tw=72:et