HBase-Writer is a Java extension to the Heritrix open source crawler. Heritrix is written by the Internet Archive, and HBase-Writer enables Heritrix to store crawled content directly into HBase tables running on the Hadoop Distributed File System (HDFS). By default, HBase-Writer writes the content of each fetched URL into an HBase table as an individual record, identified by a "rowkey". HBase-Writer can also be extended for custom behavior, such as writing to multiple tables. In turn, these HBase tables are directly supported by the MapReduce framework via Hadoop. HBase-Writer's goal is to facilitate fast, large, distributed crawls using Heritrix and to save and manage Web-scale content using HBase.
Branch name: /trunk
Purpose of code changes on this branch:
Normally the crawler plugin shouldn't maintain any persistent resources and
should keep its memory footprint as small as possible. At the same time, we
should use Heritrix's pooling implementations for handling resource delegation.
Long crawls can reveal memory or resource issues that short unit tests cannot
reproduce.
When reviewing my code changes, please focus on:
This ticket covers creating unit tests that simulate crawls long enough to
measure resource usage and object creation by hbase-writer, and to verify that
pooled objects are being reused properly.
I would imagine that this requires launching Jetty or some other web app
server from Maven during unit testing, along with a sample web app to simulate
a site to crawl.
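
As a starting point for discussion, here is a minimal sketch of the kind of
check such a test could make: run many simulated writes through a pool and
assert that the number of objects ever created stays within the pool bound.
Everything in it is hypothetical. FakeResource, SimplePool, and simulateCrawl
are stand-ins invented for illustration, not hbase-writer or Heritrix classes,
and the loop replaces a real crawl against a Jetty-hosted sample site.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch only: none of these types exist in hbase-writer or Heritrix. */
public class PoolReuseSketch {

    /** Counts how many resources were ever constructed. */
    static final AtomicInteger CREATED = new AtomicInteger();

    /** Hypothetical stand-in for an expensive pooled object, e.g. a table client. */
    static final class FakeResource {
        FakeResource() { CREATED.incrementAndGet(); }
        void write(String url) { /* no-op; a real writer would store the record */ }
    }

    /** Minimal bounded pool: hand out an idle instance before creating a new one. */
    static final class SimplePool {
        private final Deque<FakeResource> idle = new ArrayDeque<>();
        private final int maxIdle;

        SimplePool(int maxIdle) { this.maxIdle = maxIdle; }

        synchronized FakeResource borrow() {
            FakeResource r = idle.poll();
            return r != null ? r : new FakeResource();
        }

        synchronized void giveBack(FakeResource r) {
            // Instances beyond the bound are simply dropped.
            if (idle.size() < maxIdle) idle.push(r);
        }
    }

    /** Simulates a long crawl and returns the number of resources created. */
    static int simulateCrawl(int urls, int maxIdle) {
        CREATED.set(0);
        SimplePool pool = new SimplePool(maxIdle);
        for (int i = 0; i < urls; i++) {
            FakeResource r = pool.borrow();
            try {
                r.write("http://example.test/page/" + i);
            } finally {
                pool.giveBack(r); // always return the object, even on failure
            }
        }
        return CREATED.get();
    }

    public static void main(String[] args) {
        int maxIdle = 4;
        int created = simulateCrawl(100_000, maxIdle);
        if (created > maxIdle) {
            throw new AssertionError("pool is not reusing objects: " + created);
        }
        System.out.println("resources created: " + created);
    }
}
```

A real test would replace the loop with an actual crawl job and read the
creation count from whatever pool hbase-writer ends up using. The assertion
would stay the same: object creation must not grow with the number of URLs
fetched.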
Original issue reported on code.google.com by ryan.justin.smith@gmail.com on 23 Jan 2012 at 1:05