OpenSourceMasters / hbase-writer

HBase-Writer is a Java extension to the Heritrix open source crawler. Heritrix is written by the Internet Archive, and HBase-Writer enables Heritrix to store crawled content directly into HBase tables running on the Hadoop Distributed File System. By default, HBase-Writer writes the content of each fetched URL into an HBase table as an individual record, keyed by a "rowkey". However, HBase-Writer can easily be extended for custom behavior, such as writing to multiple tables. In turn, these HBase tables are directly supported by the MapReduce framework via Hadoop. HBase-Writer's goal is to facilitate fast, large, distributed crawls using Heritrix and to save and manage Web-scale content using HBase.
http://opensourcemasters.org/
Other

max content size limit #6

Status: Closed (GoogleCodeExporter closed this issue 9 years ago)

GoogleCodeExporter commented 9 years ago
The Heritrix2 filter for rejecting objects with a content length beyond a
threshold does not seem to work. HBase does not handle objects larger than
100MB well. The attached patch provides a mechanism in the writer for
rejecting objects deemed too large. The default threshold is 20MB. I have
been running this locally for a few days now and it works great.
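
For readers without access to the attachment, the kind of check described above might look roughly like the sketch below. This is not the code from the patch: the class name `ContentSizeGuard`, the method `shouldWrite`, and the reading of "20MB" as 20 * 1024 * 1024 bytes are all assumptions made for illustration.

```java
// Hypothetical sketch of a size guard a writer could consult before storing a record.
// None of these names come from the actual hbase-writer patch.
final class ContentSizeGuard {
    // Assumed interpretation of the 20MB default mentioned in the report.
    static final long DEFAULT_MAX_CONTENT_BYTES = 20L * 1024 * 1024;

    private final long maxContentBytes;

    ContentSizeGuard() {
        this(DEFAULT_MAX_CONTENT_BYTES);
    }

    ContentSizeGuard(long maxContentBytes) {
        if (maxContentBytes <= 0) {
            throw new IllegalArgumentException("maxContentBytes must be positive");
        }
        this.maxContentBytes = maxContentBytes;
    }

    // Returns true when the content is small enough to be written,
    // false when it exceeds the configured threshold and should be skipped.
    boolean shouldWrite(long contentLengthBytes) {
        return contentLengthBytes <= maxContentBytes;
    }
}
```

A writer would call `shouldWrite` with the recorded content length before building the row, and skip (and probably log) the URL when it returns false.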

Original issue reported on code.google.com by andrew.p...@gmail.com on 29 Nov 2008 at 7:31

GoogleCodeExporter commented 9 years ago
The patch has been added to the 0.18 branch, and I just released 0.18.2 from
that branch. When hbase-0.19 is released, I'll merge branch 0.18 into the
trunk and release hbase-writer-0.19 from the trunk.

Original comment by ryan.justin.smith@gmail.com on 3 Dec 2008 at 10:00