'only_new_records' doesnt write any new records to HBase when set to 'true'

OpenSourceMasters / hbase-writer

HBase-Writer is a Java extension to the Heritrix open source crawler. Heritrix is written by the Internet Archive, and HBase-Writer enables Heritrix to store crawled content directly into HBase tables running on the Hadoop Distributed File System. By default, HBase-Writer writes crawled URL content into an HBase table as individual records: each fetched URL is represented by one "rowkey" in the table. HBase-Writer can also be extended for custom behavior, such as writing to multiple tables. In turn, these HBase tables are directly supported by the MapReduce framework via Hadoop. HBase-Writer's goal is to facilitate fast, large, distributed crawls using Heritrix and to save and manage Web-scale content using HBase.

http://opensourcemasters.org/

'only_new_records' doesnt write any new records to HBase when set to 'true' #9

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Configure a brand new crawl in Heritrix on a new table and set 'only_new_records' to 'true'.

What is the expected output? What do you see instead?
Heritrix fetches the data, but no records are written to HBase.

This will have to be tested a bit to see where this is happening.

Original issue reported on code.google.com by ryan.justin.smith@gmail.com on 13 Feb 2009 at 4:34

GoogleCodeExporter commented 9 years ago

Original comment by ryan.justin.smith@gmail.com on 13 Feb 2009 at 4:34

Changed title: 'only_new_records' doesnt download any records when set to 'true'

GoogleCodeExporter commented 9 years ago

Original comment by ryan.justin.smith@gmail.com on 13 Feb 2009 at 4:35

Changed title: 'only_new_records' doesnt write any new records to HBase when set to 'true'

GoogleCodeExporter commented 9 years ago

This has been tested with the logic now residing in shouldWrite() in HbaseWriterProcessor.java.

If you crawl a brand new site with "only_new_records" set to "true", it downloads all URLs that Heritrix is configured to fetch. If you run this exact same Heritrix job configuration a second time, no new records will be downloaded or written to HBase.

Original comment by ryan.justin.smith@gmail.com on 16 Feb 2009 at 6:58

Changed state: Fixed
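
For readers who want to see the behavior described in the last comment, here is a minimal sketch of an "only new records" check. It is not the actual code from HbaseWriterProcessor.java: the class name OnlyNewRecordsSketch, the crawl() helper, and the in-memory set standing in for the HBase table are all assumptions made for illustration. Only the method name shouldWrite() and the "only_new_records" setting come from the issue.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration only; not the real HbaseWriterProcessor. */
public class OnlyNewRecordsSketch {
    // Stand-in for the HBase table: the row keys (URLs) already stored.
    private final Set<String> storedRowKeys = new HashSet<>();
    // Mirrors the 'only_new_records' setting from the issue.
    private final boolean onlyNewRecords;

    public OnlyNewRecordsSketch(boolean onlyNewRecords) {
        this.onlyNewRecords = onlyNewRecords;
    }

    /** Decides whether a fetched URL should be written to the table. */
    boolean shouldWrite(String rowKey) {
        if (!onlyNewRecords) {
            return true; // setting disabled: always write
        }
        // Setting enabled: write only if the row key is not stored yet.
        return !storedRowKeys.contains(rowKey);
    }

    /** Simulates one crawl run and returns how many records were written. */
    public int crawl(List<String> urls) {
        int written = 0;
        for (String url : urls) {
            if (shouldWrite(url)) {
                storedRowKeys.add(url);
                written++;
            }
        }
        return written;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://example.com/", "http://example.com/about");
        OnlyNewRecordsSketch sketch = new OnlyNewRecordsSketch(true);
        System.out.println(sketch.crawl(urls)); // first run on an empty table: 2
        System.out.println(sketch.crawl(urls)); // same job again: 0
    }
}
```

On an empty table the first run writes every URL, and repeating the same run writes nothing, which matches the tested behavior reported above. The bug in this issue corresponds to the first run writing zero records.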