HBase-Writer is a Java extension to the Heritrix open-source crawler. Heritrix is written by the Internet Archive, and HBase-Writer enables Heritrix to store crawled content directly into HBase tables running on the Hadoop Distributed File System (HDFS). By default, HBase-Writer writes crawled URL content into an HBase table as individual records, or "rowkeys": each fetched URL is represented by a rowkey in an HBase table. HBase-Writer can also be extended for custom behavior, such as writing to multiple tables. In turn, these HBase tables are directly supported by Hadoop's MapReduce framework. HBase-Writer's goal is to facilitate fast, large, distributed crawls using Heritrix and to store and manage Web-scale content in HBase.
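Because each fetched URL becomes a rowkey, the rowkey format matters for scan locality in HBase. One common convention for URL-keyed tables is to reverse the host's domain components so that pages from the same site sort adjacently; the helper below is a hypothetical illustration of that idea, not HBase-Writer's actual rowkey code.

```java
import java.net.URI;

// Hypothetical sketch: turn a URL into a sortable rowkey by reversing the
// host's domain components, so "www.example.com/a" and "www.example.com/b"
// land next to each other in an HBase table scan.
class UrlToRowkey {
    static String toRowkey(String url) {
        URI u = URI.create(url);
        String[] parts = u.getHost().split("\\.");
        StringBuilder sb = new StringBuilder();
        // Reverse "www.example.com" into "com.example.www".
        for (int i = parts.length - 1; i >= 0; i--) {
            sb.append(parts[i]);
            if (i > 0) sb.append('.');
        }
        String path = (u.getRawPath() == null || u.getRawPath().isEmpty())
                ? "/" : u.getRawPath();
        return sb + path;
    }
}
```

With a scheme like this, a single HBase scan bounded by the reversed-domain prefix retrieves every crawled page for one site.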
=What steps will reproduce the problem?=
1. Crawl a small site.
2. Copy the crawl to a new job and set the only_new_records option to "true".
3. Crawl the site again.
If you monitor the wire, the content is fetched; it should not be fetched
when only_new_records is set to "true".
The fix is to move the only_new_records check from shouldProcess() to
shouldWrite().
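The distinction is that shouldProcess() decides whether the processor runs at all, while shouldWrite() gates only the write itself, which is where the fix places the check. The following is a minimal, self-contained sketch of that behavior under stated assumptions: the class name, constructor, and the in-memory set standing in for an HBase rowkey lookup are hypothetical, not HBase-Writer's actual code.

```java
import java.util.Set;

// Hypothetical sketch of the fix: the only_new_records check lives in
// shouldWrite(), which gates the actual write, rather than in
// shouldProcess(), which controls whether the processor runs at all.
class HBaseWriterSketch {
    private final boolean onlyNewRecords;       // mirrors the only_new_records option
    private final Set<String> existingRowkeys;  // stand-in for a lookup against the HBase table

    HBaseWriterSketch(boolean onlyNewRecords, Set<String> existingRowkeys) {
        this.onlyNewRecords = onlyNewRecords;
        this.existingRowkeys = existingRowkeys;
    }

    // Skip the write when only_new_records is enabled and the rowkey
    // already exists in the table; otherwise write as usual.
    boolean shouldWrite(String rowkey) {
        if (onlyNewRecords && existingRowkeys.contains(rowkey)) {
            return false;
        }
        return true;
    }
}
```

With the check here, a URI whose rowkey is already in HBase simply produces no new record, instead of being filtered out of the processor chain entirely.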
Original issue reported on code.google.com by ryan.justin.smith@gmail.com on 13 Feb 2009 at 12:03
The logic has been moved to shouldWrite() and tested further. On a few small
test crawls, Heritrix no longer downloads content that has already been
fetched into HBase.
Original comment by ryan.justin.smith@gmail.com on 16 Feb 2009 at 6:56