OpenSourceMasters / hbase-writer

HBase-Writer is a Java extension to the Heritrix open source crawler. Heritrix is written by the Internet Archive, and HBase-Writer enables Heritrix to store crawled content directly into HBase tables running on the Hadoop Distributed File System. By default, HBase-Writer writes crawled URL content into an HBase table as individual records: each fetched URL is represented by a "rowkey" in the table. However, HBase-Writer can easily be extended for custom behavior, such as writing to multiple tables. In turn, these HBase tables are directly supported by the MapReduce framework via Hadoop. HBase-Writer's goal is to facilitate fast, large distributed crawls using Heritrix and to save and manage Web-scale content using HBase.
http://opensourcemasters.org/

HBase-Writer doesn't pool HTable connections properly any longer. #14

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Use hbase-writer with Heritrix on large crawls running over several hours.
2. The process hangs after too many HTable instances are created.

What is the expected output? What do you see instead?
Heritrix should have shut down without hanging.

What version of the product are you using? On what operating system?
N/A

Please provide any additional information below.
The last update removed the pooling logic.  Greg Lu has submitted a patch to 
fix this.  Thanks Greg.
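
For readers unfamiliar with what "pooling logic" means here, the following is a minimal sketch of a bounded resource pool. It is not Greg Lu's patch and does not use the HBase API; `BoundedPool`, `borrow`, `giveBack`, and `createdCount` are hypothetical names chosen for illustration. The idea is that table handles are reused rather than created per write, so the number of live instances stays under a fixed cap.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical bounded pool: at most maxSize resources are ever created. */
class BoundedPool<T> {
    private final BlockingQueue<T> idle;      // resources waiting to be reused
    private final Supplier<T> factory;        // creates a new resource on demand
    private final int maxSize;
    private final AtomicInteger created = new AtomicInteger();

    BoundedPool(int maxSize, Supplier<T> factory) {
        this.maxSize = maxSize;
        this.factory = factory;
        this.idle = new ArrayBlockingQueue<>(maxSize);
    }

    /** Reuses an idle resource if possible, creates one while under the cap, otherwise waits. */
    T borrow() {
        T resource = idle.poll();
        if (resource != null) {
            return resource;
        }
        // Reserve a creation slot atomically so concurrent callers cannot exceed the cap.
        while (true) {
            int n = created.get();
            if (n >= maxSize) {
                break;
            }
            if (created.compareAndSet(n, n + 1)) {
                return factory.get();
            }
        }
        try {
            // Cap reached: block until another caller gives a resource back.
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a pooled resource", e);
        }
    }

    /** Returns a borrowed resource to the pool for reuse. */
    void giveBack(T resource) {
        idle.offer(resource);
    }

    /** Number of resources created so far; never exceeds maxSize. */
    int createdCount() {
        return created.get();
    }
}
```

In a writer, each write would call `borrow()`, use the handle, and call `giveBack()` in a `finally` block. Without the cap and the reuse step, every write creates a fresh instance, which matches the symptom described above.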

Original issue reported on code.google.com by ryan.justin.smith@gmail.com on 22 Jan 2012 at 2:55

GoogleCodeExporter commented 9 years ago
Greg Lu's patch from the trunk.

Original comment by ryan.justin.smith@gmail.com on 22 Jan 2012 at 2:57

GoogleCodeExporter commented 9 years ago
Changes committed to trunk.

Original comment by ryan.justin.smith@gmail.com on 22 Jan 2012 at 2:58

GoogleCodeExporter commented 9 years ago
Sorry, I handled connection pooling differently and forgot to include it. 
Anyway, glad to have a fix from Greg Lu.

Original comment by karthik...@gmail.com on 23 Jan 2012 at 4:23

GoogleCodeExporter commented 9 years ago
It can't be all your fault, Karthik! I reviewed the patch and committed it after 
testing it out myself. I looked at the code and just assumed it was some much 
simpler way of doing pooling. Heh, whoops! And yes, I'm very glad that Greg 
Lu has a large enough deployment that he can basically do testing on 
hbase-writer for us.

I just made a ticket to add a check for resource pooling to the unit test 
phase. But from now on, I'll run my test crawl with JMX enabled and log in 
using jconsole to check object creation before I commit any more patches. This 
way I can just run a short crawl to check for object creation.
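
One way such a check could be wired up is sketched below. This is an assumption about how it might be done, not code from hbase-writer: `CreationStats`, `Counter`, `CounterMXBean`, and the `ObjectName` string used by the caller are all hypothetical. The counter is incremented wherever a pooled resource is constructed and is published through the platform MBean server, so the attribute `CreatedCount` should show up in jconsole during a short crawl.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicInteger;
import javax.management.JMException;
import javax.management.ObjectName;

/** Hypothetical helper: counts resource creations and exposes the count over JMX. */
class CreationStats {
    /** JMX needs a public management interface; the MXBean suffix marks it as an MXBean. */
    public interface CounterMXBean {
        int getCreatedCount();
    }

    /** Thread-safe counter; call recordCreation() wherever a pooled resource is constructed. */
    public static class Counter implements CounterMXBean {
        private final AtomicInteger created = new AtomicInteger();

        public void recordCreation() {
            created.incrementAndGet();
        }

        @Override
        public int getCreatedCount() {
            return created.get();
        }
    }

    /** Registers the counter with the platform MBean server under the given name. */
    static ObjectName register(Counter counter, String name) {
        try {
            ObjectName objectName = new ObjectName(name);
            ManagementFactory.getPlatformMBeanServer().registerMBean(counter, objectName);
            return objectName;
        } catch (JMException e) {
            throw new IllegalStateException("could not register MBean " + name, e);
        }
    }

    /** Reads the CreatedCount attribute back through the MBean server. */
    static int readCreatedCount(ObjectName objectName) {
        try {
            Object value = ManagementFactory.getPlatformMBeanServer()
                    .getAttribute(objectName, "CreatedCount");
            return (Integer) value;
        } catch (JMException e) {
            throw new IllegalStateException("could not read CreatedCount", e);
        }
    }
}
```

A unit test could register a counter, run a few writes, and assert that `readCreatedCount` stays at or below the expected pool size, which would cover the same regression without a manual jconsole session.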

Original comment by ryan.justin.smith@gmail.com on 23 Jan 2012 at 1:10