bgarrels / crawler4j

Automatically exported from code.google.com/p/crawler4j

Crawler4j missing more control over retry count #261

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run the Basic Crawler with RobotServer enabled
2. Have "addeasy.netfirms.com" as the seed

What is the expected output? What do you see instead?
Expectation: addSeed should return promptly.
Current outcome: it blocks inside addSeed for a very long time.

What version of the product are you using?
All versions

Please provide any additional information below.

In my crawler framework I use crawler4j. Recently I ran into a big headache with the domain "addeasy.netfirms.com", which has around 300 A records in DNS. While downloading a page (PageFetcher), the HttpClient library blindly tries every one of those IPs, with no config option to limit this, because PoolingClientConnectionManager uses DefaultClientConnectionOperator, which loops over all resolved addresses.

On top of that, if an exception is raised, HttpClient retries 3 times, since its default retry count is 3.
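
For the retry part alone, HttpClient lets you swap in your own retry handler. A minimal sketch against the HttpClient 4.2.x DefaultHttpClient API (the factory class name is just illustrative):

```java
import org.apache.http.client.HttpClient;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.client.DefaultHttpRequestRetryHandler;

public class NoRetryHttpClientFactory {

    // Builds a client whose retry handler allows zero retries instead of
    // HttpClient's default of three.
    public static HttpClient create() {
        DefaultHttpClient client = new DefaultHttpClient();
        client.setHttpRequestRetryHandler(new DefaultHttpRequestRetryHandler(0, false));
        return client;
    }
}
```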

I could not find a configuration-level solution, so I modified crawler4j to use a custom PoolingClientConnectionManager with a modified ClientConnectionOperator.

After doing that, I got rid of the first issue: retrying on every resolved IP.
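
My change was roughly along these lines (a sketch only, against HttpClient 4.2.x where PoolingClientConnectionManager exposes createConnectionOperator; the class name is mine):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

import org.apache.http.conn.ClientConnectionOperator;
import org.apache.http.conn.scheme.SchemeRegistry;
import org.apache.http.impl.conn.DefaultClientConnectionOperator;
import org.apache.http.impl.conn.PoolingClientConnectionManager;

// Connection manager whose operator only ever tries the first resolved address,
// so a host with ~300 A records no longer causes hundreds of connect attempts.
public class SingleAddressConnectionManager extends PoolingClientConnectionManager {

    public SingleAddressConnectionManager(SchemeRegistry registry) {
        super(registry);
    }

    @Override
    protected ClientConnectionOperator createConnectionOperator(SchemeRegistry registry) {
        return new DefaultClientConnectionOperator(registry) {
            @Override
            protected InetAddress[] resolveHostname(String host) throws UnknownHostException {
                // Resolve normally, but keep only the first address.
                InetAddress[] all = InetAddress.getAllByName(host);
                return new InetAddress[] { all[0] };
            }
        };
    }
}
```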

But I learned that there is still another hidden issue: addSeed.

Because addSeed immediately tries to download robots.txt, it blocks right there in the main thread. So parallel crawling cannot start if addSeed is called from the main thread.

I have now solved this with a custom controller that collects all values passed to addSeed into a local collection; then in onBefore (which is called by the crawler threads) I call the real addSeed to load the data, as sketched below.

It would be great if this fix were made in crawler4j itself. :) Just wanted to share my learnings.
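
The workaround looks roughly like this (a sketch only; here I hang it off WebCrawler's onStart() hook and assume it runs once per crawler thread and that getMyController() is available; the class and queue names are mine):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

import edu.uci.ics.crawler4j.crawler.WebCrawler;

public class DeferredSeedCrawler extends WebCrawler {

    // Seeds collected on the main thread instead of calling addSeed there.
    private static final Queue<String> PENDING_SEEDS = new ConcurrentLinkedQueue<String>();

    public static void deferSeed(String url) {
        PENDING_SEEDS.add(url);
    }

    @Override
    public void onStart() {
        // Runs on a crawler thread, so the blocking robots.txt fetch inside
        // addSeed no longer stalls the main thread.
        String url;
        while ((url = PENDING_SEEDS.poll()) != null) {
            getMyController().addSeed(url);
        }
    }
}
```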

Original issue reported on code.google.com by jeba.ride@gmail.com on 15 Apr 2014 at 1:14

GoogleCodeExporter commented 9 years ago
Would it be possible to post some of the code here? I have the same problem: before the crawl starts, some pages take ages. It would help me a lot, thanks.

Original comment by ju...@gmx.net on 11 Jul 2014 at 9:59