ankurjain0985 / crawler4j

Automatically exported from code.google.com/p/crawler4j

HttpResponse response = httpClient.execute(get) in PageFetcher has no Timeout #320

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Do a large crawl of certain sites, such as http://forum.notebookreview.com/
2. Let it crawl as many pages as possible (10000+)
3. After a while, all threads will get stuck in a wait on 
httpClient.execute(get)

What is the expected output? What do you see instead?
I'm expecting the crawl to finish the site, however what I'm seeing is all my 
threads getting caught in a wait on httpClient.execute(get) leaving my crawl 
forever blocked (shutdown won't work either due to loop logic).

What version of the product are you using?
Using a fork of 3.5, but the code line should apply here as well.

Please provide any additional information below.

I ran into this issue crawling various sites using my own fork of 3.5+ and 
traced the problem to httpClient.execute(get) in PageFetcher, as all my threads 
were stuck in a wait. Figured this lib would want to catch this too.

httpClient.execute(get) should have a timeout here (configurable or otherwise). 
Here's some more info: 

http://stackoverflow.com/questions/9925113/httpclient-stuck-without-any-exception
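
To illustrate what a timeout around a blocking call like httpClient.execute(get) could look like, here is a minimal sketch that uses only the JDK. It is not crawler4j or PageFetcher code: the class name HardDeadline and the method callWithDeadline are hypothetical, and the blocking fetch is simulated with Thread.sleep. The idea is to run the call on a worker thread and stop waiting once a deadline passes.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of crawler4j.
public class HardDeadline {

    /**
     * Runs a blocking call on a worker thread and stops waiting for it
     * after timeoutMillis, throwing TimeoutException to the caller.
     */
    public static <T> T callWithDeadline(Callable<T> blockingCall, long timeoutMillis)
            throws TimeoutException, ExecutionException, InterruptedException {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = worker.submit(blockingCall);
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Interrupts the worker. This ends a sleeping or waiting thread,
                // but a thread blocked in a plain socket read ignores interrupts.
                future.cancel(true);
                throw e;
            }
        } finally {
            // Also interrupts the worker if it is still running.
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A call that returns well within the deadline.
        String body = callWithDeadline(() -> "fetched", 1000);
        System.out.println(body);

        // A simulated stuck fetch: the caller is released after 200 ms.
        try {
            callWithDeadline(() -> {
                Thread.sleep(60_000);
                return "never";
            }, 200);
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
    }
}
```

Because an interrupt does not unblock a socket read, a real fetcher would likely also need to abort the request itself when the deadline fires (with Apache HttpClient, for example, by calling abort() on the HttpGet), otherwise the worker thread can stay stuck even though the caller has moved on.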

Original issue reported on code.google.com by jordan.b...@gmail.com on 26 Nov 2014 at 7:21

GoogleCodeExporter commented 9 years ago
I've since realized that PageFetcher does set timeouts here:

    RequestConfig requestConfig = RequestConfig.custom()
        .setExpectContinueEnabled(false)
        .setCookieSpec(CookieSpecs.BROWSER_COMPATIBILITY)
        .setRedirectsEnabled(false)
        .setSocketTimeout(config.getSocketTimeout())
        .setConnectTimeout(config.getConnectionTimeout())
        .build();

Even with these settings I'm still seeing the wait(). However, I did realize my 
client builder did not have the .setConnectionManager(connectionManager) on it, 
so that could be it. 

Feel free to close this, I'll re-open if the problem exists here.

Original comment by jordan.b...@gmail.com on 26 Nov 2014 at 8:02

GoogleCodeExporter commented 9 years ago
Thank you Jordan.

I will close it for now, but please submit a new issue if you think the problem 
still persists.

Original comment by avrah...@gmail.com on 27 Nov 2014 at 9:02