asepaprianto / crawler4j

Automatically exported from code.google.com/p/crawler4j

Many URLs are discarded / not processed (missing in output) #255

Closed. GoogleCodeExporter closed this issue 9 years ago.

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. I am running crawler4j to find the status (HTTP response) code for one million URLs.
2. I get a proper response for 90% of the URLs, but the remaining 10% are missing from the output.

What is the expected output? What do you see instead?
The missing URLs never even reach the handlePageStatusCode() method of my extended WebCrawler class.
They are probably not processed due to various issues.
Is it possible to identify those missing URLs so that they can be reprocessed?
Can the crawling process be improved so that no URLs are missed?
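For reference, recording the status code of every fetched URL can be done by overriding handlePageStatusCode() in a WebCrawler subclass. The sketch below assumes the crawler4j 3.x/4.x signature handlePageStatusCode(WebURL, int, String); the class name and the bookkeeping map are illustrative and not part of crawler4j itself.

```java
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StatusCodeCrawler extends WebCrawler {

    // Shared, thread-safe map: URL -> last observed HTTP status code.
    // Hypothetical bookkeeping, not provided by crawler4j.
    private static final Map<String, Integer> STATUS_BY_URL = new ConcurrentHashMap<>();

    @Override
    protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription) {
        // Record the response code for every URL that was actually fetched.
        STATUS_BY_URL.put(webUrl.getURL(), statusCode);
    }

    public static Map<String, Integer> observedStatuses() {
        return STATUS_BY_URL;
    }
}
```

After the crawl finishes, any seed URL that is absent from the map was dropped before fetching and is a candidate for re-submission.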

What version of the product are you using?
3.5

Please provide any additional information below.

Original issue reported on code.google.com by abhi8...@gmail.com on 16 Feb 2014 at 11:32

GoogleCodeExporter commented 9 years ago
Better logging will be incorporated into the crawler in the near future.

That will make it possible to account for every dropped URL.

Original comment by avrah...@gmail.com on 11 Aug 2014 at 11:10

GoogleCodeExporter commented 9 years ago
In the latest commits we have added much more logging, as well as several hook methods that you can override to get complete control over all of the URLs that were dropped for various reasons.

Fixes can be found in revision hashes:
384d98113c86    
6bdccbd254d0    
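For illustration, hooks of this kind can be overridden in a WebCrawler subclass to log every dropped URL. The sketch below assumes the hook names and signatures found in later crawler4j releases (onContentFetchError, onParseError, onUnexpectedStatusCode); verify them against the version you are actually running. The class and logger names are illustrative.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class DroppedUrlLoggingCrawler extends WebCrawler {

    private static final Logger LOG = LoggerFactory.getLogger(DroppedUrlLoggingCrawler.class);

    @Override
    protected void onContentFetchError(WebURL webUrl) {
        super.onContentFetchError(webUrl);
        // URL was dropped because its content could not be fetched.
        LOG.warn("Dropped (fetch error): {}", webUrl.getURL());
    }

    @Override
    protected void onParseError(WebURL webUrl) {
        super.onParseError(webUrl);
        // URL was fetched but its content could not be parsed.
        LOG.warn("Dropped (parse error): {}", webUrl.getURL());
    }

    @Override
    protected void onUnexpectedStatusCode(String urlStr, int statusCode,
                                          String contentType, String description) {
        super.onUnexpectedStatusCode(urlStr, statusCode, contentType, description);
        // URL returned an unexpected status and was not processed further.
        LOG.warn("Dropped (status {}): {}", statusCode, urlStr);
    }
}
```

Collecting these warnings (or writing the URLs to a file instead of logging) gives a list of everything the crawler skipped, which can then be fed back in as new seeds.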

Original comment by avrah...@gmail.com on 17 Aug 2014 at 5:48