HHN / crawler4j

Open Source Web Crawler for Java - A fork of yasserg/crawler4j
Apache License 2.0

feature request: retry fetching page #79

Open brbog opened 2 years ago

brbog commented 2 years ago

During tests I observed a couple of times that a fetch failed because the server returned 0 bytes. Since the failure was not deterministic, a simple "retry" would probably work, but there is currently no way to get that behavior.

The "magic" happens inside the private WebCrawler.processPage() method. If a retry is requested after fetchResult = pageFetcher.fetchPage(curURL); has been performed, the rest of the logic in that method should still be executed for the retried result.
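To make the idea concrete, here is a minimal sketch of what such a retry could look like. It is not crawler4j code: the class RetryingFetch, the method fetchWithRetry and all of its parameters are hypothetical names invented for illustration, and the check that decides whether a result counts as "empty" is deliberately left to the caller.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

/** Hypothetical helper for illustration only; not part of crawler4j. */
final class RetryingFetch {

    private RetryingFetch() {
    }

    /**
     * Calls {@code fetch} up to {@code maxAttempts} times and returns the
     * first result accepted by {@code isUsable}. If no attempt is accepted,
     * the last result is returned so the caller's normal handling still runs.
     */
    static <T> T fetchWithRetry(Callable<T> fetch,
                                Predicate<T> isUsable,
                                int maxAttempts,
                                long delayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        T result = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            result = fetch.call();
            if (isUsable.test(result)) {
                return result;
            }
            // Wait before the next attempt, but not after the final one.
            if (attempt < maxAttempts && delayMillis > 0) {
                Thread.sleep(delayMillis);
            }
        }
        return result;
    }
}
```

Inside processPage() the single call could then be wrapped, for example as fetchWithRetry(() -> pageFetcher.fetchPage(curURL), isUsable, 3, 500), with the rest of the method unchanged. Returning the last result instead of throwing is a design choice made here so that the existing status handling is not bypassed; the attempt count and delay would presumably need to be configurable.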

brbog commented 2 years ago

Just raising this as a possible improvement for anyone who wants to contribute something :-). Creating a good test for this (using WireMock?) is rather important, but requires some effort I currently can't commit to :-(.