Huge throughput improvement

GoogleCodeExporter commented 9 years ago

Currently each fetch of a single page is done by one thread, and while that 
fetch is in progress the thread does nothing. 

We could make it significantly more efficient by moving to async I/O: submit a 
large number of requests and process each one as soon as it is ready. 

We would also need separate threads for submitting fetch requests, 
parsing HTML pages... 

This is a major architecture change.
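
A rough sketch of the idea, not crawler4j code: all fetches are submitted up front without blocking, and each response is handed to a separate parser thread pool as soon as it arrives. Every name here (AsyncFetchSketch, ParsedPage, crawl, the fetcher parameter) is hypothetical, and the "parse" step only measures the body length as a stand-in. The fetcher is passed in as a function so the example runs without a network; in a real crawler it might wrap something like java.net.http.HttpClient.sendAsync (Java 11+), which is an assumption and not what crawler4j uses today.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

public class AsyncFetchSketch {

    /** Hypothetical result type: the URL plus whatever the parse step extracted. */
    public static final class ParsedPage {
        public final String url;
        public final int length;

        public ParsedPage(String url, int length) {
            this.url = url;
            this.length = length;
        }
    }

    /**
     * Submits every fetch without blocking, then parses each body on
     * parserPool as soon as its fetch completes.
     */
    public static List<ParsedPage> crawl(
            List<String> urls,
            Function<String, CompletableFuture<String>> fetcher,
            ExecutorService parserPool) {
        // Start all requests first; no thread waits on an individual fetch.
        List<CompletableFuture<ParsedPage>> pending = urls.stream()
                .map(url -> fetcher.apply(url)
                        // Placeholder "parse": runs on the parser pool, not the fetch thread.
                        .thenApplyAsync(body -> new ParsedPage(url, body.length()), parserPool))
                .collect(Collectors.toList());

        // Only now block, collecting results in the original URL order.
        return pending.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        ExecutorService parserPool = Executors.newFixedThreadPool(4);
        try {
            // Fake fetcher standing in for real async HTTP I/O.
            Function<String, CompletableFuture<String>> fakeFetcher =
                    url -> CompletableFuture.supplyAsync(() -> "<html>" + url + "</html>");
            for (ParsedPage page : crawl(List.of("a", "bb"), fakeFetcher, parserPool)) {
                System.out.println(page.url + " -> " + page.length);
            }
        } finally {
            parserPool.shutdown();
        }
    }
}
```

Keeping the fetcher and the parser pool as parameters means the number of in-flight requests and the number of parsing threads can be tuned independently, which is the separation the proposal asks for.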

Original issue reported on code.google.com by avrah...@gmail.com on 10 Aug 2014 at 12:06

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:51

Changed state: Accepted
Added labels: Priority-Low
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

For reference:
https://code.google.com/p/crawler4j/issues/detail?id=61

Original comment by avrah...@gmail.com on 28 Aug 2014 at 5:14
