Closed by MichaelAquilina 10 years ago
Ignoring the rules specified in robots.txt can result in your IP address being banned, which is obviously counterproductive. A reasonable rate limit is 1 page per second for each domain. This means the webcrawler should stagger its threads so that they fetch from other domains while the 1 second wait on a given domain completes.
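A minimal sketch of what such a per-domain limiter could look like (the class name `DomainRateLimiter` and its methods are illustrative, not part of any existing code in this project):

```python
import threading
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Allow at most one request per `delay` seconds for each domain."""

    def __init__(self, delay=1.0):
        self.delay = delay
        self.last_access = {}  # domain -> monotonic timestamp of last request
        self.lock = threading.Lock()

    def wait_time(self, url):
        """Seconds a thread must still wait before hitting this URL's domain.

        A scheduler can use this to pick a URL from a different domain
        instead of blocking, which is the staggering described above.
        """
        domain = urlparse(url).netloc
        with self.lock:
            last = self.last_access.get(domain)
            if last is None:
                return 0.0
            return max(0.0, self.delay - (time.monotonic() - last))

    def acquire(self, url):
        """Block until the domain's cooldown has elapsed, then record the access."""
        domain = urlparse(url).netloc
        while True:
            with self.lock:
                now = time.monotonic()
                last = self.last_access.get(domain)
                if last is None or now - last >= self.delay:
                    self.last_access[domain] = now
                    return
                remaining = self.delay - (now - last)
            time.sleep(remaining)
```

A worker thread would call `wait_time()` first and, if the delay is nonzero, pull a URL for a different domain from the queue rather than sleeping, so no thread sits idle while another domain is ready to be crawled.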
Requirements: