MichaelAquilina / Reddit-Recommender-Bot

Identifying Interesting Documents for Reddit using Recommender Techniques

Create a webcrawler that follows standard conventions #86

Closed MichaelAquilina closed 10 years ago

MichaelAquilina commented 10 years ago

Requirements:

MichaelAquilina commented 10 years ago

Ignoring the rules specified in robots.txt can result in your IP address being banned, which is obviously counterproductive. A good rate to use seems to be 1 page per second for each domain. This means the webcrawler should stagger its threads so that it accesses another domain while waiting for the 1 second delay to elapse on the first.
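A minimal sketch of the per-domain staggering idea, assuming Python (the class and function names below are illustrative, not part of the project): track the last access time for each domain and, when picking the next URL from the queue, skip over URLs whose domain was hit less than a second ago.

```python
import time
from collections import deque
from urllib.parse import urlparse


class DomainThrottle:
    """Remember the last access time per domain and enforce a minimum delay."""

    def __init__(self, delay=1.0):
        self.delay = delay
        self.last_access = {}  # domain -> last access timestamp

    def ready(self, url, now=None):
        """Return True if the URL's domain has not been hit within `delay` seconds."""
        domain = urlparse(url).netloc
        now = time.monotonic() if now is None else now
        last = self.last_access.get(domain)
        return last is None or now - last >= self.delay

    def mark(self, url, now=None):
        """Record that the URL's domain was just accessed."""
        domain = urlparse(url).netloc
        self.last_access[domain] = time.monotonic() if now is None else now


def next_url(queue, throttle, now=None):
    """Pop the first queued URL whose domain is ready; requeue the rest.

    Returns None if every queued domain is still within its delay window,
    in which case the caller should sleep briefly and try again.
    """
    for _ in range(len(queue)):
        url = queue.popleft()
        if throttle.ready(url, now):
            throttle.mark(url, now)
            return url
        queue.append(url)  # domain not ready yet, push to the back
    return None
```

With this scheme a single worker (or several) naturally interleaves domains: while `a.com` is cooling down, the queue hands out a `b.com` URL instead of blocking. Robots.txt compliance would sit alongside this, e.g. via the standard library's `urllib.robotparser`, filtering URLs before they ever enter the queue.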