ScottMansfield / widow

Distributed, asynchronous web crawler
GNU Lesser General Public License v2.1
26 stars 4 forks

Throttle requests to a domain by total bandwidth in a specified period of time #19

Open ScottMansfield opened 8 years ago

ScottMansfield commented 8 years ago

from @truthpickle via livecoding.tv:

The crawler should be able to keep track of the total bandwidth used per domain and limit it to a specified amount in a specified period of time, e.g. 1 GB / month or 400 MB / week. Once the limit for a domain is reached, the fetch stage can simply stop retrieving pages from that domain until the period ends. At the parse stage a little softness is acceptable, since the size of a page is only known after it has been downloaded, but if a page pushes the domain too far past the limit it should be dropped from the pipeline.
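
A rough sketch of how this could work, not tied to widow's actual code. Everything here is an assumption for illustration: the `BandwidthThrottle` class, its method names, the fixed (non-sliding) window per domain, and the `softness` fraction (10% by default) that decides how far past the limit a page may go before it is dropped.

```python
import time
from collections import defaultdict


class BandwidthThrottle:
    """Hypothetical per-domain byte budget over a fixed time window."""

    def __init__(self, limit_bytes, period_seconds, softness=0.10,
                 clock=time.monotonic):
        self.limit_bytes = limit_bytes
        self.period_seconds = period_seconds
        self.softness = softness          # allowed overshoot, as a fraction
        self.clock = clock                # injectable so it can be tested
        self._window_start = {}           # domain -> start of current window
        self._used = defaultdict(int)     # domain -> bytes used in window

    def _roll_window(self, domain):
        # Start a fresh window (and reset the counter) once the period is over.
        now = self.clock()
        start = self._window_start.get(domain)
        if start is None or now - start >= self.period_seconds:
            self._window_start[domain] = now
            self._used[domain] = 0

    def used(self, domain):
        self._roll_window(domain)
        return self._used[domain]

    def may_fetch(self, domain):
        # Fetch stage: do not retrieve pages once the limit has been reached.
        return self.used(domain) < self.limit_bytes

    def record(self, domain, num_bytes):
        # Parse stage: the bytes were downloaded, so always count them.
        # Returns False when the total is too far past the limit and the
        # page should be dropped from the pipeline.
        self._roll_window(domain)
        self._used[domain] += num_bytes
        return self._used[domain] <= self.limit_bytes * (1 + self.softness)


MB = 1000 * 1000
WEEK = 7 * 24 * 60 * 60

# Example from the issue text: 400 MB / week per domain.
throttle = BandwidthThrottle(limit_bytes=400 * MB, period_seconds=WEEK)
```

The fetch stage would call `may_fetch(domain)` before each request, and the parse stage would call `record(domain, len(body))` and drop the page when it returns `False`. In a distributed crawler the counters would have to live in shared storage rather than in process memory; that part is left out here.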