from @truthpickle via livecoding.tv:
The crawler should keep track of the total bandwidth used per domain and cap it at a specified amount over a specified period, e.g. 1 GB/month or 400 MB/week. Once a domain's limit is reached, the fetch stage should simply stop retrieving its pages. At the parse stage a little leeway is acceptable, but if the limit has been exceeded by too much, the page should be dropped from the pipeline.
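A minimal sketch of what this could look like, assuming a simple fixed-window byte counter keyed by domain. All names here (DomainBandwidthBudget, may_fetch, may_parse, soft_factor) are hypothetical and not part of the existing codebase; the real implementation might track windows differently (e.g. calendar months instead of rolling periods).

```python
import time
from collections import defaultdict


class DomainBandwidthBudget:
    """Tracks bytes downloaded per domain within a fixed period (e.g. 1 GB / month)."""

    def __init__(self, limit_bytes, period_seconds, soft_factor=1.1):
        self.limit_bytes = limit_bytes
        self.period_seconds = period_seconds
        self.soft_factor = soft_factor            # leeway allowed at the parse stage
        self._used = defaultdict(int)             # domain -> bytes used in current period
        self._period_start = {}                   # domain -> start time of current period

    def _roll(self, domain):
        # Reset the counter when the domain's current period has elapsed.
        now = time.monotonic()
        start = self._period_start.setdefault(domain, now)
        if now - start >= self.period_seconds:
            self._period_start[domain] = now
            self._used[domain] = 0

    def record(self, domain, nbytes):
        """Add downloaded bytes to the domain's running total."""
        self._roll(domain)
        self._used[domain] += nbytes

    def may_fetch(self, domain):
        """Hard check for the fetch stage: stop retrieving once the limit is reached."""
        self._roll(domain)
        return self._used[domain] < self.limit_bytes

    def may_parse(self, domain):
        """Soft check for the parse stage: allow a small overshoot,
        but drop the page if the limit is exceeded by too much."""
        self._roll(domain)
        return self._used[domain] < self.limit_bytes * self.soft_factor
```

Usage would look roughly like this: the fetch stage checks `may_fetch` before issuing a request and calls `record` with the response size afterwards, while the parse stage checks `may_parse` and drops the page if it returns False.

```python
budget = DomainBandwidthBudget(limit_bytes=10**9, period_seconds=30 * 24 * 3600)  # ~1 GB / month

if budget.may_fetch("example.com"):
    body = b"...fetched page bytes..."          # placeholder for the real fetch
    budget.record("example.com", len(body))
```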