crawler-commons / url-frontier

API definition, resources and reference implementation of URL Frontiers
Apache License 2.0

Ability to set Limit on domain #90

Closed. zaibacu closed this issue 1 month ago.

zaibacu commented 2 months ago

We have a business use case where we want to crawl only up to a certain limit of URLs per domain. Once the domain becomes refetchable, the counter would simply be reset and fetching would start again.

As far as I can see, there's no option to do that currently? I'm willing to contribute custom code for this; I just want to make sure it fits the overall design of the library, or maybe there's already an option to configure that?

jnioche commented 2 months ago

Hi @zaibacu. One option would be to externally track the number of URLs already fetched for a queue and block it with BlockQueueUntil; it currently can't be done within URLFrontier. If you were to do that, you could use the getCountCompleted info for a queue.
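
Roughly something along these lines on the crawler side (just a sketch: the `FrontierClient` wrapper, the limit and the refetch interval are placeholders, not part of the URLFrontier API):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Client-side per-queue URL budget. Counts the URLs fetched for each queue
 * (domain) and, once the budget is reached, blocks the queue in the frontier
 * until the next refetch window, then resets the counter.
 */
public class QueueBudgetTracker {

    /** Hypothetical wrapper around the frontier's BlockQueueUntil call. */
    public interface FrontierClient {
        void blockQueueUntil(String queueKey, Instant until);
    }

    private final FrontierClient frontier;
    private final int maxUrlsPerQueue;
    private final Duration refetchInterval;
    private final Map<String, Integer> fetchedCounts = new ConcurrentHashMap<>();

    public QueueBudgetTracker(FrontierClient frontier, int maxUrlsPerQueue,
            Duration refetchInterval) {
        this.frontier = frontier;
        this.maxUrlsPerQueue = maxUrlsPerQueue;
        this.refetchInterval = refetchInterval;
    }

    /** Call after each successfully fetched URL belonging to queueKey. */
    public void onUrlFetched(String queueKey) {
        int count = fetchedCounts.merge(queueKey, 1, Integer::sum);
        if (count >= maxUrlsPerQueue) {
            // Budget exhausted: block the queue until the refetch window
            // opens, then start counting from zero again.
            frontier.blockQueueUntil(queueKey, Instant.now().plus(refetchInterval));
            fetchedCounts.remove(queueKey);
        }
    }
}
```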

It could be an interesting feature; happy to discuss it further if you are open to contributing it.

zaibacu commented 2 months ago

Thank you! I think I'll start with BlockQueueUntil since I need a quick solution, and later come back with a feature for URLFrontier itself.

jnioche commented 1 month ago

Fixed in #91