crawler-commons / url-frontier

API definition, resources and reference implementation of URL Frontiers
Apache License 2.0

Multithread reading from queues #41

Closed jnioche closed 1 year ago

jnioche commented 2 years ago

It can currently take a while for the service to submit URLs from the queues when the number of URLs gets large. This is partly due to the fact that a call to the getURLs endpoint iterates over the queues sequentially, and retrieving URLs for a queue takes longer and longer as the frontier grows. This is not noticeable early in a crawl but becomes more of an issue later on. Looking at the CPU usage of the Frontier, only one or two cores are busy. If we had a pool of threads getting candidates from the queues in parallel, we'd be able to mobilise more of the CPUs and make the operation faster.
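
To make the idea concrete, here is a minimal sketch of reading from the queues with a thread pool. It is not the Frontier's actual code: `ParallelQueueReader`, `candidatesFor`, and the in-memory `Map` of queues are hypothetical stand-ins for the service's real storage and per-queue lookup, and the only library calls used are the standard `java.util.concurrent` ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration, not the url-frontier implementation.
class ParallelQueueReader {

    // Stand-in for the per-queue lookup (assumption): return at most
    // maxPerQueue URLs from the head of one queue.
    static List<String> candidatesFor(List<String> queue, int maxPerQueue) {
        int n = Math.min(maxPerQueue, queue.size());
        return new ArrayList<>(queue.subList(0, n));
    }

    // Submit one task per queue to a fixed-size pool, then merge the results.
    static List<String> getURLs(Map<String, List<String>> queues,
                                int maxPerQueue, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (List<String> queue : queues.values()) {
                tasks.add(() -> candidatesFor(queue, maxPerQueue));
            }
            List<String> result = new ArrayList<>();
            // invokeAll blocks until every task is done and returns the
            // futures in the same order as the submitted tasks.
            for (Future<List<String>> future : pool.invokeAll(tasks)) {
                result.addAll(future.get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

In a real service the pool would presumably be created once and reused across calls rather than per request, and the per-queue lookup would need to be safe to call concurrently with writes to the same queue.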