It can currently take a while for the service to return URLs from the queues when the number of URLs gets large. This is due partly to the fact that a call to the getURLs endpoint iterates sequentially over the queues, and retrieving URLs for a queue takes longer and longer. This is not noticeable early in a crawl but becomes more of an issue as the frontier grows.
The CPU usage of the Frontier shows that only one or two cores are busy. If we had a pool of threads getting candidates from the queues in parallel, we'd be able to mobilise more of the CPUs and make the operation faster.
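
To make the proposal concrete, here is a minimal sketch of fanning the per-queue lookups out to a fixed thread pool. It is an illustration only, not the Frontier's actual code: the class `ParallelQueueFetch`, the helper `candidatesFor`, the `maxPerQueue` and `threads` parameters, and the in-memory `Map` standing in for the queues are all hypothetical. It also assumes the queues can be read safely from several threads at once.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelQueueFetch {

    // Hypothetical stand-in for the per-queue work done inside getURLs:
    // returns up to maxPerQueue candidates from one queue.
    public static List<String> candidatesFor(List<String> queue, int maxPerQueue) {
        return new ArrayList<>(queue.subList(0, Math.min(maxPerQueue, queue.size())));
    }

    // Submits one task per queue to a fixed-size pool instead of
    // iterating over the queues sequentially.
    public static List<String> fetchAll(Map<String, List<String>> queues,
                                        int maxPerQueue, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (List<String> queue : queues.values()) {
                tasks.add(() -> candidatesFor(queue, maxPerQueue));
            }
            List<String> result = new ArrayList<>();
            // invokeAll blocks until every task has finished and returns
            // the futures in the same order as the task list.
            for (Future<List<String>> f : pool.invokeAll(tasks)) {
                result.addAll(f.get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, List<String>> queues = new LinkedHashMap<>();
        queues.put("a.com", List.of("a1", "a2", "a3"));
        queues.put("b.com", List.of("b1"));
        System.out.println(fetchAll(queues, 2, 4)); // prints [a1, a2, b1]
    }
}
```

In a real service the pool would more likely be created once and shared across calls rather than built per request, and its size would probably need to be configurable so it can be matched to the number of available cores.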