crawler-commons / url-frontier

API definition, resources and reference implementation of URL Frontiers
Apache License 2.0

Remove lock in putURLs for RocksDB service #28

Closed jnioche closed 3 years ago

jnioche commented 3 years ago

putURLs is vital for performance: updates and additions to the frontier are made continuously, in a streaming fashion, and, given the nature of crawling, the method is called a very large number of times.

The implementation in 0.2 uses a monitor on the queues map, originally intended to prevent adding to a queue while it is being deleted. Queue deletions happen very infrequently (at least compared to putting URLs), but because of this lock, multiple threads block each other within putURLs even when no queue deletion is in progress, which is completely unnecessary.

The profiler I am using on a crawl shows this blocking happening millions of times per thread and wasting hundreds of seconds.

Instead of using a monitor, the deleteQueue method could simply record the queue being deleted in a map. putURLs would then only have to check whether the current queue is being deleted and skip the URL if so. An added benefit is that the deletion would be done just once even if deleteQueue is called twice for the same queue. This also makes more sense as far as logic is concerned.
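
A rough sketch of what this could look like, to make the idea concrete. This is not the actual RocksDB service code: the class `FrontierSketch`, the field names, the simplified `putURL` / `deleteQueue` signatures and the in-memory lists are all hypothetical stand-ins, and a concurrent set is used here where the description above says "map".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only; names and signatures are assumptions, not the real service.
class FrontierSketch {

    // queue key -> URLs; stands in for the real queue storage
    private final Map<String, List<String>> queues = new ConcurrentHashMap<>();

    // keys of queues whose deletion is currently in progress
    private final Set<String> queuesBeingDeleted = ConcurrentHashMap.newKeySet();

    /** Stores the URL and returns true, or returns false if the queue is being deleted. */
    boolean putURL(String queueKey, String url) {
        // lock-free membership check instead of a monitor on the queues map
        if (queuesBeingDeleted.contains(queueKey)) {
            return false; // skip the URL
        }
        queues.computeIfAbsent(queueKey,
                k -> Collections.synchronizedList(new ArrayList<>())).add(url);
        return true;
    }

    /** Returns the number of URLs removed, or 0 if this queue is already being deleted. */
    int deleteQueue(String queueKey) {
        // add() returns false if another call has already marked this queue
        if (!queuesBeingDeleted.add(queueKey)) {
            return 0;
        }
        try {
            List<String> removed = queues.remove(queueKey);
            return removed == null ? 0 : removed.size();
        } finally {
            // deletion finished: the key can accept URLs again
            queuesBeingDeleted.remove(queueKey);
        }
    }
}
```

Note that the check in `putURL` is a plain check-then-act: a URL that passes the check just before a deletion starts can still be written and may survive in a freshly created queue. Whether that is acceptable depends on what guarantees deleteQueue is expected to give; the sketch only shows that writers no longer block each other in the common case where nothing is being deleted.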