crawler-commons / url-frontier

API definition, resources and reference implementation of URL Frontiers
Apache License 2.0

RocksDB backend - faster restarts #54

Closed by jnioche 2 years ago

jnioche commented 2 years ago

When the RocksDB-based service is restarted, startup can take a substantial amount of time because it has to go through the whole URL table to rebuild the information about the queues, namely the number of active URLs each queue contains and the number of URLs already processed. What we could do instead, on a clean shutdown, is populate a table containing the queue names along with these counts. On restart, if that table exists, it would only be a matter of reading the data from it instead of scanning the whole URL table. Once read, the table would be deleted. After a crash, no such table would have been written, so we would fall back to the existing mechanism.
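The idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the project's code: plain Python dicts stand in for the RocksDB tables, and the names `urls`, `queue_counts`, `rebuild_by_scan`, `write_snapshot` and `load_counts` are hypothetical.

```python
from collections import defaultdict

# Hypothetical layout: store["urls"] maps (queue, url) -> completed flag,
# store["queue_counts"] is the optional snapshot table written on clean shutdown.


def rebuild_by_scan(url_table):
    """Slow path: walk every URL entry and recount per queue."""
    counts = defaultdict(lambda: [0, 0])  # queue -> [active, completed]
    for (queue, _url), completed in url_table.items():
        counts[queue][1 if completed else 0] += 1
    return {queue: tuple(c) for queue, c in counts.items()}


def write_snapshot(store, counts):
    """Called only on a clean shutdown; never reached if the process crashes."""
    store["queue_counts"] = dict(counts)


def load_counts(store):
    """Startup: use the snapshot if present, otherwise fall back to the scan."""
    # pop() reads the snapshot and deletes it in one step, so a stale copy
    # cannot be picked up after a later crash.
    snapshot = store.pop("queue_counts", None)
    if snapshot is not None:
        return snapshot, "snapshot"
    return rebuild_by_scan(store["urls"]), "scan"


store = {"urls": {("a.com", "u1"): False,
                  ("a.com", "u2"): True,
                  ("b.org", "u3"): False}}

counts, source = load_counts(store)       # first start: no snapshot, full scan
write_snapshot(store, counts)             # clean shutdown
counts2, source2 = load_counts(store)     # restart: fast path
counts3, source3 = load_counts(store)     # restart after a crash: scan again
```

Deleting the snapshot as soon as it is read is what keeps the crash case safe in this sketch: if the service dies later without writing a fresh snapshot, the next start finds nothing and falls back to the scan.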

jnioche commented 2 years ago

before: 333049 queues discovered in 61688 msec

after: 333049 queues discovered in 1695 msec