crawler-commons / url-frontier

API definition, resources and reference implementation of URL Frontiers
Apache License 2.0

Faster recovery for RocksDB service implementation #52

Closed by jnioche 2 years ago

jnioche commented 2 years ago

In versions before 2.1, recovering the data when the RocksDB service restarts is pretty slow. This is because it reads all the data from both tables in order to check that the count of active URLs in the queues table matches what is found in the default table without a value (i.e. no refetching planned for it).

This is not strictly necessary: the info about the queues can be regenerated by reading only the default table. The check is now optional and is triggered by the config `rocksdb.recovery.check`.
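
To illustrate the idea, here is a minimal sketch in plain Java, not the actual service code. It uses in-memory `TreeMap`s as stand-ins for the two RocksDB tables. The class `RecoverySketch`, the `QueueCounts` type, the `queueID_URL` key layout and the reading of an empty value as "completed" are all assumptions made for the example and may not match the real implementation. The `check` flag plays the role of `rocksdb.recovery.check`: the second scan over the queues table only happens when it is set.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;

public class RecoverySketch {

    /** Per-queue counters rebuilt during recovery (hypothetical type). */
    public static final class QueueCounts {
        public int scheduled;
        public int completed;
    }

    /** Assumed key layout: the queue ID is everything before the first '_'. */
    static String queueOf(String key) {
        int sep = key.indexOf('_');
        return sep < 0 ? key : key.substring(0, sep);
    }

    /**
     * Rebuilds the per-queue counters with a single scan of the default table.
     * The queues table is only read when check is true.
     */
    public static Map<String, QueueCounts> recover(
            SortedMap<String, String> defaultTable,
            SortedMap<String, String> queuesTable,
            boolean check) {

        Map<String, QueueCounts> queues = new HashMap<>();

        // Single pass over the default table.
        for (Map.Entry<String, String> e : defaultTable.entrySet()) {
            QueueCounts c =
                    queues.computeIfAbsent(queueOf(e.getKey()), k -> new QueueCounts());
            // Assumption: an empty value means no refetch is planned.
            if (e.getValue().isEmpty()) {
                c.completed++;
            } else {
                c.scheduled++;
            }
        }

        if (check) {
            // Optional, slower path: count the entries per queue in the queues table.
            Map<String, Integer> seen = new HashMap<>();
            for (String key : queuesTable.keySet()) {
                seen.merge(queueOf(key), 1, Integer::sum);
            }
            // Every queue rebuilt from the default table must agree with that count.
            for (Map.Entry<String, QueueCounts> e : queues.entrySet()) {
                int inQueuesTable = seen.getOrDefault(e.getKey(), 0);
                if (inQueuesTable != e.getValue().scheduled) {
                    throw new IllegalStateException(
                            "Count mismatch for queue " + e.getKey());
                }
            }
            // The queues table must not mention a queue the default table lacks.
            for (String queue : seen.keySet()) {
                if (!queues.containsKey(queue)) {
                    throw new IllegalStateException("Unknown queue " + queue);
                }
            }
        }
        return queues;
    }
}
```

With `check` set to `false`, the method touches only the default table, which is where the time saving would come from in this sketch.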

A test on a small crawl showed a reduction in recovery time from 2.8s to 1.6s.