crawler-commons / url-frontier

API definition, resources and reference implementation of URL Frontiers
Apache License 2.0

ShardedRocksDBService does not return ack when identical URLs are sent in short succession #67

Closed jnioche closed 2 years ago

jnioche commented 2 years ago

This is due to

https://github.com/crawler-commons/url-frontier/blob/master/service/src/main/java/crawlercommons/urlfrontier/service/cluster/DistributedFrontierService.java#L141

Either the cache should hold a list of all the incoming messages associated with a URL or, as a simpler option, we block when a URL is about to be sent to an external Frontier while something is already being processed for it.
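The simpler option could look roughly like the sketch below. This is an illustration only, not the code in `DistributedFrontierService`: the class `InFlightUrlGate` and its methods are hypothetical names, and it assumes the caller invokes `release` once the ack for a URL has come back from the remote Frontier.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper, not part of url-frontier: allows at most one in-flight message per URL. */
final class InFlightUrlGate {
    private final Set<String> inFlight = new HashSet<>();

    /** Blocks the caller until no other message for this URL is in flight, then claims it. */
    synchronized void acquire(String url) throws InterruptedException {
        while (inFlight.contains(url)) {
            wait(); // woken by release(); the loop re-checks the condition
        }
        inFlight.add(url);
    }

    /** Non-blocking variant: claims the URL and returns true only if it was free. */
    synchronized boolean tryAcquire(String url) {
        return inFlight.add(url);
    }

    /** Frees the URL (assumed to be called when its ack arrives) and wakes blocked senders. */
    synchronized void release(String url) {
        if (inFlight.remove(url)) {
            notifyAll();
        }
    }
}
```

With a gate like this, a second message for the same URL would wait in `acquire` until the first one has been acked, so each incoming message still gets its own ack. The trade-off is that the sending thread is held up for the duration of the remote call.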

jnioche commented 2 years ago

Also fixed a situation where the original stream from the client had already been closed but the remote Frontier had not yet finished its work.
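One way to handle this kind of race is to defer completing the response until both the client stream is closed and every forwarded message has been answered. The sketch below is an assumption about the general approach, not the actual fix: `DeferredCompletion` and its method names are hypothetical, and the `Runnable` stands in for whatever completes the response to the client.

```java
/** Hypothetical helper, not part of url-frontier: completes the response exactly once. */
final class DeferredCompletion {
    private final Runnable completeResponse; // stand-in for completing the client response
    private int pendingRemoteCalls = 0;
    private boolean clientClosed = false;
    private boolean completed = false;

    DeferredCompletion(Runnable completeResponse) {
        this.completeResponse = completeResponse;
    }

    /** Call before forwarding a message to the remote Frontier. */
    synchronized void remoteCallStarted() {
        pendingRemoteCalls++;
    }

    /** Call when the remote Frontier has answered (successfully or not). */
    synchronized void remoteCallFinished() {
        pendingRemoteCalls--;
        maybeComplete();
    }

    /** Call when the client closes its stream; assumes no remote call starts after this. */
    synchronized void clientStreamClosed() {
        clientClosed = true;
        maybeComplete();
    }

    synchronized boolean isCompleted() {
        return completed;
    }

    // Completes only when the client is done AND nothing is still pending remotely.
    private void maybeComplete() {
        if (clientClosed && pendingRemoteCalls == 0 && !completed) {
            completed = true;
            completeResponse.run();
        }
    }
}
```

Whichever event happens last, the client closing its stream or the final remote answer arriving, triggers the completion, so acks from the remote Frontier are not lost when the client finishes early.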