Closed sebastian-nagel closed 5 years ago
thanks @sebastian-nagel, get the problem, will think about the best way to implement it
@sebastian-nagel have put a fix in branch 684, see commit above
you'll need to add es.routing.value to the metadata.persist configuration, even though that particular metadata doesn't actually get stored as it is the only way of passing it to the statusupdaterbolt.
Let me know what you think
Thanks, I'll test it during the next days. To pick the _routing
value from ES is of course the most reliable solution, maybe better than using metadata.hostname
(see es.status.bucket.field
resp. es.status.routing.fieldname
).
Am not super happy with this solutiion. It prevents URLS from being refetched forever but means that URLs discovered for the same domain might end up on a different shard. We'd need to extend it to the other backends as well as they might suffer from the same issue. We could add deletion of the old entries but the code would be more complex for an issue which will affect a tiny portion of the users i.e. those who have a long-running crawl + have updated the dependency on crawler-commons either via a new storm-crawler or directly + use domain for sharding.
I also don't like the special case for es.routing.value.
Instead, we could have an external mechanism for reindexing an entire status index. I'll also add a comment in the release note so that people force a version of the domain list or crawler-commons if they upgrade.
what do you think @sebastian-nagel ?
The reindexing could be done as a topology with the StatusUpdaterBolt, similar to what we do when injecting except that the content would come from a bespoke Spout which would simply read all the content from a shard with a Search Scroll. We'd just need a new spout implementation for this.
No question, looks like there is no smart and simple solution. From the maintenance perspective:
Given that it's possible to quickly get a list of domain names (as routing keys or metadata.hostname
):
kiev.ua
):I'm ok with documenting the potential problem and for the long term adding some tool to do the fix-up.
About "full reindexing": it would still require to send also a deletion. Right?
the way I see it, it would copy to a new index. Aliasing could be used to preserve a generic name e.g. status if needed. Reindexing could also be useful e.g. for changing the sharding logic or the number of shards etc...
At the moment the configuration of the status bolt and the spouts is based on the same names e.g. es.status.index.name; we could add a constructor to StatusUpdaterBolt so that it can use a different value for ESBoltType, this way the spouts and the status updater could operate on different indices. The default behaviour would remain the same.
StatusUpdaterBolt if configured with routing
byDomain
should use the routing key from metadata (if provided in the field defined byes.status.routing.fieldname
). Updates of the public suffix list (included in the crawler-commons dependency) may change the domain name and routing key, and may cause duplicate status records in the index and needless refetches of the same URL (cf. commoncrawl/news-crawl#28).The simplest solution is just to use the provided routing key (similar as it's done for routing
byIP
). This would require only changes in URLPartitioner. Alternatively, StatusUpdaterBolt could check whether the routing key has changed and then send a deletion request using the original routing key and update the status document with the new routing key.