StatusUpdaterBolt to use provided domain name for routing

apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm

https://stormcrawler.apache.org/

Apache License 2.0

889 stars 260 forks source link

StatusUpdaterBolt to use provided domain name for routing #684

Closed sebastian-nagel closed 5 years ago

sebastian-nagel commented 5 years ago

StatusUpdaterBolt if configured with routing byDomain should use the routing key from metadata (if provided in the field defined by es.status.routing.fieldname). Updates of the public suffix list (included in the crawler-commons dependency) may change the domain name and routing key, and may cause duplicate status records in the index and needless refetches of the same URL (cf. commoncrawl/news-crawl#28).

The simplest solution is just to use the provided routing key (similar as it's done for routing byIP). This would require only changes in URLPartitioner. Alternatively, StatusUpdaterBolt could check whether the routing key has changed and then send a deletion request using the original routing key and update the status document with the new routing key.

jnioche commented 5 years ago

thanks @sebastian-nagel, get the problem, will think about the best way to implement it

jnioche commented 5 years ago

@sebastian-nagel have put a fix in branch 684, see commit above

you'll need to add es.routing.value to the metadata.persist configuration, even though that particular metadata doesn't actually get stored as it is the only way of passing it to the statusupdaterbolt.

Let me know what you think

sebastian-nagel commented 5 years ago

Thanks, I'll test it during the next days. To pick the _routing value from ES is of course the most reliable solution, maybe better than using metadata.hostname (see es.status.bucket.field resp. es.status.routing.fieldname).

jnioche commented 5 years ago

Am not super happy with this solutiion. It prevents URLS from being refetched forever but means that URLs discovered for the same domain might end up on a different shard. We'd need to extend it to the other backends as well as they might suffer from the same issue. We could add deletion of the old entries but the code would be more complex for an issue which will affect a tiny portion of the users i.e. those who have a long-running crawl + have updated the dependency on crawler-commons either via a new storm-crawler or directly + use domain for sharding.

I also don't like the special case for es.routing.value.

Instead, we could have an external mechanism for reindexing an entire status index. I'll also add a comment in the release note so that people force a version of the domain list or crawler-commons if they upgrade.

what do you think @sebastian-nagel ?

jnioche commented 5 years ago

The reindexing could be done as a topology with the StatusUpdaterBolt, similar to what we do when injecting except that the content would come from a bespoke Spout which would simply read all the content from a shard with a Search Scroll. We'd just need a new spout implementation for this.

sebastian-nagel commented 5 years ago

No question, looks like there is no smart and simple solution. From the maintenance perspective:

not upgrading the public suffix list isn't really an option, there are too many additions over the years
a "self-repairing" status updater (delete and add with new routing key) would be nice
but a tool to do the fix-up once after a software upgrade is ok (if it's fast enough)

Given that it's possible to quickly get a list of domain names (as routing keys or metadata.hostname):

get the domain name for the routing key in strict/validating mode, and if it differs from the original value or is now null (for new public suffixes, e.g. kiev.ua):
get the full list of items and submit a delete and update

I'm ok with documenting the potential problem and for the long term adding some tool to do the fix-up.

sebastian-nagel commented 5 years ago

About "full reindexing": it would still require to send also a deletion. Right?

jnioche commented 5 years ago

the way I see it, it would copy to a new index. Aliasing could be used to preserve a generic name e.g. status if needed. Reindexing could also be useful e.g. for changing the sharding logic or the number of shards etc...

jnioche commented 5 years ago

At the moment the configuration of the status bolt and the spouts is based on the same names e.g. es.status.index.name; we could add a constructor to StatusUpdaterBolt so that it can use a different value for ESBoltType, this way the spouts and the status updater could operate on different indices. The default behaviour would remain the same.