apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0
889 stars 260 forks source link

StatusUpdaterBolt to use provided domain name for routing #684

Closed sebastian-nagel closed 5 years ago

sebastian-nagel commented 5 years ago

StatusUpdaterBolt if configured with routing byDomain should use the routing key from metadata (if provided in the field defined by es.status.routing.fieldname). Updates of the public suffix list (included in the crawler-commons dependency) may change the domain name and routing key, and may cause duplicate status records in the index and needless refetches of the same URL (cf. commoncrawl/news-crawl#28).

The simplest solution is just to use the provided routing key (similar as it's done for routing byIP). This would require only changes in URLPartitioner. Alternatively, StatusUpdaterBolt could check whether the routing key has changed and then send a deletion request using the original routing key and update the status document with the new routing key.

jnioche commented 5 years ago

thanks @sebastian-nagel, get the problem, will think about the best way to implement it

jnioche commented 5 years ago

@sebastian-nagel have put a fix in branch 684, see commit above

you'll need to add es.routing.value to the metadata.persist configuration, even though that particular metadata doesn't actually get stored as it is the only way of passing it to the statusupdaterbolt.

Let me know what you think

sebastian-nagel commented 5 years ago

Thanks, I'll test it during the next days. To pick the _routing value from ES is of course the most reliable solution, maybe better than using metadata.hostname (see es.status.bucket.field resp. es.status.routing.fieldname).

jnioche commented 5 years ago

Am not super happy with this solutiion. It prevents URLS from being refetched forever but means that URLs discovered for the same domain might end up on a different shard. We'd need to extend it to the other backends as well as they might suffer from the same issue. We could add deletion of the old entries but the code would be more complex for an issue which will affect a tiny portion of the users i.e. those who have a long-running crawl + have updated the dependency on crawler-commons either via a new storm-crawler or directly + use domain for sharding.

I also don't like the special case for es.routing.value.

Instead, we could have an external mechanism for reindexing an entire status index. I'll also add a comment in the release note so that people force a version of the domain list or crawler-commons if they upgrade.

what do you think @sebastian-nagel ?

jnioche commented 5 years ago

The reindexing could be done as a topology with the StatusUpdaterBolt, similar to what we do when injecting except that the content would come from a bespoke Spout which would simply read all the content from a shard with a Search Scroll. We'd just need a new spout implementation for this.

sebastian-nagel commented 5 years ago

No question, looks like there is no smart and simple solution. From the maintenance perspective:

Given that it's possible to quickly get a list of domain names (as routing keys or metadata.hostname):

I'm ok with documenting the potential problem and for the long term adding some tool to do the fix-up.

sebastian-nagel commented 5 years ago

About "full reindexing": it would still require to send also a deletion. Right?

jnioche commented 5 years ago

the way I see it, it would copy to a new index. Aliasing could be used to preserve a generic name e.g. status if needed. Reindexing could also be useful e.g. for changing the sharding logic or the number of shards etc...

jnioche commented 5 years ago

At the moment the configuration of the status bolt and the spouts is based on the same names e.g. es.status.index.name; we could add a constructor to StatusUpdaterBolt so that it can use a different value for ESBoltType, this way the spouts and the status updater could operate on different indices. The default behaviour would remain the same.