Allow other links to be considered outlinks

mattburns commented 8 years ago

Our crawler is focussed on finding images. It would be nice if it were possible to optimise for this.

Currently, if the following HTML is parsed, only image1.jpg is added to the status queue:

<a href="image1.jpg">
    <img src="image1-thumb.jpg"/>
</a>
<img src="image2.jpg"/>

It would be nice if I could specify a config option, a custom parse filter, or similar, so that every image referenced in an img[src] attribute is also added to the status queue. The example above would then contribute all three images as outlinks.

jnioche commented 8 years ago

This could be done as a ParseFilter, with the XPath expressions specified in the YAML config. That would have the advantage of working for JSoup and potentially also for the Tika-based parser.
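To make the idea concrete, here is a minimal sketch of the XPath part only. It uses the JDK's built-in DOM and XPath classes, not the StormCrawler ParseFilter API. The class name `OutlinkXPathSketch` and the method `extractLinks` are hypothetical, and the sketch assumes the markup is well-formed XML, whereas a real filter would receive a DOM already built by the parser. The list of expressions stands in for whatever would be read from the config.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of StormCrawler: collects link candidates
// from a document using a configurable list of XPath expressions.
final class OutlinkXPathSketch {

    // Assumption: `xhtml` is well-formed XML. Each expression is expected
    // to select attribute nodes, e.g. "//a/@href" or "//img/@src".
    static List<String> extractLinks(String xhtml, List<String> expressions) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // LinkedHashSet drops duplicates while keeping discovery order.
            Set<String> links = new LinkedHashSet<>();
            for (String expression : expressions) {
                NodeList nodes = (NodeList) xpath.evaluate(
                        expression, doc, XPathConstants.NODESET);
                for (int i = 0; i < nodes.getLength(); i++) {
                    // For an attribute node, the node value is the attribute value.
                    links.add(nodes.item(i).getNodeValue());
                }
            }
            return new ArrayList<>(links);
        } catch (Exception e) {
            throw new IllegalStateException("Could not extract links", e);
        }
    }
}
```

Called with the snippet from the issue wrapped in a root element and the expressions `//a/@href` and `//img/@src`, this returns image1.jpg, image1-thumb.jpg and image2.jpg. Resolving the values against the page URL and turning them into outlinks would still be up to the filter.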

jnioche commented 8 years ago

apache / incubator-stormcrawler

Allow other links to be considered outlinks #267