apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

Allow other links to be considered outlinks #267

Closed by mattburns 8 years ago

mattburns commented 8 years ago

Our crawler is focussed on finding images. It would be nice if it were possible to optimise for this.

Currently, if the following HTML is parsed, only image1.jpg is added to the status queue:

<a href="image1.jpg">
    <img src="image1-thumb.jpg"/>
</a>
<img src="image2.jpg"/>

It would be nice if I could specify a config option, a custom parse filter, or similar so that every URL found in an img[src] attribute is also added to the status queue. The example above would then yield all three images as outlinks.

jnioche commented 8 years ago

This could be done as a ParseFilter, with the XPath expressions specified in the YAML config. That would have the advantage of working for JSoup and potentially also for the Tika-based parser.
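
To illustrate the idea, here is a minimal sketch of the extraction step such a filter would perform. It uses only the JDK's XML and XPath classes, not the StormCrawler ParseFilter API. The class name, the method name, the example base URL, and the hard-coded list of expressions (standing in for values read from the YAML config) are all assumptions made for illustration. It also assumes the markup is well-formed, so the snippet from the issue is wrapped in a single root element.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of StormCrawler.
class ImageOutlinkSketch {

    /**
     * Evaluates each XPath expression against well-formed markup and
     * returns the matched attribute values as absolute URLs, without duplicates.
     */
    static List<String> extractOutlinks(String xhtml, String baseUrl, List<String> expressions) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            URI base = URI.create(baseUrl);
            List<String> outlinks = new ArrayList<>();
            for (String expression : expressions) {
                NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                        .evaluate(expression, doc, XPathConstants.NODESET);
                for (int i = 0; i < nodes.getLength(); i++) {
                    // Resolve relative references such as "image2.jpg" against the page URL.
                    String url = base.resolve(nodes.item(i).getNodeValue().trim()).toString();
                    if (!outlinks.contains(url)) {
                        outlinks.add(url);
                    }
                }
            }
            return outlinks;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup or evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // The snippet from the issue, wrapped in a root element so it parses as XML.
        String page = "<body>"
                + "<a href=\"image1.jpg\"><img src=\"image1-thumb.jpg\"/></a>"
                + "<img src=\"image2.jpg\"/>"
                + "</body>";
        // In a real filter these expressions would come from the configuration.
        List<String> expressions = List.of("//a/@href", "//img/@src");
        extractOutlinks(page, "http://example.com/gallery/", expressions)
                .forEach(System.out::println);
    }
}
```

With both expressions configured, all three image URLs from the example are returned; dropping "//img/@src" leaves only image1.jpg, which matches the behaviour described in the issue.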

jnioche commented 8 years ago

Related: http://stackoverflow.com/questions/36722755/crawling-video-with-apache-nutch