This could be done as a ParseFilter, with the XPath expressions specified in the YAML config. That would have the advantage of working with JSoup and potentially also with the Tika-based parser.
Our crawler is focussed on finding images. It would be nice if it were possible to optimise for this.
Currently if the following html is parsed, only image1.jpg will be added to the status queue:

It would be nice if I could specify a config, or a custom parse filter, etc so that all images found in the img[src] attribute are added to the status queue as well so that the above example added all three images to the outlinks.
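
For what it's worth, here is a rough standalone sketch of the XPath idea from the comments, using only the JDK's DOM and XPath classes. It is not the crawler's ParseFilter API: the class name `ImgSrcOutlinks`, the `extract` helper, the sample markup and the example.com URLs are all made up for illustration, and it assumes the page is well-formed XHTML so the JDK XML parser can read it. A real filter would presumably run the same kind of expression against the DOM the parser already builds and take the expression from the config.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ImgSrcOutlinks {

    // Hypothetical helper (not part of any crawler API): collects every
    // img/@src value and resolves it against the URL of the page.
    public static List<String> extract(String xhtml, String baseUrl) throws Exception {
        // Assumption: the input is well-formed XHTML, so a plain XML parser works.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        // The XPath expression is hard-coded here; in the proposal it would
        // come from the YAML config instead.
        NodeList srcs = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate("//img/@src", doc, XPathConstants.NODESET);

        URI base = URI.create(baseUrl);
        List<String> outlinks = new ArrayList<>();
        for (int i = 0; i < srcs.getLength(); i++) {
            String src = srcs.item(i).getNodeValue().trim();
            if (!src.isEmpty()) {
                // Turn relative references into absolute URLs.
                outlinks.add(base.resolve(src).toString());
            }
        }
        return outlinks;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page, not the example from the issue.
        String page = "<html><body>"
                + "<img src=\"a.png\"/>"
                + "<img src=\"/img/b.png\"/>"
                + "<img src=\"http://cdn.example.org/c.png\"/>"
                + "</body></html>";
        for (String url : extract(page, "http://example.com/gallery/index.html")) {
            System.out.println(url);
        }
    }
}
```

In an actual filter the resolved URLs would be added to the outlinks of the parse result rather than printed, which is what would get them into the status queue.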