Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers that collect, parse, and manipulate data from the web or a filesystem and send it to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Option to parse links when web page has been rejected #3

Closed. stejacob closed this issue 11 years ago.

stejacob commented 11 years ago

When crawling web pages, it would be useful to provide an option in the XML configuration file to parse links even from rejected pages.

essiembre commented 11 years ago

This is already supported. Any filtering applied BEFORE the URL extraction phase prevents URLs from being extracted, since the document won't even be downloaded. If you want to reject a file after it has been downloaded and its URLs have been extracted, simply filter it AFTER the URL extraction phase. To see your filtering options, look at what is available to you before and after urlExtractor here: http://www.norconex.com/product/collector-http/examples/collector-http-config-reference.xml

essiembre commented 11 years ago

To be more precise, you can look at using httpDocumentFilters or one of the Importer module filter tags.
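For illustration, here is a rough sketch of how filter placement might look in a crawler configuration. The httpURLFilters, urlExtractor, and httpDocumentFilters element names follow the comments and the linked config reference; the RegexURLFilter class, its attributes, and the sample patterns are assumptions to be verified against that reference rather than a definitive recipe.

<!-- Illustrative <crawler> excerpt (not a complete configuration). -->
<crawler id="example-crawler">

  <!-- Filters here run BEFORE URL extraction: a rejected page is never
       downloaded, so none of its links are extracted. -->
  <httpURLFilters>
    <!-- Hypothetical exclusion pattern for demonstration only. -->
    <filter class="com.norconex.collector.http.filter.impl.RegexURLFilter"
            onMatch="exclude">.*/private/.*</filter>
  </httpURLFilters>

  <!-- urlExtractor would be configured here; links are extracted at this point. -->

  <!-- Filters here run AFTER URL extraction: the page is downloaded and its
       links are queued for crawling, but the document itself is rejected. -->
  <httpDocumentFilters>
    <!-- Same hypothetical pattern, applied to the downloaded document. -->
    <filter class="com.norconex.collector.http.filter.impl.RegexURLFilter"
            onMatch="exclude">.*/private/.*</filter>
  </httpDocumentFilters>

</crawler>

In other words, moving the same exclusion rule from before to after urlExtractor changes whether a rejected page still contributes its links to the crawl queue.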