Closed stejacob closed 11 years ago
This is already supported. Any filtering done BEFORE the URL extraction phase prevents URLs from being extracted, since the document won't even be downloaded. If you want to reject a file after it has been downloaded and its URLs have been extracted, simply filter it AFTER the URL extraction phase. To see your filtering options, look at what is available to you before and after urlExtractor
here: http://www.norconex.com/product/collector-http/examples/collector-http-config-reference.xml
To be more precise, you can look at using httpDocumentFilters, or one of the Importer module filter tags.
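As a rough sketch of what that could look like, here is a crawler fragment that filters AFTER link extraction, so a rejected page still has its URLs extracted and followed. The filter class name and regex are assumptions for illustration; check the configuration reference above for the exact classes available in your version:

```xml
<!-- Hypothetical fragment of a collector-http crawler configuration.
     httpDocumentFilters runs after the page is downloaded and its
     links have been extracted, so excluded pages still contribute URLs. -->
<crawler id="example-crawler">
  <httpDocumentFilters>
    <!-- Class name is an assumption; consult the config reference. -->
    <filter class="com.norconex.collector.http.filter.impl.RegexURLFilter"
            onMatch="exclude">
      .*\.pdf$
    </filter>
  </httpDocumentFilters>
</crawler>
```

With a fragment like this, PDF documents would be rejected after download and link extraction, rather than being skipped entirely as they would be with a pre-download reference filter.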
When crawling web pages, it would be useful to provide an option in the XML file to parse links even for rejected pages.