Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers that collect, parse, and manipulate data from the web or a filesystem and send it to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Option to parse links when web page has been rejected #3

Closed. stejacob closed this issue 11 years ago.

stejacob commented 11 years ago

When crawling web pages, it would be useful to provide an option in the XML configuration file to parse links even from rejected pages.

essiembre commented 11 years ago

This is already supported. Any filtering applied BEFORE the URL extraction phase prevents URLs from being extracted, since the document won't even be downloaded. If you want to reject a file after it has been downloaded and its URLs have been extracted, simply filter it AFTER the URL extraction phase. To see your filtering options, look at what is available to you before and after urlExtractor here: http://www.norconex.com/product/collector-http/examples/collector-http-config-reference.xml

essiembre commented 11 years ago

To be more precise, you can look at using httpDocumentFilters or one of the Importer module filter tags.
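For illustration, here is a rough sketch of how filter placement might look in a crawler configuration. The httpURLFilters, urlExtractor, and httpDocumentFilters element names follow the comments and the linked config reference; the RegexURLFilter class, its attributes, and the sample patterns are assumptions to be verified against that reference rather than a definitive recipe.

<!-- Illustrative <crawler> excerpt (not a complete configuration). -->
<crawler id="example-crawler">

  <!-- Filters here run BEFORE URL extraction: a rejected page is never
       downloaded, so none of its links are extracted. -->
  <httpURLFilters>
    <!-- Hypothetical exclusion pattern for demonstration only. -->
    <filter class="com.norconex.collector.http.filter.impl.RegexURLFilter"
            onMatch="exclude">.*/private/.*</filter>
  </httpURLFilters>

  <!-- urlExtractor would be configured here; links are extracted at this point. -->

  <!-- Filters here run AFTER URL extraction: the page is downloaded and its
       links are queued for crawling, but the document itself is rejected. -->
  <httpDocumentFilters>
    <!-- Same hypothetical pattern, applied to the downloaded document. -->
    <filter class="com.norconex.collector.http.filter.impl.RegexURLFilter"
            onMatch="exclude">.*/private/.*</filter>
  </httpDocumentFilters>

</crawler>

In other words, moving the same exclusion rule from before to after urlExtractor changes whether a rejected page still contributes its links to the crawl queue.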