Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Prevent StartURL From Being Indexed? #425

Closed: dhildreth closed this issue 6 years ago

dhildreth commented 6 years ago

Simple question... Is there a way to prevent the startURL from being submitted to the index? Thanks in advance.

I thought maybe I could add it to the RegexReferenceFilter, but that rejects it early in the crawling process and doesn't give the crawler a chance to extract URLs from it. Still looking for clever ways...

essiembre commented 6 years ago

There are different ways to do this, as long as it is done AFTER links are extracted (as you discovered). If you look at this flow diagram, you will see a few filtering options that exist only after "Extract Links".

I recommend you use a document filter in your crawler config, like this:

 <documentFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
        .*your_start_url_pattern.*
    </filter>
 </documentFilters>

The Importer module (<importer>) also contains filtering options.
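As a sketch of that Importer alternative: a filter on the document reference inside the Importer's pre-parse handlers could exclude the start URL the same way. This assumes Norconex Importer 2.x, that the URL is exposed in the `document.reference` metadata field, and that `RegexMetadataFilter` is available; verify the class name and configuration attributes against the Importer documentation for your version.

 <importer>
    <preParseHandlers>
        <!-- Hypothetical sketch: reject any document whose reference (URL)
             matches the start URL pattern. Field name and class are
             assumptions to confirm against your Importer version. -->
        <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
                onMatch="exclude" field="document.reference">
            .*your_start_url_pattern.*
        </filter>
    </preParseHandlers>
 </importer>

Either placement works for this use case; the document filter shown above is simpler when the only goal is to keep the start URL out of the index.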

Please confirm.

dhildreth commented 6 years ago

Yup! That worked great! Thank you.