Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Prevent StartURL From Being Indexed? #425

Closed: dhildreth closed this issue 6 years ago

dhildreth commented 6 years ago

Simple question... Is there a way to prevent the startURL from being submitted to the index? Thanks in advance.

I thought maybe I could add it to the RegexReferenceFilter, but that rejects it early in the crawling process and doesn't give the crawler a chance to extract URLs from it. Still looking for clever ways...

essiembre commented 6 years ago

There are different ways to do this, as long as it is done AFTER links are extracted (as you discovered). If you look at this flow diagram, you will see a few filtering options that exist only after "Extract Links".

I recommend you use a document filter in your crawler config, like this:

 <documentFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
        .*your_start_url_pattern.*
    </filter>
 </documentFilters>

The Importer module (<importer>) also contains filtering options.
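As a sketch of that Importer alternative: a filter on the document reference inside the Importer's pre-parse handlers could exclude the start URL the same way. This assumes Norconex Importer 2.x, that the URL is exposed in the `document.reference` metadata field, and that `RegexMetadataFilter` is available; verify the class name and configuration attributes against the Importer documentation for your version.

 <importer>
    <preParseHandlers>
        <!-- Hypothetical sketch: reject any document whose reference (URL)
             matches the start URL pattern. Field name and class are
             assumptions to confirm against your Importer version. -->
        <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
                onMatch="exclude" field="document.reference">
            .*your_start_url_pattern.*
        </filter>
    </preParseHandlers>
 </importer>

Either placement works for this use case; the document filter shown above is simpler when the only goal is to keep the start URL out of the index.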

Please confirm.

dhildreth commented 6 years ago

Yup! That worked great! Thank you.