Closed dhildreth closed 6 years ago
There are different ways to do this, as long as it is done AFTER links are extracted (as you discovered). If you look at this flow diagram, you will see a few filtering options that exist only after "Extract Links".
I recommend you use a document filter in your crawler config, like this:
<documentFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
    .*your_start_url_pattern.*
  </filter>
</documentFilters>
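To keep only the start URL itself out of the index while its child pages still get committed, the pattern probably needs to be narrower than a catch-all. Here is a rough sketch, assuming the regular expression is matched against the entire document reference (I have not verified this against every version) and using a made-up start URL, https://www.example.com/start.html, that you would replace with your own:

```xml
<documentFilters>
  <!-- Hypothetical start URL; dots are escaped so they match literally.
       Assumption: the regex must match the whole reference, so no
       leading or trailing ".*" is used and deeper pages are kept. -->
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">https?://www\.example\.com/start\.html</filter>
</documentFilters>
```

If pages under the start URL also disappear from the index, the pattern is likely too broad and should be tightened.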
The Importer module (<importer>) also contains filtering options.
Please confirm.
Yup! That worked great! Thank you.
Simple question... Is there a way to prevent the startURL from being submitted to the index? Thanks in advance.
I thought maybe I could add it to the RegexReferenceFilter, but that rejects it early in the crawling process, before the crawler has a chance to extract URLs from it. Still looking for clever ways...