Is there a way to prevent a field from being included with the Apache Solr index? #147

Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.

This field is useful if you want to keep track of which pages on your site reference which other pages. If you do not need it, you can easily control what gets into Solr in a few different ways. Look at these Importer handlers:

KeepOnlyTagger: This class lets you specify exactly which fields to keep. It drops everything else.

DeleteTagger: This one does the opposite. You tell it which fields you do NOT want to keep, and everything else goes through.

RenameTagger: This one lets you rename fields to match whatever Solr field names you prefer.

I recommend using any of the above as a post-parse import handler, like this:

<httpcollector id="My Collector">
    <crawlers>
        <crawler id="My Crawler">
            ...
            <importer>
                ...
                <postParseHandlers>
                    ...
                    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
                        <fields>document.reference, keywords, description, whatever.else</fields>
                        <fieldsRegex>myCustomField.*</fieldsRegex>
                    </tagger>
                    ...
                </postParseHandlers>
                ...
            </importer>
            ...
        </crawler>
        ...
    </crawlers>
</httpcollector>
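
For comparison, here is a rough sketch of how the other two taggers might be configured in the same `<postParseHandlers>` section. The `<fields>` and `<fieldsRegex>` elements mirror the KeepOnlyTagger example above, and the `<rename>` element with `fromField` and `toField` attributes is my recollection of the RenameTagger syntax. The exact syntax can differ between Importer versions, so please check the class documentation for the version you run. All field names below (`unwanted.field`, `X-.*`, `description`, `summary`) are made-up placeholders, not fields the crawler is known to produce.

```xml
<postParseHandlers>
    <!-- Drop only the listed fields; everything else goes through. -->
    <!-- "unwanted.field" and "X-.*" are hypothetical examples. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.DeleteTagger">
        <fields>unwanted.field</fields>
        <fieldsRegex>X-.*</fieldsRegex>
    </tagger>

    <!-- Rename a field to match the name used in your Solr schema. -->
    <!-- "description" and "summary" are hypothetical examples. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
        <rename fromField="description" toField="summary" />
    </tagger>
</postParseHandlers>
```

You would normally pick either KeepOnlyTagger or DeleteTagger, not both: use KeepOnlyTagger when the list of wanted fields is short, and DeleteTagger when you only need to get rid of one or two fields.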
