Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Is there a way to prevent a field from being included in the Apache Solr index? #147

Closed mitchelljj closed 9 years ago

mitchelljj commented 9 years ago

I have modified the minimum example to crawl my website (see the command below). One of the fields it outputs is called "collector.referenced-urls", which contains many links that I am not interested in indexing into Apache Solr. Is there a way to prevent this field from being included in the Apache Solr index?

Thanks,

John

[jmitchell@rtpadtwwwd03 norconex-collector-http-2.2.1]$ /home/jmitchell/20150905/norconex-collector-http-2.2.1/collector-http.sh -a start -c examples/minimum/minimum-config.xml

essiembre commented 9 years ago

This field is useful if you want to keep track of which pages on your site reference which other pages. You can easily control what gets into Solr in a few different ways. Look at these Importer handlers:

KeepOnlyTagger This class allows you to specify exactly what to keep. It will drop everything else.

DeleteTagger This one performs the opposite: you tell it which fields you do NOT want to keep. Everything else will go through.

RenameTagger This one allows you to rename fields to match whatever Solr names you prefer.

I recommend configuring whichever of these you choose as a post-parse handler in the Importer, like this:

<httpcollector id="My Collector">
    <crawlers>
        <crawler id="My Crawler">
            ...
            <importer>
                ...
                <postParseHandlers>
                    ...
                    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
                        <fields>document.reference, keywords, description, whatever.else</fields>
                        <fieldsRegex>myCustomField.*</fieldsRegex>
                    </tagger>
                    ...
                </postParseHandlers>
                ...
            </importer>
            ...
        </crawler>
        ...
    </crawlers>
</httpcollector>
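
For the specific field in the question, DeleteTagger is probably the shorter route, since only one field needs to go. The following is a minimal sketch, not a verified configuration: it assumes DeleteTagger accepts the same `<fields>` element as the KeepOnlyTagger example above, and that RenameTagger takes `<rename>` elements with `fromField` and `toField` attributes. Check both against the Importer documentation for your version. The `title` to `solr_title` mapping is a made-up illustration.

```xml
<postParseHandlers>
    <!-- Drop only the field from the question; every other field passes through. -->
    <!-- Assumption: DeleteTagger uses the same <fields> element as KeepOnlyTagger. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.DeleteTagger">
        <fields>collector.referenced-urls</fields>
    </tagger>
    <!-- Hypothetical rename: "solr_title" is an illustrative Solr field name. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
        <rename fromField="title" toField="solr_title" overwrite="true" />
    </tagger>
</postParseHandlers>
```

This fragment would sit inside the `<importer>` element, in the same position as the `<postParseHandlers>` block in the example above.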