mitchelljj closed 9 years ago
This field is useful if you want to keep track of which pages on your site reference which other pages. You can easily control what gets into Solr in a few different ways. Look at these Importer handlers:
KeepOnlyTagger: This class lets you specify exactly which fields to keep. It drops everything else.
DeleteTagger: This one does the opposite: you tell it which fields you do NOT want to keep. Everything else goes through.
RenameTagger: This one lets you rename fields to match whatever Solr field names you prefer.
I recommend using one of the above as a post-parse import handler, like this:
<httpcollector id="My Collector">
  <crawlers>
    <crawler id="My Crawler">
      ...
      <importer>
        ...
        <postParseHandlers>
          ...
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>document.reference, keywords, description, whatever.else</fields>
            <fieldsRegex>myCustomField.*</fieldsRegex>
          </tagger>
          ...
        </postParseHandlers>
        ...
      </importer>
      ...
    </crawler>
    ...
  </crawlers>
</httpcollector>
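If the goal is only to drop collector.referenced-urls and keep everything else, DeleteTagger is likely the shorter route. The fragment below is a minimal sketch, not a tested configuration: it assumes DeleteTagger accepts the same <fields> element as the KeepOnlyTagger example above, so check the Importer documentation for your version before relying on it.

```xml
<!-- Sketch only: goes inside <importer> of the crawler, in place of the
     KeepOnlyTagger shown above. The <fields> element is assumed to work
     the same way as in the KeepOnlyTagger example. -->
<postParseHandlers>
  <tagger class="com.norconex.importer.handler.tagger.impl.DeleteTagger">
    <!-- Comma-separated list of fields to remove before committing to Solr -->
    <fields>collector.referenced-urls</fields>
  </tagger>
</postParseHandlers>
```

With this approach any new fields the collector adds later will still reach Solr, whereas KeepOnlyTagger would drop them unless they are listed explicitly.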
I have modified the minimum example to crawl my website (see the command below), and one of the fields it displays is called "collector.referenced-urls". It contains many links that I am not interested in indexing into Apache Solr. Is there a way to prevent this field from being included in the Apache Solr index?
Thanks,
John
[jmitchell@rtpadtwwwd03 norconex-collector-http-2.2.1]$ /home/jmitchell/20150905/norconex-collector-http-2.2.1/collector-http.sh -a start -c examples/minimum/minimum-config.xml