Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

I tried to get rid of the **tiff_BitsPerSample** field by deleting it before it is sent to Apache Solr #153

Closed: mitchelljj closed this issue 9 years ago

mitchelljj commented 9 years ago

I get the following error in the Apache Solr log:

ERROR - 2015-09-26 22:18:51.914; [c:gettingstarted s:shard2 r:core_node2 x:gettingstarted_shard2_replica1] org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ERROR: [doc=http://www2.ed.gov/programs/iegpsddrap/brochure-ddra.doc] Error adding field 'tiff_BitsPerSample'='8 8 8 8' msg=For input string: "8 8 8 8"
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:176)

I tried to remove the tiff_BitsPerSample field before it is sent to Apache Solr by adding the following to the Norconex minimum-config.xml file, inside the tagger element with class="com.norconex.importer.handler.tagger.impl.DeleteTagger". I tried both the field name exactly as it is reported in the Solr error and an all-lowercase version:

**tiff_BitsPerSample**,tiff_bitspersample

I even stopped and restarted Solr and then started the Norconex crawl again, but the Apache Solr log file still reports that the tiff_BitsPerSample field is causing errors. How can I prevent the tiff_BitsPerSample field from being imported into Solr and causing these errors? Do I need to start over from the very beginning and reset the Solr environment to its initial state, as described below?

The following command line will stop Solr and remove the directories for each of the two nodes that the start script created:

bin/solr stop -all ; rm -Rf example/cloud/

Adding back the initial cloud gettingstarted environment. To launch Solr, run:

bin/solr start -e cloud -noprompt

essiembre commented 9 years ago

Can you please attach your config?

essiembre commented 9 years ago

It may be a case where your DeleteTagger is set before parsing occurs. Make sure to configure the DeleteTagger in the <postParseHandlers> section of your Importer configuration.
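A minimal sketch of that placement, assuming the Importer 2.x XML syntax in which DeleteTagger takes a comma-separated `<fields>` element (the exact syntax may differ between Importer versions, so check the documentation for the version you run). Fields such as tiff_BitsPerSample appear to be extracted during parsing, which is why a handler that runs before parsing would have nothing to delete:

```xml
<!-- Fragment of the crawler configuration; only the importer section is shown. -->
<importer>
  <!-- Handlers listed here run after the document has been parsed,
       so fields extracted by the parser are available to them. -->
  <postParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.DeleteTagger">
      <!-- Assumed syntax: comma-separated list of fields to delete. -->
      <fields>tiff_BitsPerSample</fields>
    </tagger>
  </postParseHandlers>
</importer>
```

Documents that Solr already rejected are not fixed by this change; they would need to be crawled again after the configuration is updated.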

Alternatively, it may be simpler (and safer) to use the KeepOnlyTagger instead (still as a post-parse handler). This way, if a website decides to add new metadata fields to its pages, they will not make it through to Solr.
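A sketch of the KeepOnlyTagger approach, again assuming the Importer 2.x XML syntax with a comma-separated `<fields>` element. The field names listed are illustrative only and should be replaced with the fields your Solr schema actually expects:

```xml
<!-- Fragment of the crawler configuration; only the importer section is shown. -->
<importer>
  <postParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
      <!-- Assumed syntax: comma-separated whitelist of fields to keep.
           Every field not listed here is dropped before the document is committed.
           The names below are examples, not a recommended set. -->
      <fields>title,description,keywords,document.reference</fields>
    </tagger>
  </postParseHandlers>
</importer>
```

With a whitelist like this, tiff_BitsPerSample and any other unexpected parser-generated field never reach Solr, so the list only needs to change when you want to index a new field.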

essiembre commented 9 years ago

Have you resolved your issue with the last suggestions I made?

essiembre commented 9 years ago

Having received no feedback in a while on the latest suggestion, I am closing this, assuming it worked for you. You can reopen if need be.