I tried to get rid of the tiff_BitsPerSample field by deleting it before it is sent to Apache Solr

mitchelljj commented 9 years ago

I get the below Apache Solr log error:

ERROR - 2015-09-26 22:18:51.914; [c:gettingstarted s:shard2 r:core_node2 x:gettingstarted_shard2_replica1] org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ERROR: [doc=http://www2.ed.gov/programs/iegpsddrap/brochure-ddra.doc] Error adding field 'tiff_BitsPerSample'='8 8 8 8' msg=For input string: "8 8 8 8"
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:176)

I tried to get rid of the tiff_BitsPerSample field by deleting it before it is sent to Apache Solr. To do that, I added the following to the tagger class="com.norconex.importer.handler.tagger.impl.DeleteTagger" tag in the Norconex minimum-config.xml file. I listed the field name exactly as it is reported in the Solr error, and I even tried it in all lower case as well:

tiff_BitsPerSample,tiff_bitspersample

I even stopped Solr, started it again, and then restarted the Norconex crawl, but the Apache Solr log file still reports that the tiff_BitsPerSample field is causing errors. How can I prevent this tiff_BitsPerSample field from being imported into Solr and causing these errors? Do I need to start over from the very beginning and reset the Solr environment to its starting point, as listed below?

The following command line will stop Solr and remove the directories for each of the two nodes that the start script created:

bin/solr stop -all ; rm -Rf example/cloud/

To add back the initial cloud gettingstarted environment, launch Solr by running:

bin/solr start -e cloud -noprompt

essiembre commented 9 years ago

Can you please attach your config?

essiembre commented 9 years ago

It may be a case where your DeleteTagger is set before parsing occurs. Make sure to configure the DeleteTagger in the <postParseHandlers> section of your Importer configuration.
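As a minimal sketch of that placement, assuming the Importer 2.x configuration layout: the `<fields>` element name is inferred from the comma-separated list in the original post, so check the DeleteTagger documentation for your version before relying on it.

```xml
<importer>
  <!-- Handlers listed here run after the document parser has extracted
       metadata, so tiff_BitsPerSample already exists and can be deleted. -->
  <postParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.DeleteTagger">
      <!-- Assumed syntax: comma-separated list of fields to delete. -->
      <fields>tiff_BitsPerSample</fields>
    </tagger>
  </postParseHandlers>
</importer>
```

The same tagger placed under `<preParseHandlers>` would run before the parser creates the field, which would explain why the field still reaches Solr.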

Alternatively, it may be simpler (and safer) to use the KeepOnlyTagger instead (still as a post-parse handler). This way, if a website decides to add new metadata fields to its pages, they will not make it through to Solr.
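A rough sketch of the KeepOnlyTagger variant, under the same assumptions about the configuration layout. The field names listed are placeholders for illustration only; replace them with the fields your Solr schema actually expects.

```xml
<importer>
  <postParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
      <!-- Placeholder whitelist: every field not listed here, including
           tiff_BitsPerSample, is dropped before the document is committed. -->
      <fields>title,description</fields>
    </tagger>
  </postParseHandlers>
</importer>
```

Because this is a whitelist rather than a blacklist, unexpected fields extracted from other file types are discarded without further configuration changes.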

essiembre commented 9 years ago

Have you resolved your issue with the last suggestions I made?

essiembre commented 9 years ago

Having received no feedback in a while on the latest suggestion, I am closing this, assuming it worked for you. You can reopen if need be.

Norconex / crawlers

I tried to get rid of the tiff_BitsPerSample field by deleting it before it is sent to Apache Solr #153
