Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

LanguageTagger keepProbabilities="true" and ElasticsearchCommitter compatibility #518

Closed rustyx closed 6 years ago

rustyx commented 6 years ago

With LanguageTagger's keepProbabilities="true", I'm unable to index documents into Elasticsearch. Is there a way to do it? The mapping for document.language can be either text or object, not both. How can I configure the ES index to accept text values for both document.language and document.language.1.probability?

Right now I'm getting:

"error": { "type": "mapper_parsing_exception", "reason": "Could not dynamically add mapping for field [document.language.1.probability]. Existing mapping for [document.language] must be of type object but found [text]." }
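For context, here is a minimal sketch of the kind of Importer configuration that leads to this error. It assumes the `${handler}` variable is defined as in the workaround configuration posted later in this thread, and it leaves every other LanguageTagger option at its default. The field names in the comment are the ones mentioned in this issue.

```xml
<!-- Sketch only: assumes ${handler} resolves to the Importer handler package. -->
<tagger class="${handler}.tagger.impl.LanguageTagger" keepProbabilities="true" />
<!-- Fields named in this issue that result from this setting:
       document.language                  (plain text value)
       document.language.1.tag
       document.language.1.probability
     Elasticsearch reads the dotted names as paths under an object called
     document.language, which conflicts with the plain text field of the
     same name. -->
```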

essiembre commented 6 years ago

Elasticsearch treats dots in field names as object paths, so you have to make sure you are not sending non-object fields with dots in their names. You can do so with the RenameTagger in the Importer configuration section or, even simpler, you can tell the Elasticsearch Committer to replace the dots in field names with a value of your choice before sending your documents:

<dotReplacement>_</dotReplacement>

I suggest you have a look at the Committer configuration page for more options. You may be interested in other settings such as jsonFieldsPattern and fixBadIds.
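As a rough illustration of where that setting might sit, the sketch below places `dotReplacement` inside a committer block. Only `dotReplacement` comes from this thread. The class name and the `nodes`, `indexName`, and `typeName` elements are assumptions based on the Elasticsearch Committer of that era, and their values are placeholders, so verify them against the Committer configuration page for your version.

```xml
<!-- Sketch only: everything except dotReplacement is an assumption. -->
<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
  <!-- Placeholder connection and index settings. -->
  <nodes>http://localhost:9200</nodes>
  <indexName>crawl</indexName>
  <typeName>doc</typeName>
  <!-- Replace every dot in field names with an underscore before sending. -->
  <dotReplacement>_</dotReplacement>
</committer>
```

With this replacement, document.language.1.probability would be sent as document_language_1_probability and document.language as document_language, so Elasticsearch no longer sees one name used as both a text field and an object.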

rustyx commented 6 years ago

Yes, there are various workarounds, but wouldn't it be easier to just use a different field name for the probabilities? For example, document.languages.

It is also desirable to have the probabilities in a single field.

My current workaround looks like this:

<!-- Merge the tag/probability pairs into one comma-separated document.languages value. -->
<tagger class="${handler}.tagger.impl.MergeTagger">
  <merge toField="document.languages" singleValue="true" singleValueSeparator=",">
    <fromFields>document.language.1.tag,document.language.1.probability,
        document.language.2.tag,document.language.2.probability,
        document.language.3.tag,document.language.3.probability</fromFields>
  </merge>
</tagger>
<!-- Delete the original dotted fields; the regex does not match document.language or document.languages. -->
<tagger class="${handler}.tagger.impl.DeleteTagger">
  <fromFieldsRegex>document\.language\..*</fromFieldsRegex>
</tagger>

I will close this for now since I have a workaround.