elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Hyphenation decompounder malfunction? #13935

Open babadofar opened 8 years ago

babadofar commented 8 years ago

Hi, I'm trying to use the hyphenation decompounder to split the Norwegian word "chaplinpris" into its two parts, "chaplin" and "pris", but I'm having no luck. I'm following the instructions on this page: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-compound-word-tokenfilter.html

I handcrafted my own hyphenation pattern file:

<hyphenation-info>
<hyphen-min before="2" after="2"/>
<patterns>
chaplin7pris
</patterns>
</hyphenation-info>

The 7 between chaplin and pris should mean that this position is highly eligible for a word split. (Some documentation for the patterns can be found here: http://xmlgraphics.apache.org/fop/1.1/hyphenation.html#patterns)
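To make the pattern notation concrete, here is a minimal Python sketch of how a single TeX-style pattern can be read: digits are weights attached to the position between two letters, and in Liang's hyphenation scheme an odd weight permits a break while an even weight forbids one. The function names `parse_pattern` and `break_points` are made up for this illustration, and the sketch looks at one pattern in isolation; a real hyphenator combines many overlapping patterns and also applies the `hyphen-min` limits.

```python
def parse_pattern(pattern):
    """Split a TeX-style pattern into its letters and inter-letter weights.

    weights[i] is the weight of the position just before letters[i];
    the final entry is the position after the last letter.
    """
    letters = []
    weights = [0]
    for ch in pattern:
        if ch.isdigit():
            # A digit sets the weight of the current inter-letter position.
            weights[-1] = int(ch)
        else:
            letters.append(ch)
            weights.append(0)
    return "".join(letters), weights


def break_points(pattern):
    """Return the offsets where this single pattern permits a break (odd weight)."""
    _, weights = parse_pattern(pattern)
    return [i for i, w in enumerate(weights) if w % 2 == 1]


if __name__ == "__main__":
    word, _ = parse_pattern("chaplin7pris")
    for offset in break_points("chaplin7pris"):
        print(word[:offset], word[offset:])
```

Read this way, `chaplin7pris` marks exactly one candidate split, at offset 7, between "chaplin" and "pris".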

The analysis settings:

PUT /wiki/_settings
{
    "settings": {
        "analysis": {
            "analyzer": {
                "hyph_decoumpound_list": {
                    "tokenizer": "standard",
                    "filter": [
                        "standard",
                        "lowercase",
                        "hyph_decompound_list"
                    ]
                }
            },
            "filter": {
                "hyph_decompound_list": {
                    "type": "hyphenation_decompounder",
                    "word_list": ["pris", "chaplin"],
                    "hyphenation_patterns_path": "config/lang/hyphenation/test.xml"
                }
            }
        }
    }
}

Testing analysis

GET /wiki/_analyze?analyzer=hyph_decoumpound_list&text=chaplinpris

returns no splits:

{
   "tokens": [
      {
         "token": "chaplinpris",
         "start_offset": 0,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 1
      }
   ]
}

Using the same analysis settings, but replacing the type hyphenation_decompounder with dictionary_decompounder, works fine. So it seems the hyphenation decompounder doesn't do what it's supposed to?

clintongormley commented 8 years ago

Hi @babadofar

I've just had a play with it in 2.0 and am seeing the same thing. I haven't used it before, so may be missing something obvious. Requires investigation...

babadofar commented 8 years ago

Thanks! Glad to not be the only one with this experience ;) But really, even if the hyphenation decompounder did do what it says, I'm not sure it would be helpful for splitting compound words in sensible places. The hyphenation patterns from TeX are meant as guidelines for where words may be broken so that text flows readably across the screen. The word parts produced by the hyphenation algorithm do not quite correspond to meaningful words. The dictionary decompounder actually comes closer, with its brute-force logic. To me it seems that the dictionary decompounder would do a great job with Scandinavian compound words with only two additions:

  1. Don't try to find all possible words. Try to match enough words to completely fill the original, nothing more or less. Except...
  2. Allow a (short) predefined list of interfixes that may be found in between sub-words.

Example for 1:

Dictionary contains: arbeider, arbeid, er, partiet, parti, et
Compound word: arbeiderpartiet
Should be split into: arbeider, partiet
Do not add: er, et, parti

Example for 2:

Dictionary contains: barn, hage, arn, arne, age
Interfix list contains: e, s
Compound word: barnehage
Should be split into: barn, hage
Remove interfix e
Do not add: arn, arne, age
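The two proposed rules can be sketched in a few lines of Python. This is only an illustration of the idea, not how Lucene's dictionary decompounder works: the function name `decompound` and the "try the longest sub-word first" tie-break are assumptions made for the sketch, and the recursion is left unoptimized.

```python
def decompound(word, dictionary, interfixes=()):
    """Return one exact split of `word` into dictionary words, or None.

    Rule 1: the sub-words must cover the whole word, nothing more or less.
    Rule 2: one interfix from `interfixes` may sit between two sub-words;
            it is dropped from the result.
    """
    def split_from(start):
        # Try the longest candidate first so "arbeider" wins over "arbeid".
        for end in range(len(word), start, -1):
            part = word[start:end]
            if part not in dictionary:
                continue
            if end == len(word):
                return [part]
            # Continue right after the sub-word, or after skipping an interfix.
            for glue in ("",) + tuple(interfixes):
                next_start = end + len(glue)
                if word.startswith(glue, end) and next_start < len(word):
                    rest = split_from(next_start)
                    if rest:
                        return [part] + rest
        return None

    return split_from(0)


if __name__ == "__main__":
    print(decompound("arbeiderpartiet",
                     {"arbeider", "arbeid", "er", "partiet", "parti", "et"}))
    print(decompound("barnehage",
                     {"barn", "hage", "arn", "arne", "age"},
                     interfixes=("e", "s")))
```

With the dictionaries from the two examples, this yields arbeider + partiet and barn + hage, and never emits the embedded words er, et, parti, arn, arne or age.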

babadofar commented 8 years ago

Idea: remove mentions of the hyphenation decompounder from the documentation

udit7590 commented 8 years ago

@babadofar There is an option for that (Do not add: arn, arne, age): only_longest_match: true. But that's not working for me either. I don't know if I'm getting it wrong. Also, hyphenation_decompounder is not working in my case either.

clintongormley commented 8 years ago

@babadofar Found the answer. The token filter in Lucene only supports FOP v1.2 compatible hyphenation patterns. v2.0 and above is not supported. I've updated the docs to link to the supported format: https://github.com/elastic/elasticsearch/commit/9a851a58b95b17250f2c6bb4543370371b193fc4

clintongormley commented 8 years ago

Hopefully Lucene will add support for later versions, then we will inherit that

iamjochem commented 7 years ago

I just tried using the hyphenation_decompounder filter with the v1.2 FOP files and found that it still does nothing (ES logs no errors either). Changing the filter type to dictionary_decompounder made it work.

Info:
ES version: 2.2
FOP XML file: nl.xml
Dictionary words: mes, messen, pot, vis

my filter config looks like so:

{
    "analysis"    : {
        "filter"        : {
            "nl_dehyphenate": {
                "type"                      : "dictionary_decompounder",
                "word_list_path"            : "/path/to/relevant/config/files/nl.txt",
                "hyphenation_patterns_path" : "/path/to/relevant/config/files/nl.xml",
                "min_word_size"             : 6,
                "min_subword_size"          : 3,
                "max_subword_size"          : 15
            }
        }
    }
}

possibly (probably?) I'm doing something wrong ... but maybe this is an indication that the hyphenation_decompounder filter is borked somehow.

hbruch commented 6 years ago

I just stumbled over the same issue. It seems to be caused by a byte shift issue in org.apache.lucene.analysis.compound.hyphenation.HyphenationTree.getValue(), where the line

char c = (char) ((v >>> 4) - 1);

should read

char c = (char) (((v & 0xff) >>> 4) - 1);

because otherwise the hyphenation indicator may be restored as negative value.

However, hyphenation indicators 1-6 should work, so chaplin6pris would do the job.
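The effect of the missing mask can be reproduced outside Java. The Python sketch below emulates Java's integer semantics (a signed `byte` is sign-extended to a 32-bit `int` before `>>>`, and a cast to `char` keeps the low 16 bits). It assumes, for illustration only, that an indicator digit is stored as `digit + 1` in the high four bits of a byte; the helper names are hypothetical and are not part of Lucene.

```python
def to_java_byte(value):
    """Interpret the low 8 bits of `value` as a signed Java byte (-128..127)."""
    value &= 0xFF
    return value - 0x100 if value >= 0x80 else value


def pack_high_nibble(indicator):
    """Assumed storage scheme: (indicator + 1) in the high 4 bits of a byte."""
    return to_java_byte((indicator + 1) << 4)


def decode_unmasked(v):
    """Emulate Java's (char) ((v >>> 4) - 1) for a signed byte v."""
    as_int = v & 0xFFFFFFFF               # sign-extended 32-bit view of the byte
    return ((as_int >> 4) - 1) & 0xFFFF   # the cast to char keeps 16 bits


def decode_masked(v):
    """Emulate Java's (char) (((v & 0xff) >>> 4) - 1)."""
    return (((v & 0xFF) >> 4) - 1) & 0xFFFF


if __name__ == "__main__":
    for digit in range(10):
        stored = pack_high_nibble(digit)
        print(digit, stored, decode_unmasked(stored), decode_masked(stored))
```

Under these assumptions, indicators 0 through 6 survive the unmasked decode, because their stored byte is non-negative, while 7 through 9 set the sign bit and come back as a different value. The masked variant restores every digit, which is consistent with the observation that chaplin6pris works where chaplin7pris does not.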

I suggest reopening this issue until it is fixed in Lucene.

hbruch commented 6 years ago

See also issue LUCENE-8124

romseygeek commented 6 years ago

This will be fixed when we upgrade to lucene 7.3

cc @elastic/es-search-aggs

dschneiter commented 6 years ago

Elasticsearch 6.4 is using Lucene 7.4.0, but I still can't get the hyphenation_decompounder token filter to work. I tried it with the German hyphenation grammar rule file de_DR.xml and also by manually creating a file containing chaplin6pris, but had no luck with either of them. ^^ @clintongormley

ddeboer commented 4 years ago

In Elasticsearch 7.4.0/Lucene 8.2.0 this is still not working.

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)