babadofar opened this issue 9 years ago (status: Open)
Hi @babadofar
I've just had a play with it in 2.0 and am seeing the same thing. I haven't used it before, so may be missing something obvious. Requires investigation...
Thanks! Glad not to be the only one with this experience ;) But really, even if the hyphenation decompounder did do what it says, I'm not sure it would help with splitting compound words in sensible places. The hyphenation patterns from TeX are meant as guidelines for where words may be broken so that text flows readably across the screen; the word parts produced by the hyphenation algorithm do not quite correspond to meaningful words. The dictionary decompounder actually comes closer, with its brute-force logic. To me it seems the dictionary decompounder would do a great job with Scandinavian compound words with only two additions:

1. Keep only the longest matches, instead of adding every dictionary word found inside the compound.
2. Handle interfixes (linking letters such as "e" and "s" between the word parts).

The examples below illustrate both points, and a sketch of the current behaviour follows them.
Example for 1:

- Dictionary contains: arbeider, arbeid, er, partiet, parti, et
- Compound word: arbeiderpartiet
- Should be split into: arbeider, partiet
- Do not add: er, et, parti

Example for 2:

- Dictionary contains: barn, hage, arn, arne, age
- Interfix list contains: e, s
- Compound word: barnehage
- Should be split into: barn, hage
- Remove interfix: e
- Do not add: arn, arne, age
Idea: remove mentions of the hyphenation decompounder from the documentation
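To make the first point concrete, here is a sketch of the current behaviour for example 1, using an inline filter definition with the `_analyze` API (supported on more recent ES versions; the dictionary is the one from example 1):

```json
POST /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "dictionary_decompounder",
      "word_list": ["arbeider", "arbeid", "er", "partiet", "parti", "et"]
    }
  ],
  "text": "arbeiderpartiet"
}
```

This returns the original token plus every dictionary word contained in it (arbeider, arbeid, er, partiet, parti, et), rather than just arbeider and partiet.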
@babadofar There is an option for that ("Do not add: arn, arne, age"): `only_longest_match: true`. But that's not working for me either; don't know if I'm getting it wrong. Also, the `hyphenation_decompounder` isn't working in my case.
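For reference, a sketch of how that option is set (the filter name here is made up; the word list is the one from example 1 above):

```json
{
  "analysis": {
    "filter": {
      "no_decompounder": {
        "type": "dictionary_decompounder",
        "word_list": ["arbeider", "arbeid", "er", "partiet", "parti", "et"],
        "only_longest_match": true
      }
    }
  }
}
```

If I read Lucene's implementation correctly, `only_longest_match` only keeps the longest match per start offset, so shorter dictionary words starting elsewhere in the token (like "er") are still emitted, which would explain why it doesn't fully solve example 1.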
@babadofar Found the answer. The token filter in Lucene only supports FOP v1.2-compatible hyphenation patterns; v2.0 and above are not supported. I've updated the docs to link to the supported format: https://github.com/elastic/elasticsearch/commit/9a851a58b95b17250f2c6bb4543370371b193fc4

Hopefully Lucene will add support for later versions; then we will inherit it.
I just tried using the `hyphenation_decompounder` filter with the v1.2 FOP files and found that it still does nothing (ES logs no errors either); changing the filter type to `dictionary_decompounder` made it work.
| Info | Value |
|---|---|
| ES version | 2.2 |
| FOP XML file | nl.xml |
| Dictionary words | mes, messen, pot, vis |
My filter config looks like this:

```json
{
  "analysis": {
    "filter": {
      "nl_dehyphenate": {
        "type": "dictionary_decompounder",
        "word_list_path": "/path/to/relevant/config/files/nl.txt",
        "hyphenation_patterns_path": "/path/to/relevant/config/files/nl.xml",
        "min_word_size": 6,
        "min_subword_size": 3,
        "max_subword_size": 15
      }
    }
  }
}
```
Possibly (probably?) I'm doing something wrong ... but maybe this is an indication that the `hyphenation_decompounder` filter is borked somehow.
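For comparison, this is roughly what the hyphenation variant of the same filter should look like according to the docs (a sketch reusing the paths from the config above; `hyphenation_decompounder` takes the XML patterns file plus a word list used to vet candidate subwords):

```json
{
  "analysis": {
    "filter": {
      "nl_dehyphenate": {
        "type": "hyphenation_decompounder",
        "hyphenation_patterns_path": "/path/to/relevant/config/files/nl.xml",
        "word_list_path": "/path/to/relevant/config/files/nl.txt",
        "min_word_size": 6,
        "min_subword_size": 3,
        "max_subword_size": 15
      }
    }
  }
}
```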
I just stumbled over the same issue. It seems to be caused by a sign-extension bug in org.apache.lucene.analysis.compound.hyphenation.HyphenationTree.getValue(), where the line

`char c = (char) ((v >>> 4) - 1);`

should read

`char c = (char) (((v & 0xff) >>> 4) - 1);`

because v is a byte: without the `& 0xff` mask it is sign-extended to an int before the shift (e.g. `(byte) 0x95 >>> 4` yields 0x0FFFFFF9 instead of 9), so a hyphenation indicator stored in a byte with the high bit set is restored as a negative value.
However, hyphenation indicators 1-6 should work, so `chaplin6pris` would do the job.
I suggest reopening this issue until it is fixed in Lucene.
See also issue LUCENE-8124.
This will be fixed when we upgrade to Lucene 7.3.
cc @elastic/es-search-aggs
Elasticsearch 6.4 is using Lucene 7.4.0, but I still can't get the `hyphenation_decompounder` token filter to work. I tried it with the German hyphenation grammar file de_DR.xml and also by manually creating a file containing `chaplin6pris`, but no luck with either of them.
^^ @clintongormley
In Elasticsearch 7.4.0/Lucene 8.2.0 this is still not working.
Pinging @elastic/es-search (Team:Search)
Pinging @elastic/es-search-relevance (Team:Search Relevance)
Hi, I'm trying to use the hyphenation decompounder to split the Norwegian word "chaplinpris" into its two word parts "chaplin" and "pris", but I'm having no luck, following the instructions on this page: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-compound-word-tokenfilter.html
I handcrafted my own hyphenation pattern file:
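Along these lines (a minimal sketch; the exact FOP wrapper elements here are my reconstruction):

```xml
<?xml version="1.0" encoding="utf-8"?>
<hyphenation-info>
  <patterns>
    chaplin7pris
  </patterns>
</hyphenation-info>
```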
The 7 between chaplin and pris should mean that this position is highly eligible for a word split (some documentation for the patterns can be found at http://xmlgraphics.apache.org/fop/1.1/hyphenation.html#patterns).
The analysis settings:
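Presumably something along these lines; the filter and analyzer names and the patterns path are placeholders, and the word list holds the two expected parts:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "norwegian_decompounder": {
          "type": "hyphenation_decompounder",
          "hyphenation_patterns_path": "analysis/hyph_no.xml",
          "word_list": ["chaplin", "pris"]
        }
      },
      "analyzer": {
        "norwegian_compound": {
          "tokenizer": "standard",
          "filter": ["lowercase", "norwegian_decompounder"]
        }
      }
    }
  }
}
```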
Testing the analysis returns no splits:
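The test and its output would look roughly like this (index and analyzer names are placeholders; the point is that the compound comes back as a single token):

```json
POST /my_index/_analyze
{
  "analyzer": "norwegian_compound",
  "text": "chaplinpris"
}
```

```json
{
  "tokens": [
    {
      "token": "chaplinpris",
      "start_offset": 0,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
```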
Using the same analysis settings but replacing the `hyphenation_decompounder` type with `dictionary_decompounder` works fine. So it seems the hyphenation decompounder doesn't do what it's supposed to?