asfimport opened this issue 6 years ago (status: Open)
Adrien Grand (@jpountz) (migrated from JIRA)
I agree this sounds wrong. Unfortunately, inserting positions in a token filter is hard to do right if the analysis chain has a preceding token filter that injects synonyms, as you need to fix positions on all paths. This issue touches on that problem a bit: #6076.
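For illustration (a made-up example, not from this issue): suppose a synonym filter has already emitted soccerpump as a synonym of fußballpumpe, both at position 0 with posLength=1. If a decompounder afterwards inserts a position for the subwords, the parallel synonym path has to be rewritten too:

```
synonym filter output:        pos 0: fußballpumpe (posLen=1)
                              pos 0: soccerpump   (posLen=1)

after inserting a position:   pos 0: fußball      pos 1: pumpe
                              pos 0: soccerpump   (posLen must become 2
                                                   to keep spanning the compound)
```

A token filter only sees one token at a time, so fixing posLength on a parallel path it has already emitted is not generally possible.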
Holger Bruch (migrated from JIRA)
Ok, this seems hard to get right in all cases. I wonder if the current implementation can work at query time for anyone. In the meantime, I'm working on a fix for HyphenationDecompounderTokenFilter that handles offsets, posInc, and posLength, though not for the case where a synonym filter is applied before it.
Robert Muir (@rmuir) (migrated from JIRA)
Maybe the right solution is just to fix it correctly and simply enforce input instanceof Tokenizer? Because it's really an extension of tokenization.
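A minimal sketch of what that check could look like (hypothetical code, just to make the proposal concrete; not a committed change):

```java
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;

// Hypothetical sketch only, not actual Lucene code: reject any chain in
// which another token filter runs before decompounding.
abstract class DecompoundingFilterBase extends TokenFilter {
  protected DecompoundingFilterBase(TokenStream input) {
    super(input);
    if (!(input instanceof Tokenizer)) {
      throw new IllegalArgumentException(
          "decompounding must directly consume a Tokenizer, got "
              + input.getClass().getName());
    }
  }
}
```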
Adrien Grand (@jpountz) (migrated from JIRA)
I'm not sure how practical this would be: some tokenizers today already set the position increment to 0 I think (JapaneseTokenizer?), and it would only allow one such filter in the analysis chain.
Robert Muir (@rmuir) (migrated from JIRA)
Why do you need to decompound more than once? The JapaneseTokenizer example is the same issue (it already decompounds).
Adrien Grand (@jpountz) (migrated from JIRA)
I haven't thought about concrete use cases, but for instance I suspect some users perform decompounding using both an algorithm and a dictionary?
Robert Muir (@rmuir) (migrated from JIRA)
That's what HyphenationDecompoundTokenFilter already does. I think maybe the name is confusing; at least look at the class javadocs :)
In this case, I'm sorry, but I think you are stretching (and you aren't correct). We should fix these filters and enforce a tokenizer as input, seriously.
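(For reference: in Lucene the class is HyphenationCompoundWordTokenFilter, and its dictionary-backed constructor combines the two approaches; a sketch, with the hyphenator assumed to be loaded elsewhere and the dictionary contents made up:)

```java
import java.util.Arrays;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;

class CombinedDecompounding {
  // The hyphenation grammar proposes break points (the "algorithm"),
  // and only candidate subwords found in the dictionary are emitted.
  static TokenStream build(Tokenizer source, HyphenationTree hyphenator) {
    CharArraySet dictionary =
        new CharArraySet(Arrays.asList("fußball", "pumpe"), true);
    return new HyphenationCompoundWordTokenFilter(source, hyphenator, dictionary);
  }
}
```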
Holger Bruch (migrated from JIRA)
I'm not as deep into Lucene as you are. What would be the pros and cons of ensuring the input is an instance of Tokenizer? Would it still be possible to apply token filters like WDF or a lowercase filter before the HyphenationDecompounder?
Adrien Grand (@jpountz) (migrated from JIRA)
No, the hyphenation decompounder would have to be the first token filter in the analysis chain.
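To make the implication concrete, a sketch of such a chain (assuming the hyphenation tree is loaded elsewhere):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

class DecompoundFirstAnalyzer extends Analyzer {
  private final HyphenationTree hyphenator; // assumed loaded via getHyphenationTree(...)

  DecompoundFirstAnalyzer(HyphenationTree hyphenator) {
    this.hyphenator = hyphenator;
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    // The decompounder consumes the Tokenizer directly ...
    TokenStream result = new HyphenationCompoundWordTokenFilter(source, hyphenator);
    // ... and filters like lowercasing can still run afterwards; WDF or
    // lowercasing *before* the decompounder would no longer be possible.
    result = new LowerCaseFilter(result);
    return new TokenStreamComponents(source, result);
  }
}
```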
HyphenationDecompoundTokenFilter and DictionaryDecompoundTokenFilter set positionIncrement to 0 for all subwords, reuse the start/end offsets of the original token, and ignore positionLength completely.
As a consequence, the QueryBuilder generates a SynonymQuery comprising all subwords, which should rather be treated as individual terms.
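To reproduce, something along these lines (the hyphenation grammar file name is a placeholder):

```java
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

public class DecompoundAttributesDemo {
  public static void main(String[] args) throws Exception {
    // "de.xml" is a placeholder for a real hyphenation grammar file.
    HyphenationTree hyphenator =
        HyphenationCompoundWordTokenFilter.getHyphenationTree("de.xml");
    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("fußballpumpe"));
    TokenStream ts = new HyphenationCompoundWordTokenFilter(tokenizer, hyphenator);

    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
    PositionLengthAttribute posLen = ts.addAttribute(PositionLengthAttribute.class);
    OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);

    ts.reset();
    while (ts.incrementToken()) {
      // Subwords come out with posInc=0, the original token's offsets,
      // and posLength left at its default of 1.
      System.out.printf("%s posInc=%d posLen=%d offsets=%d-%d%n",
          term, posInc.getPositionIncrement(), posLen.getPositionLength(),
          offset.startOffset(), offset.endOffset());
    }
    ts.end();
    ts.close();
  }
}
```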
Migrated from LUCENE-8132 by Holger Bruch, updated Jan 23 2018