
HyphenationDecompoundTokenFilter does not set position/offset attributes correctly [LUCENE-8132] #9180

Open asfimport opened 6 years ago

asfimport commented 6 years ago

HyphenationDecompoundTokenFilter and DictionaryDecompoundTokenFilter set positionIncrement to 0 for all subwords, reuse the start/end offsets of the original token, and ignore positionLength completely.

As a consequence, the QueryBuilder generates a SynonymQuery comprising all subwords, which should rather be treated as individual terms.
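
The behavior can be observed by consuming the token stream directly. A minimal sketch, assuming a recent Lucene classpath and using DictionaryCompoundWordTokenFilter (both decompounders inherit this attribute handling from CompoundWordTokenFilterBase); the toy dictionary and compound word are made up for illustration:

```java
import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class DecompoundAttributeDemo {
  public static void main(String[] args) throws Exception {
    // Toy dictionary of subwords; real setups load a full word list.
    CharArraySet dict = new CharArraySet(Arrays.asList("fuß", "ball", "pumpe"), true);

    Tokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("fußballpumpe"));
    TokenStream ts = new DictionaryCompoundWordTokenFilter(tokenizer, dict);

    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
    OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);

    ts.reset();
    while (ts.incrementToken()) {
      // Prints the original token with posInc=1, then each subword with
      // posInc=0 and the start/end offsets of the whole compound.
      System.out.printf("%s posInc=%d offsets=[%d,%d]%n",
          term, posInc.getPositionIncrement(), offset.startOffset(), offset.endOffset());
    }
    ts.end();
    ts.close();
  }
}
```

Every subword comes out at the same position and with the [0,12] offsets of the whole compound, which is exactly what makes QueryBuilder stack them into a SynonymQuery.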


Migrated from LUCENE-8132 by Holger Bruch, updated Jan 23 2018

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I agree this sounds wrong. Unfortunately, inserting positions in a token filter is hard to do right if the analysis chain has a preceding token filter that injects synonyms, since you need to fix positions on all paths. This issue touches this problem a bit: #6076.
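
To make the failure mode concrete, here is a sketch of a chain where a synonym filter precedes the decompounder (toy synonym pair and dictionary, assumed names, not taken from this issue):

```java
import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;

public class SynonymBeforeDecompound {
  public static void main(String[] args) throws Exception {
    // "fahrradpumpe" and its synonym "luftpumpe" end up stacked at the same
    // position (both with posLength=1), forming a two-path token graph.
    SynonymMap.Builder builder = new SynonymMap.Builder(true);
    builder.add(new CharsRef("fahrradpumpe"), new CharsRef("luftpumpe"), true);
    SynonymMap synonyms = builder.build();

    Tokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("fahrradpumpe"));
    TokenStream ts = new SynonymGraphFilter(tokenizer, synonyms, true);

    // If the decompounder assigned real positions to the subwords of
    // "fahrradpumpe" (fahrrad at 0, pumpe at 1), the parallel token
    // "luftpumpe" would need posLength=2 to reach the same graph node --
    // but a filter sees one token at a time and cannot patch tokens it
    // has already emitted downstream.
    CharArraySet dict = new CharArraySet(Arrays.asList("fahrrad", "luft", "pumpe"), true);
    ts = new DictionaryCompoundWordTokenFilter(ts, dict);

    ts.reset();
    while (ts.incrementToken()) { /* inspect attributes as in the demo above */ }
    ts.end();
    ts.close();
  }
}
```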

asfimport commented 6 years ago

Holger Bruch (migrated from JIRA)

Ok, it seems hard to get right for all cases. I wonder if the current implementation works at query time for anyone. However, I'm working on a fix for HyphenationDecompoundTokenFilter that handles offset, posInc and posLength, though not for the case where a synonym filter is applied before it.

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Maybe the right solution is just to fix it correctly and simply enforce input instanceof Tokenizer? Because it's really an extension of tokenization.
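
A minimal sketch of what such a guard could look like; this is hypothetical code illustrating the proposal, not anything committed to Lucene:

```java
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;

/**
 * Hypothetical base class illustrating the proposal: refuse any input that
 * is not the Tokenizer itself, so the decompounder always sees the raw,
 * single-path token stream.
 */
abstract class DecompoundingTokenFilter extends TokenFilter {
  protected DecompoundingTokenFilter(TokenStream input) {
    super(input);
    if (!(input instanceof Tokenizer)) {
      throw new IllegalArgumentException(
          "decompounding must directly follow the tokenizer, got "
              + input.getClass().getName());
    }
  }
}
```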

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I'm not sure how practical this would be: I think some tokenizers today already set the pos inc to 0 themselves (JapaneseTokenizer?), and it would only allow one such filter in the analysis chain.

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Why do you need to decompound more than once? The JapaneseTokenizer example is the same issue (it already decompounds).

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I haven't thought about concrete use-cases, but for instance I suspect some users perform decompounding using both an algorithm and a dictionary?

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

That's what HyphenationDecompoundTokenFilter already does. I think maybe the name is confusing; at least look at the class javadocs :)
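
For reference, the combination the javadocs describe pairs a hyphenation grammar with a dictionary: the hyphenator proposes candidate subwords and only dictionary words are emitted. A sketch, assuming a recent Lucene classpath; the pattern file name de_DR.xml and the dictionary entries are illustrative:

```java
import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.xml.sax.InputSource;

public class HyphenationPlusDictionaryDemo {
  public static void main(String[] args) throws Exception {
    // Hyphenation grammar (e.g. an OFFO/TeX pattern file for German).
    HyphenationTree hyphenator =
        HyphenationCompoundWordTokenFilter.getHyphenationTree(new InputSource("de_DR.xml"));

    // Subword candidates proposed by the hyphenator are only emitted if
    // they also occur in this dictionary: algorithm plus dictionary.
    CharArraySet dict = new CharArraySet(Arrays.asList("fuß", "ball", "pumpe"), true);

    Tokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("fußballpumpe"));
    TokenStream ts = new HyphenationCompoundWordTokenFilter(tokenizer, hyphenator, dict);
    // consume ts as usual: reset(), incrementToken() loop, end(), close()
  }
}
```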

In this case, I'm sorry, but I think you are stretching (and you aren't correct). We should fix these filters and enforce a tokenizer as input, seriously.

asfimport commented 6 years ago

Holger Bruch (migrated from JIRA)

I'm not as deep into Lucene as you are. What would be the pros and cons of ensuring the input is an instance of Tokenizer? Would it still be possible to apply token filters like WDF or a lowercase filter before the HyphenationDecompounder?

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

No, the hyphenation decompounder would have to be the first token filter in the analysis chain.
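
Under that restriction, an analyzer would have to be arranged like the following sketch (assumed Lucene 7+ API; the class name and toy dictionary are illustrative), with lowercasing moved after the decompounder:

```java
import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class DecompoundFirstAnalyzer extends Analyzer {
  private final CharArraySet dict =
      new CharArraySet(Arrays.asList("fuß", "ball", "pumpe"), true);

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    // The decompounder must come directly after the tokenizer ...
    TokenStream sink = new DictionaryCompoundWordTokenFilter(source, dict);
    // ... and filters like lowercasing move after it instead of before.
    sink = new LowerCaseFilter(sink);
    return new TokenStreamComponents(source, sink);
  }
}
```

The trade-off implied by the answer: dictionary matching would then run against the raw tokenizer output, so case folding or word-delimiter splitting could no longer be applied beforehand.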