FuzzyTermEnums sets negative boost for fuzzy search & highlight [LUCENE-9568]

asfimport commented 4 years ago

Description When user indexes a word with an apostrophe and constructs a fuzzy query for highlighter, it throws an exception with set negative boost for a query.

Repro Steps

Index a text with apostrophe. E.g. doesn't
Parse a fuzzy query e.g.: se\~, se\~2, se\~3
Try to highlight a text with apostrophe
The exception is thrown (for details see attached test test with repro steps)

Actual Result java.lang.IllegalArgumentException: boost must be a positive float, got -1.0

Expected Result

No exception.
Highlighting marks are inserted into a text.

Workaround

not known.

Migrated from LUCENE-9568 by Juraj Jurčo, 1 vote, updated Oct 14 2020 Attachments: FindSqlHighlightTest.java

asfimport commented 4 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Here's the user list discussion leading to this issue: https://markmail.org/thread/2gomdeis3xs7gs5t

... where @msokolov suggested:

I traced this to this block in FuzzyTermsEnum:

    if (ed == 0) { // exact match
      boostAtt.setBoost(1.0F);
    } else {
      final int codePointCount = UnicodeUtil.codePointCount(term);
      int minTermLength = Math.min(codePointCount, termLength);

      float similarity = 1.0f - (float) ed / (float) minTermLength;
      boostAtt.setBoost(similarity);
    }

where in your test ed (edit distance) was 2 and minTermLength 1,
leading to negative boost.

I don't really understand this code at all, but I wonder if it should
divide by maxTermLength instead of minTermLength?

I think his proposal makes sense? That would make similarity 0 for this case, which seems right, since the term is length 1, the candidate is edit distance 2 away, and is surely very irrelevant!

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

This will change the top hits returned by fuzzy queries so I suspect that some users will be a bit angry, but I can't think of a reason why minTermLength makes more sense than maxTermLength so +1 to this suggestion to avoid the case when the edit distance is greater than the minimum term length.

asfimport commented 4 years ago

Robert Muir (@rmuir) (migrated from JIRA)

It will break most of the top-N optimizations of the query. If you use the max term length to compute the boost, then the PQ can never optimize. That would be because there could always exist some unseen term with a length of say: 1MB (just for illustration) which would rank extremely high.

By using the min, it is bounded by the query term, and once that PQ fills with ed=2, the query can restrict itself further and only look for ed=1, ed=0, etc. Practically it is this way because that's how fuzzy was defined, with the min and the tie-breaker after this boost being term sort order (you can check the code of an ancient version like 2.9 to see: https://github.com/apache/lucene-solr/blob/releases/lucene/2.9.2/src/java/org/apache/lucene/search/FuzzyTermEnum.java#L189 ). We just exploited it for all its worth. It is especially important for small PQ (top-N) sizes such as spell checking.

Hopefully I have explained it ok, this thing is hairy :) Happy to try again if needed.

I think first we should make a test, ideally one that doesn't use highlighting? I think there should be an alternative, simpler fix that won't break the top-N optimization.

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

FWIW comments in TopTermsRewrite suggests that we used to allow negative boosts: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.6.3/lucene/core/src/java/org/apache/lucene/search/TopTermsRewrite.java#L163-L165.

asfimport commented 4 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Yeah, and I think it also explains why the problem only happens with highlighter. I'm guessing it doesn't use TopTermsRewrite, but something else.

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

> Hopefully I have explained it ok, this thing is hairy Happy to try again if needed.

Thanks @rmuir that helps; I think true understanding would require me to bang my head on the code though :)

asfimport commented 4 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Whoa, thanks @rmuir – that is indeed a very important reason to keep using minTermLength! I forgot about that :)

Maybe we could just do Math.max(0, similarity) to ensure we never try to set negative boost?

Let's also add a comment above where we set minTermLength indicating the importance?

asfimport commented 4 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I'm not sure its so simple: today with TopTermsRewrite we may give it negatives, and everything is happily sorted and top-n'd correctly, and it will just truncate them at the very end (when it goes to create the actual Query object). So it still gives a correct top-N. See Adrien's link.

If we were to do it any "sooner" (e.g. in fuzzytermsenum itself), it may compute an incorrect top-N.

I think the problem is just that only TopTermsRewrite does this (honestly, it is really the only one that should be used with this crazy query). The other more exotic rewrite methods that might be used by spans or highlighters don't do it, so there will be problems today in those cases.

cknoxrun commented 1 year ago

You might already be thinking of this beyond apostrophes, but just in case I've recently come across this (with Elasticsearch) for characters other than apostrophes, specifically Thai language characters (E.g. "การที่ได้ต้องแงงว่").

apache / lucene

FuzzyTermEnums sets negative boost for fuzzy search & highlight [LUCENE-9568] #10608