apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Assertion error in JapaneseTokenizer / KoreanTokenizer backtrace [LUCENE-10059] #11097

Closed: asfimport closed this issue 3 years ago

asfimport commented 3 years ago

There is a rare case which causes an AssertionError in the backtrace step of JapaneseTokenizer that we (Amazon Product Search) found in our tests.

If there is a text span of 1024 characters (determined by MAX_BACKTRACE_GAP) in which the regular backtrace is not triggered, a forced backtrace is applied. If the partial best path at that point happens to end at the last position, the final backtrace that is always applied at the end will try to backtrace from and to the same position, causing an AssertionError in RollingCharBuffer.get() when it tries to read an empty buffer range.
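
To make the failure mode concrete, here is a rough, illustrative sketch (not the project's actual test) of how this code path is exercised: drive JapaneseTokenizer over a span longer than MAX_BACKTRACE_GAP and consume the stream with assertions (-ea) enabled. Whether the forced-backtrace corner case is actually hit depends on the dictionary and the exact input, so the text below is only a stand-in:

    import java.io.StringReader;
    import org.apache.lucene.analysis.ja.JapaneseTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class BacktraceSketch {
      public static void main(String[] args) throws Exception {
        // Build a span well beyond the 1024-char forced-backtrace gap; the exact
        // characters needed to avoid a regular backtrace depend on the dictionary.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2048; i++) {
          sb.append('ア');
        }
        JapaneseTokenizer tokenizer =
            new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
        tokenizer.setReader(new StringReader(sb.toString()));
        CharTermAttribute termAtt = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        // Before the fix, this loop could trip the assert in RollingCharBuffer.get()
        // when run with -ea.
        while (tokenizer.incrementToken()) {
          System.out.println(termAtt.toString());
        }
        tokenizer.end();
        tokenizer.close();
      }
    }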

We are fixing it by returning prematurely in the backtrace() method when the from and to pos are the same:

    // Nothing to backtrace: a forced backtrace already ended at this position.
    if (endPos == lastBackTracePos) {
      return;
    }

The backtrace() method is essentially a no-op when this condition occurs, so when -ea is not enabled the tokenizer still outputs the correct tokens even without the fix.
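
As a side note, here is a minimal standalone illustration (not Lucene code) of why the bug only surfaces under -ea: Java assert statements are skipped at runtime unless assertions are enabled, so the violated assert (in Lucene's case inside RollingCharBuffer.get()) only throws when the JVM is started with -ea.

    public class AssertDemo {
      static void backtraceLike(int endPos, int lastBackTracePos) {
        // Stands in for the assert that guards the requested character range.
        assert endPos > lastBackTracePos : "empty backtrace range";
        // ... reading the range [lastBackTracePos, endPos) would be a no-op here ...
      }

      public static void main(String[] args) {
        backtraceLike(10, 10); // throws AssertionError only when run with java -ea
        System.out.println("completed; assertions were not enabled");
      }
    }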

We will open a PR for this issue.


Migrated from LUCENE-10059 by Anh Dung Bui, resolved Aug 30 2021. Attachments: LUCENE-10059-nori-9x.patch. Pull requests: https://github.com/apache/lucene/pull/254, https://github.com/apache/lucene-solr/pull/2557, https://github.com/apache/lucene/pull/284

asfimport commented 3 years ago

ASF subversion and git services (migrated from JIRA)

Commit 0c3c8ec09a7000889bb073fe982bb0a1b81933d2 in lucene's branch refs/heads/main from Dzung Bui https://gitbox.apache.org/repos/asf?p=lucene.git;h=0c3c8ec

LUCENE-10059: Fix an AssertionError when JapaneseTokenizer tries to backtrace from and to the same position (#254)

Co-authored-by: Anh Dung Bui <buidun@amazon.com>

asfimport commented 3 years ago

ASF subversion and git services (migrated from JIRA)

Commit 0fc8784ff7bfea8b2c3f2c0fa8a99826bfa40c6d in lucene-solr's branch refs/heads/branch_8x from Dzung Bui https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0fc8784

LUCENE-10059: Fix an AssertionError when JapaneseTokenizer tries to backtrace from and to the same position (#254) (#2557)

asfimport commented 3 years ago

Anh Dung Bui (migrated from JIRA)

The PR is merged, but I realized that KoreanTokenizer has the same issue, and also that it shares a significant amount of code with JapaneseTokenizer. Should we try to have a base class (maybe in analysis/common?) for all the shared code?
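
For illustration only, one hypothetical shape such a shared base class could take; the class and method names below (MorphTokenizerBase, maybeBacktrace) are invented for this sketch and do not correspond to anything that was committed:

    import org.apache.lucene.analysis.Tokenizer;

    public abstract class MorphTokenizerBase extends Tokenizer {
      // Shared constant that drives the forced backtrace discussed above.
      protected static final int MAX_BACKTRACE_GAP = 1024;

      // Language-specific subclasses (Japanese, Korean) supply the actual Viterbi
      // backtrace over their own dictionaries and cost models.
      protected abstract void backtrace(int endPos, int lastBackTracePos);

      // Shared guard: skip empty backtraces so the forced + final backtrace
      // combination can never ask RollingCharBuffer for an empty range.
      protected final void maybeBacktrace(int endPos, int lastBackTracePos) {
        if (endPos == lastBackTracePos) {
          return;
        }
        backtrace(endPos, lastBackTracePos);
      }
    }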

asfimport commented 3 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

> Should we try to have a base class (maybe in analysis/common?) for all the shared code?

I think so? I never really understood why this was so aggressively forked. There was quite some discussion about where to put the shared code if we did do this refactor, but I think we got bogged down and never sorted out that admin detail. #9860 might be related. Maybe @mocobeta or @danmuzi will remember where this got left...

asfimport commented 3 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

I'm sorry, I'm not familiar with the particular reason why KoreanTokenizer had to be forked from JapaneseTokenizer when it was created. I think that might have been explained or discussed in the original KoreanTokenizer issue, but I can't find it right away.

asfimport commented 3 years ago

Anh Dung Bui (migrated from JIRA)

Since both of them depend on analysis/common, I think we could put the shared code there; the unit tests might be a bit tricky since they depend on the dictionary.

I found the original Jira issue for the Nori tokenizer (#9278); it seems it was created from Kuromoji with some tweaks here and there, and it was agreed at the time that sharing code between the two wasn't needed initially.

I can open a PR to fix the issue in Nori and then another (Jira) for the refactoring, or I can try to put both in one PR. But the Kuromoji tokenizer is quite complicated and there could be a lot of code to refactor out, so I think the former is better. Does anyone have a preference on this?

asfimport commented 3 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Did we already backport this fix to 8.10 as well? In that case, shall we resolve this issue?

> I can put a PR to fix the issue in nori then another (Jira) to refactor, or I can also try to put both in one PR. But the Kuromoji tokenizer is quite complicated and there could be a lot of code to refactor out, I think the former is better. Does anyone have a preference on this?

Yeah +1 to do the two PRs separately!

asfimport commented 3 years ago

Anh Dung Bui (migrated from JIRA)

Thanks @mikemccand, we already backported it to Lucene 8.x. I'll resolve this and create a follow-up JIRA.

asfimport commented 3 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

I opened https://github.com/apache/lucene/pull/284 to generalize the fix to handle all backtraces (best path and nbest). I am also working on a separate PR to apply the fix to the Korean tokenizer. Thanks Anh Dung Bui and @mikemccand!

asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Closing after the 9.0.0 release

asfimport commented 2 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

> I am also working on a separate PR to apply the fix to the Korean tokenizer.

The same check was applied to Nori along with the recent refactoring of both tokenizers (https://github.com/apache/lucene/pull/805). The whole change can't be applied to branch_9x, so I'll backport just the boundary check for Nori to the 9x branch soon; we'll have it in 9.2.
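
For completeness, a sketch of the analogous exercise on the Nori side, assuming the lucene-analysis-nori module and its default KoreanTokenizer() constructor; as with the Kuromoji sketch above, the input is only illustrative and is not guaranteed to hit the corner case:

    import java.io.StringReader;
    import org.apache.lucene.analysis.ko.KoreanTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class NoriBacktraceSketch {
      public static void main(String[] args) throws Exception {
        // A long hangul run, again well beyond the 1024-char forced-backtrace gap.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2048; i++) {
          sb.append('가');
        }
        KoreanTokenizer tokenizer = new KoreanTokenizer(); // default dictionary and settings
        tokenizer.setReader(new StringReader(sb.toString()));
        CharTermAttribute termAtt = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        // Without the boundary check, this loop could trip the same assert under -ea.
        while (tokenizer.incrementToken()) {
          System.out.println(termAtt.toString());
        }
        tokenizer.end();
        tokenizer.close();
      }
    }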

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit cbf2e64c44f4a9e35afd22aefa51a2ac52c79b2d in lucene's branch refs/heads/branch_9x from Tomoko Uchida https://gitbox.apache.org/repos/asf?p=lucene.git;h=cbf2e64c44f

LUCENE-10059: Apply the same change to KoreanTokenizer

asfimport commented 2 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

Pushed the same change for KoreanTokenizer in branch_9x. For the main branch, we can apply a more general fix (https://github.com/apache/lucene/pull/284) after resolving the conflicts.