apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

TestKoreanTokenizer#testRandomHugeStrings failure [LUCENE-8676] #9722

Closed: asfimport closed this issue 5 years ago

asfimport commented 5 years ago

TestKoreanTokenizer#testRandomHugeStrings failed in CI with the following exception:

  [junit4]    > Throwable #1: java.lang.AssertionError
  [junit4]    >        at __randomizedtesting.SeedInfo.seed([8C5E2BE10F581CB:90E6857D4E833D83]:0)
  [junit4]    >        at org.apache.lucene.analysis.ko.KoreanTokenizer.add(KoreanTokenizer.java:334)
  [junit4]    >        at org.apache.lucene.analysis.ko.KoreanTokenizer.parse(KoreanTokenizer.java:707)
  [junit4]    >        at org.apache.lucene.analysis.ko.KoreanTokenizer.incrementToken(KoreanTokenizer.java:377)
  [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:748)
  [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:659)
  [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:561)
  [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:474)
  [junit4]    >        at org.apache.lucene.analysis.ko.TestKoreanTokenizer.testRandomHugeStrings(TestKoreanTokenizer.java:313)
  [junit4]    >        at java.lang.Thread.run(Thread.java:748)
  [junit4]   2> NOTE: leaving temporary files

I am able to reproduce locally with:

ant test  -Dtestcase=TestKoreanTokenizer -Dtests.method=testRandomHugeStrings -Dtests.seed=8C5E2BE10F581CB -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-7.7/test-data/enwiki.random.lines.txt -Dtests.locale=uk-UA -Dtests.timezone=Europe/Istanbul -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1

After some investigation I found that the buffer position is not updated when the maximum backtrace size (1024 characters) is reached.
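For readers unfamiliar with the tokenizer's Viterbi loop, here is a minimal, self-contained sketch of why the position bookkeeping must be rebased when a backtrace is forced by buffer size. This is not the actual KoreanTokenizer code; names such as MAX_BACKTRACE_GAP and lastBacktracePos are illustrative only.

```java
// Illustrative sketch (not Lucene's code): a Viterbi-style tokenizer that
// forces a partial backtrace once too many characters have been buffered,
// and must then rebase its position bookkeeping.
public class ForcedBacktraceSketch {

  static final int MAX_BACKTRACE_GAP = 1024; // cap on buffered chars between backtraces

  int lastBacktracePos = 0;  // absolute position of the last backtrace
  int pos = 0;               // absolute position of the next character to process

  void parse(String text) {
    while (pos < text.length()) {
      addNode(pos);          // extend the lattice at the current position
      pos++;
      // If too much text has been buffered, force a partial backtrace so the
      // lattice (and its positions array) stays bounded.
      if (pos - lastBacktracePos >= MAX_BACKTRACE_GAP) {
        backtrace(pos);
        // The fix for LUCENE-8676 was, in spirit, an update like this one:
        // without it, later addNode() calls compute offsets relative to a
        // stale base and trip an assertion.
        lastBacktracePos = pos;
      }
    }
    backtrace(pos);          // final backtrace at end of input
  }

  void addNode(int absolutePos) {
    int relative = absolutePos - lastBacktracePos;
    // In the real tokenizer, an out-of-range relative offset here is what the
    // failing assert in KoreanTokenizer.add() guards against.
    assert relative >= 0 && relative < MAX_BACKTRACE_GAP : "stale backtrace position";
  }

  void backtrace(int endPos) {
    // Emit the best path up to endPos and prune the lattice (omitted).
  }

  public static void main(String[] args) {
    StringBuilder huge = new StringBuilder();
    for (int i = 0; i < 5000; i++) huge.append('가'); // > 1024 chars forces backtraces
    new ForcedBacktraceSketch().parse(huge.toString());
    System.out.println("parsed without tripping the position assertion");
  }
}
```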


Migrated from LUCENE-8676 by Jim Ferenczi (@jimczi), resolved Feb 01 2019. Attachments: LUCENE-8676.patch

asfimport commented 5 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

Here is a patch that makes sure we update the position when we reach the maximum backtrace size.
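As a rough illustration of how this regression could be exercised, here is a hedged test sketch in the style of Lucene's BaseTokenStreamTestCase. The class name and test body are hypothetical; the actual coverage lives in TestKoreanTokenizer#testRandomHugeStrings, which also uses a custom user dictionary.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.ko.KoreanAnalyzer;

// Hypothetical regression check: analyze a string well past the 1024-char
// backtrace cap so at least one backtrace is forced by buffer size rather
// than by whitespace or punctuation.
public class TestKoreanTokenizerBigBuffer extends BaseTokenStreamTestCase {

  public void testBacktraceOverBigBuffer() throws Exception {
    Analyzer analyzer = new KoreanAnalyzer();
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 4096; i++) {
      sb.append('가');
    }
    // Before the fix this path tripped the assert in KoreanTokenizer.add().
    checkAnalysisConsistency(random(), analyzer, false, sb.toString());
    analyzer.close();
  }
}
```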

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit e9c02a6f71de3615a5c90f51b66f3709cbbd5e47 in lucene-solr's branch refs/heads/master from Jim Ferenczi https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e9c02a6

LUCENE-8676: The Korean tokenizer does not update the last position if the backtrace is caused by a big buffer (1024 chars).

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit e3ac4c9180a0eb6f1c7a3e49d1a8cda8669ae3fa in lucene-solr's branch refs/heads/branch_8x from Jim Ferenczi https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e3ac4c9

LUCENE-8676: The Korean tokenizer does not update the last position if the backtrace is caused by a big buffer (1024 chars).

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit bae3e24e8bcdac9a07d2b0592cba72bed2e5365e in lucene-solr's branch refs/heads/branch_8_0 from Jim Ferenczi https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=bae3e24

LUCENE-8676: The Korean tokenizer does not update the last position if the backtrace is caused by a big buffer (1024 chars).

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit 5667170cf58732384f185b2983b1f5a21d26369e in lucene-solr's branch refs/heads/branch_7x from Jim Ferenczi https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=5667170

LUCENE-8676: The Korean tokenizer does not update the last position if the backtrace is caused by a big buffer (1024 chars).

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit e05ed2ffb5a2df20163af9a7d8ea425b4218cade in lucene-solr's branch refs/heads/branch_7_7 from Jim Ferenczi https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e05ed2f

LUCENE-8676: The Korean tokenizer does not update the last position if the backtrace is caused by a big buffer (1024 chars).