apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Incorrect behavior for TestLaoBreakIterator.isWord() [LUCENE-5076] #6140

Open asfimport opened 11 years ago

asfimport commented 11 years ago

The incorrect behavior appears in version 4.3.1 and in revision 1496055.

Method "TestLaoBreakIterator.isWord" contains this loop:

for (int i = start; i < end; i += UTF16.getCharCount(codepoint)) {
    codepoint = UTF16.charAt(text, 0, end, start);

    if (UCharacter.isLetterOrDigit(codepoint))
        return true;
}

It appears that the code reads the same character on every iteration, irrespective of "i": the offset passed to "UTF16.charAt" is always "start". This looks incorrect. I think the code inside the loop should use "i", i.e., read the character at offset "i".

If the intended behavior is to read only one character, then the loop should not be necessary.
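
To make the difference concrete, here is a minimal sketch of the reported loop next to a version that reads at "i". It is not the Lucene code: to stay runnable without ICU4J it swaps in the JDK's Character.codePointAt, Character.charCount and Character.isLetterOrDigit for the UTF16 and UCharacter calls, and the class and method names are hypothetical. In the original code the analogous change would presumably be to pass "i" instead of "start" as the last argument of UTF16.charAt.

```java
// Sketch only: JDK methods stand in for ICU4J's UTF16/UCharacter (assumption),
// and IsWordSketch, isWordAsReported, isWordUsingI are hypothetical names.
class IsWordSketch {

    // Mirrors the reported loop: the code point is always read at `start`.
    static boolean isWordAsReported(char[] text, int start, int end) {
        for (int i = start; i < end; ) {
            int codepoint = Character.codePointAt(text, start, end); // ignores i
            if (Character.isLetterOrDigit(codepoint))
                return true;
            i += Character.charCount(codepoint);
        }
        return false;
    }

    // Reads the code point at `i`, so every code point in [start, end) is checked.
    static boolean isWordUsingI(char[] text, int start, int end) {
        for (int i = start; i < end; ) {
            int codepoint = Character.codePointAt(text, i, end);
            if (Character.isLetterOrDigit(codepoint))
                return true;
            i += Character.charCount(codepoint);
        }
        return false;
    }

    public static void main(String[] args) {
        char[] text = " a".toCharArray(); // a space followed by a letter
        System.out.println(isWordAsReported(text, 0, text.length)); // prints "false"
        System.out.println(isWordUsingI(text, 0, text.length));     // prints "true"
    }
}
```

With the input " a", the reported form only ever inspects the leading space and misses the letter, while the version indexed by "i" finds it.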

A similar problem appears in method "BreakIteratorWrapper.BIWrapper.calcStatus" for this loop:

for (int i = begin; i < end; i += UTF16.getCharCount(codepoint)) {
    codepoint = UTF16.charAt(text, 0, end, begin);

    if (UCharacter.isDigit(codepoint))
        return RuleBasedBreakIterator.WORD_NUMBER;
    else if (UCharacter.isLetter(codepoint)) {
        // TODO: try to separately specify ideographic, kana? 
        // [currently all bundled as letter for this case]
        return RuleBasedBreakIterator.WORD_LETTER;
    }
}

Again, the computation inside the loop does not use "i", which seems incorrect. It appears that the code is reading only one character again and again, irrespective of "i".
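
For this second loop, a sketch of what reading at "i" could look like is below. The same caveats apply: JDK Character methods replace the ICU4J calls, the status constants are local placeholders rather than the real RuleBasedBreakIterator fields, and returning WORD_NONE when nothing matches is an assumption, since the code after the loop is not quoted in this report.

```java
// Sketch only: CalcStatusSketch and its constants are hypothetical stand-ins
// for RuleBasedBreakIterator.WORD_* and are not taken from the Lucene source.
class CalcStatusSketch {
    static final int WORD_NONE = 0;     // assumed fallback when no letter or digit is found
    static final int WORD_NUMBER = 100; // placeholder value
    static final int WORD_LETTER = 200; // placeholder value

    // Classifies the span [begin, end) by its first digit or letter,
    // reading the code point at `i` on each iteration.
    static int calcStatus(char[] text, int begin, int end) {
        for (int i = begin; i < end; ) {
            int codepoint = Character.codePointAt(text, i, end);
            if (Character.isDigit(codepoint))
                return WORD_NUMBER;
            else if (Character.isLetter(codepoint))
                return WORD_LETTER;
            i += Character.charCount(codepoint); // step over surrogate pairs as one unit
        }
        return WORD_NONE;
    }
}
```

Under these assumptions, a span such as "-7" is classified as a number and "-a" as a letter, whereas a loop that always reads at "begin" would only ever see the hyphen.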


Migrated from LUCENE-5076 by Adrian Nistor

Environment: any

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks for reporting this: I think it's a bug.