Closed asfimport closed 3 years ago
Robert Muir (@rmuir) (migrated from JIRA)
If they are just doing NFKC, then normalization won't impact most tokenizers (standard, icu) so just use the tokenfilter instead? it doesn't have these issues.
The charfilter should only be needed to try to "cleanup" for tokenizers that don't understand unicode, so that they will then tokenize properly.
Jim Ferenczi (@jimczi) (migrated from JIRA)
They use the kuromoji tokenizer, so I think there's some value in applying NFKC as a char filter?
Michael Gibney (@magibney) (migrated from JIRA)
@jimczi I adapted your informal "performance benchmark" test (LUCENE-9177_LUCENE-8972.patch) to also evaluate a third approach: instead of using ICUNormalizer2CharFilter, I tried using the new ICUTransform2CharFilter, under consideration at #10015 (and PR #15). In place of the nfkc_cf normalizer, I used a compound transliterator, configured as Any-NFKD; Any-CaseFold; Any-NFC, which should be (I think?) equivalent.
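The claimed equivalence between the compound transliterator and nfkc_cf can be sketched outside of ICU. The following is a rough illustration using Python's stdlib unicodedata (standing in for ICU's transliterators, which aren't available here); `str.casefold` plays the role of Any-CaseFold, and the function name is mine, not anything from the patch:

```python
import unicodedata

def nfkc_cf_approx(s: str) -> str:
    """Approximate nfkc_cf by chaining NFKD, Unicode case folding,
    and NFC -- mirroring the compound transliterator
    Any-NFKD; Any-CaseFold; Any-NFC."""
    return unicodedata.normalize(
        "NFC", unicodedata.normalize("NFKD", s).casefold()
    )

# U+2103 DEGREE CELSIUS decomposes to "°C", then case-folds to "°c"
print(nfkc_cf_approx("\u2103"))  # -> "°c"
# U+1E9E LATIN CAPITAL LETTER SHARP S case-folds to "ss"
print(nfkc_cf_approx("\u1e9e"))  # -> "ss"
```

Both outputs match what ICU's nfkc_cf normalizer produces for these inputs, which is the sense in which the two pipelines should agree.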
The performance improvement was quite dramatic:
1> normalized 400000 chars in 2364ms
1> transformed 400000 chars in 181ms
1> normalized 400000 chars in 49ms
The first of these uses the ICUNormalizer2CharFilter (the performance issue we're discussing); the second is the approach I introduced (ICUTransform2CharFilter); the third is your point of comparison using ICUNormalizer2Filter (a TokenFilter).
Note that the first two should be functionally identical. In contrast, the last ends up normalizing the whole input string at once (whitespace tokenizer over a string with no whitespace) and doesn't track any offsets.
My initial (admittedly fuzzy) hypothesis is that the performance difference is accounted for by the fact that the new ICUTransform2CharFilter uses the Replaceable
ICU abstraction to do offset correction "from within", while ICUNormalizer2CharFilter requires firmer boundaries up front in order to do "external" offset correction?
In any case it's worth noting that for Japanese text, there may be other Transliteration ids that could be especially useful as CharFilters (i.e., pre-Tokenizer) – this was indeed the main motivation for #10015.
Michael Gibney (@magibney) (migrated from JIRA)
In case anyone digs into this: my initial approach was to try to use an NFKC_CF transliterator directly, but there is no normalization transliterator registered for NFKC_CF. I would have expected the ICU code to throw an "Invalid ID" IllegalArgumentException when requesting id NFKC_CF, but it doesn't fail: it hands you what I'm pretty sure is an NFKC instance (no _CF) instead (Transliterator.getInstance(String id) seems to be content with an unambiguous prefix?). This is a shame, because the NormalizationTransliterator ctor is private, so there's no easy way to do a direct comparison.
Apologies for seemingly hijacking this issue, but I thought the #10015 stuff was interesting, and useful at least as a possible alternative or point of comparison.
To turn the focus back to this issue in particular, I wonder whether it would be possible to set an arbitrary threshold at which input could be split more eagerly (as @jimczi initially suggested), but also rely on idempotency, normalizing text with a small amount of overlap (across the split boundary) to avoid introducing functional changes as a result of the arbitrary splitting. Perhaps this is relevant?:
While it is true that none of the normalization forms are closed under string concatenation, an optimized concatenation function can be written to produce a normalized concatenation from normalized strings. This is possible, because at most a few characters in the immediate area of the adjoined strings need processing.
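The quoted point can be illustrated with a quick sketch using Python's stdlib unicodedata (standing in for ICU): two strings that are each NFC-normalized on their own can concatenate into a non-normalized string, but only the characters around the seam need reprocessing.

```python
import unicodedata

left = "e"        # NFC-normalized on its own
right = "\u0301"  # COMBINING ACUTE ACCENT; also NFC-normalized on its own
assert unicodedata.is_normalized("NFC", left)
assert unicodedata.is_normalized("NFC", right)

joined = left + right
# The naive concatenation is NOT in NFC...
assert not unicodedata.is_normalized("NFC", joined)
# ...but renormalizing across the seam recovers it:
print(unicodedata.normalize("NFC", joined))  # -> "é" (U+00E9)
```

This is why normalizing overlapping windows across an arbitrary split boundary could in principle avoid functional changes: only the seam region is affected.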
Michael Gibney (@magibney) (migrated from JIRA)
Upon further investigation I actually don't think the "concatenation" question is a concern. I just opened a PR #199, which altogether avoids the "normalization-inert character" requirement in a way that I think should be safe/robust. I restored the "performance minitest patch" (LUCENE-9177-benchmark-test.patch) to something very like the original posted by Jim, with encouraging results:
1> normalized 400000 chars in 71ms
1> normalized 400000 chars in 64ms
I don't think this is at all detrimental to performance for the common case. If anything I think it might help somewhat by reducing required over-buffering even for the common case (because the extant buffering actually requires normalization-inert characters to be placed arbitrarily at the "128th char" – the last element in the internal buffer).
Michael Gibney (@magibney) (migrated from JIRA)
@jimczi, @rmuir: wondering if either of you have had a chance to look at the associated PR? It's a pretty manageable-sized change, and I think it directly addresses the concern raised in this issue. (fwiw, I beasted the TestICUNormalizer2CharFilter suite several hundred times and encountered no problems).
Robert Muir (@rmuir) (migrated from JIRA)
i haven't had a chance to test it out, I didn't have any plan yet given that the randomized test is completely disabled: https://github.com/apache/lucene/blob/main/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/TestICUNormalizer2CharFilter.java#L226-L227
Michael Gibney (@magibney) (migrated from JIRA)
Interesting; some of the randomized tests were still enabled, but I confess I was relying on existing, enabled tests to catch regressions and had not considered the disabled test you pointed out. That said, I just re-enabled that test locally and am beasting without encountering any problems – neither on current main branch, nor with the patch for LUCENE-9177 applied. I wonder whether #6657 might have been fixed incidentally by some more general fix to CharFilter offset correction?
Robert Muir (@rmuir) (migrated from JIRA)
it may be the case. it shouldn't hold up your change really, sorry i've just been busy. I need to study the issue, but it sounds like the previous implementation did incremental normalization inefficiently, and the PR fixes this? there are more "safepoints" than just inert characters.
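The "more safepoints than inert characters" idea can be sketched in a few lines. This is not the PR's actual Java implementation (which uses ICU's Normalizer2 boundary checks); it's an illustrative Python sketch where combining class 0 serves as a crude stand-in for a "boundary before" test, and `incremental_nfc` is a name I made up:

```python
import unicodedata

def incremental_nfc(chunks):
    """Sketch of incremental normalization: within each buffered chunk,
    only flush text up to the last character that starts a new combining
    sequence (combining class 0 here approximates ICU's hasBoundaryBefore).
    Text after that point may still interact with upcoming input, so it
    stays buffered."""
    buf = ""
    out = []
    for chunk in chunks:
        buf += chunk
        cut = 0  # index of the last "safe" boundary seen so far
        for i, ch in enumerate(buf):
            if unicodedata.combining(ch) == 0:
                cut = i
        out.append(unicodedata.normalize("NFC", buf[:cut]))
        buf = buf[cut:]
    out.append(unicodedata.normalize("NFC", buf))  # flush the tail
    return "".join(out)

# A combining mark arriving in a later chunk still composes correctly:
print(incremental_nfc(["e", "\u0301x"]))  # -> "éx"
```

Note that the degenerate case (input with no boundaries at all) would still keep the buffer growing; the point of the sketch is only that boundaries are far more plentiful than normalization-inert characters.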
Robert Muir (@rmuir) (migrated from JIRA)
I re-enabled this test on top of your branch. I am beasting it (inefficiently: bash script) with this test re-enabled:
#!/usr/bin/env bash
set -ex
while true; do
./gradlew -p lucene/analysis/icu -Dtests.nightly=true -Dtests.multiplier=10 test
done
I'll give it a little time to run. I'm not sure if we should re-enable the test for this issue. Nobody ever debugged to the bottom of why it failed. In the past we have found bugs in ICU with our random tests... an ICU upgrade may have fixed the issue (or dodged it via changes to unicode).
At the same time this component is super-hairy and needs some serious testing :)
Honestly, I didn't understand this charfilter's logic before, but I will give a try to reviewing this PR. For sure, we shouldn't be looking for inert characters.
ASF subversion and git services (migrated from JIRA)
Commit c3482c99ffd9b30acb423e63760ebc7baab9dd26 in lucene's branch refs/heads/main from Michael Gibney https://gitbox.apache.org/repos/asf?p=lucene.git;h=c3482c9
LUCENE-9177: ICUNormalizer2CharFilter streaming no longer depends on presence of normalization-inert characters (#199)
Normalization-inert characters need not be required as boundaries for incremental processing. It is sufficient to check hasBoundaryAfter and hasBoundaryBefore, substantially improving worst-case performance.
Robert Muir (@rmuir) (migrated from JIRA)
Thanks @magibney a lot for taking care of this! I'm backporting this fix to 8.10 due to the performance trap (doing some more testing first).
For the #6657 test, let's discuss that over there. I will open a PR.
ASF subversion and git services (migrated from JIRA)
Commit 4c95d3ef597dd12bbcfa0153f516539fca0a8e69 in lucene-solr's branch refs/heads/branch_8x from Michael Gibney https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4c95d3e
LUCENE-9177: ICUNormalizer2CharFilter streaming no longer depends on presence of normalization-inert characters (#199)
Normalization-inert characters need not be required as boundaries for incremental processing. It is sufficient to check hasBoundaryAfter and hasBoundaryBefore, substantially improving worst-case performance.
Michael Gibney (@magibney) (migrated from JIRA)
Thanks @rmuir!
Timothy Potter (@thelabdude) (migrated from JIRA)
Closing after the 8.10.0 release
ICUNormalizer2CharFilter is fast most of the time, but we've had some reports in Elasticsearch that unrealistic data can slow down the process very significantly. For instance, an input that consists of characters to normalize with no normalization-inert character in between can take up to several seconds to process a few hundred kilobytes on my machine. While the input is not realistic, this worst case can slow down indexing considerably when dealing with uncleaned data.
I attached a small test that reproduces the slow processing using a stream that contains many repetitions of the character ℃ and no normalization-inert character. I am not surprised that the processing is slower than usual, but several seconds seems a lot. Adding a normalization-inert character makes the processing a lot faster, so I wonder if we can improve the process to split the input more eagerly?
Migrated from LUCENE-9177 by Jim Ferenczi (@jimczi), resolved Jul 14 2021
Attachments: lucene.patch, LUCENE-9177_LUCENE-8972.patch, LUCENE-9177-benchmark-test.patch
Pull requests: https://github.com/apache/lucene/pull/199