Closed asfimport closed 3 years ago
Robert Muir (@rmuir) (migrated from JIRA)
If they are just doing NFKC, then normalization won't impact most tokenizers (standard, icu) so just use the tokenfilter instead? it doesn't have these issues.
The charfilter should only be needed to try to "cleanup" for tokenizers that don't understand unicode, so that they will then tokenize properly.
Jim Ferenczi (@jimczi) (migrated from JIRA)
They use the kuromoji tokenizer, so I think there's some value in applying NFKC as a char filter?
Michael Gibney (@magibney) (migrated from JIRA)
@jimczi I adapted your informal "performance benchmark" test (LUCENE-9177_LUCENE-8972.patch) to also evaluate a third approach: instead of using ICUNormalizer2CharFilter, I tried using the new ICUTransform2CharFilter, under consideration at #10015 (and PR #15). In place of the nfkc_cf normalizer, I used a compound transliterator, configured as Any-NFKD; Any-CaseFold; Any-NFC, which should be (I think?) equivalent.
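The claimed equivalence between the compound transliterator and nfkc_cf can be sketched outside of ICU. The following is a rough illustration using Python's stdlib unicodedata (standing in for ICU's transliterators, which aren't available here); `str.casefold` plays the role of Any-CaseFold, and the function name is mine, not anything from the patch:

```python
import unicodedata

def nfkc_cf_approx(s: str) -> str:
    """Approximate nfkc_cf by chaining NFKD, Unicode case folding,
    and NFC -- mirroring the compound transliterator
    Any-NFKD; Any-CaseFold; Any-NFC."""
    return unicodedata.normalize(
        "NFC", unicodedata.normalize("NFKD", s).casefold()
    )

# U+2103 DEGREE CELSIUS decomposes to "°C", then case-folds to "°c"
print(nfkc_cf_approx("\u2103"))  # -> "°c"
# U+1E9E LATIN CAPITAL LETTER SHARP S case-folds to "ss"
print(nfkc_cf_approx("\u1e9e"))  # -> "ss"
```

Both outputs match what ICU's nfkc_cf normalizer produces for these inputs, which is the sense in which the two pipelines should agree.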
The performance improvement was quite dramatic:
1> normalized 400000 chars in 2364ms
1> transformed 400000 chars in 181ms
1> normalized 400000 chars in 49ms
The first of these uses the ICUNormalizer2CharFilter (the performance issue we're discussing); the second is the approach I introduced (ICUTransform2CharFilter); the third is your point of comparison using ICUNormalizer2Filter (a TokenFilter).
Note that the first two should be functionally identical. In contrast, the last ends up normalizing the whole input string at once (whitespace tokenizer over a string with no whitespace) and doesn't track any offsets.
My initial (admittedly fuzzy) hypothesis is that the performance difference is accounted for by the fact that the new ICUTransform2CharFilter uses the Replaceable
ICU abstraction to do offset correction "from within", while ICUNormalizer2CharFilter requires firmer boundaries up front in order to do "external" offset correction?
In any case it's worth noting that for Japanese text, there may be other Transliteration ids that could be especially useful as CharFilters (i.e., pre-Tokenizer) – this was indeed the main motivation for #10015.
Michael Gibney (@magibney) (migrated from JIRA)
In case anyone digs into this: my initial approach was to try to use an NFKC_CF transliterator directly, but there is no normalization transliterator registered for NFKC_CF. I would have expected the ICU code to throw an "Invalid ID" IllegalArgumentException when requesting id NFKC_CF, but it doesn't fail: it hands you what I'm pretty sure is an NFKC instance (no _CF) instead (Transliterator.getInstance(String id) seems to be content with an unambiguous prefix?). This is a shame, because the NormalizationTransliterator ctor is private, so there's no easy way to do a direct comparison.
Apologies for seemingly hijacking this issue, but I thought the #10015 stuff was interesting, and useful at least as a possible alternative or point of comparison.
To turn the focus back to this issue in particular, I wonder whether it would be possible to set an arbitrary threshold at which input could be split more eagerly (as @jimczi initially suggested), but also rely on idempotency, normalizing text with a small amount of overlap (across the split boundary) to avoid introducing functional changes as a result of the arbitrary splitting. Perhaps this is relevant?:
While it is true that none of the normalization forms are closed under string concatenation, an optimized concatenation function can be written to produce a normalized concatenation from normalized strings. This is possible, because at most a few characters in the immediate area of the adjoined strings need processing.
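The quoted point can be illustrated with a quick sketch using Python's stdlib unicodedata (standing in for ICU): two strings that are each NFC-normalized on their own can concatenate into a non-normalized string, but only the characters around the seam need reprocessing.

```python
import unicodedata

left = "e"        # NFC-normalized on its own
right = "\u0301"  # COMBINING ACUTE ACCENT; also NFC-normalized on its own
assert unicodedata.is_normalized("NFC", left)
assert unicodedata.is_normalized("NFC", right)

joined = left + right
# The naive concatenation is NOT in NFC...
assert not unicodedata.is_normalized("NFC", joined)
# ...but renormalizing across the seam recovers it:
print(unicodedata.normalize("NFC", joined))  # -> "é" (U+00E9)
```

This is why normalizing overlapping windows across an arbitrary split boundary could in principle avoid functional changes: only the seam region is affected.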
Michael Gibney (@magibney) (migrated from JIRA)
Upon further investigation I actually don't think the "concatenation" question is a concern. I just opened a PR #199, which altogether avoids the "normalization-inert character" requirement in a way that I think should be safe/robust. I restored the "performance minitest patch" (LUCENE-9177-benchmark-test.patch) to something very like the original posted by Jim, with encouraging results:
1> normalized 400000 chars in 71ms
1> normalized 400000 chars in 64ms
I don't think this is at all detrimental to performance for the common case. If anything I think it might help somewhat by reducing required over-buffering even for the common case (because the extant buffering actually requires normalization-inert characters to be placed arbitrarily at the "128th char" – the last element in the internal buffer).
Michael Gibney (@magibney) (migrated from JIRA)
@jimczi, @rmuir: wondering if either of you have had a chance to look at the associated PR? It's a pretty manageable-sized change, and I think it directly addresses the concern raised in this issue. (fwiw, I beasted the TestICUNormalizer2CharFilter suite several hundred times and encountered no problems).
Robert Muir (@rmuir) (migrated from JIRA)
i haven't had a chance to test it out, I didn't have any plan yet given that the randomized test is completely disabled: https://github.com/apache/lucene/blob/main/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/TestICUNormalizer2CharFilter.java#L226-L227
Michael Gibney (@magibney) (migrated from JIRA)
Interesting; some of the randomized tests were still enabled, but I confess I was relying on existing, enabled tests to catch regressions and had not considered the disabled test you pointed out. That said, I just re-enabled that test locally and am beasting without encountering any problems – neither on current main branch, nor with the patch for LUCENE-9177 applied. I wonder whether #6657 might have been fixed incidentally by some more general fix to CharFilter offset correction?
Robert Muir (@rmuir) (migrated from JIRA)
it may be the case. it shouldn't hold up your change really, sorry i've just been busy. I need to study the issue, but it sounds like the previous implementation did incremental normalization inefficiently, and the PR fixes this? there are more "safepoints" than just inert characters.
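The "more safepoints than inert characters" idea can be sketched in a few lines. This is not the PR's actual Java implementation (which uses ICU's Normalizer2 boundary checks); it's an illustrative Python sketch where combining class 0 serves as a crude stand-in for a "boundary before" test, and `incremental_nfc` is a name I made up:

```python
import unicodedata

def incremental_nfc(chunks):
    """Sketch of incremental normalization: within each buffered chunk,
    only flush text up to the last character that starts a new combining
    sequence (combining class 0 here approximates ICU's hasBoundaryBefore).
    Text after that point may still interact with upcoming input, so it
    stays buffered."""
    buf = ""
    out = []
    for chunk in chunks:
        buf += chunk
        cut = 0  # index of the last "safe" boundary seen so far
        for i, ch in enumerate(buf):
            if unicodedata.combining(ch) == 0:
                cut = i
        out.append(unicodedata.normalize("NFC", buf[:cut]))
        buf = buf[cut:]
    out.append(unicodedata.normalize("NFC", buf))  # flush the tail
    return "".join(out)

# A combining mark arriving in a later chunk still composes correctly:
print(incremental_nfc(["e", "\u0301x"]))  # -> "éx"
```

Note that the degenerate case (input with no boundaries at all) would still keep the buffer growing; the point of the sketch is only that boundaries are far more plentiful than normalization-inert characters.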
Robert Muir (@rmuir) (migrated from JIRA)
I re-enabled this test on top of your branch. I am beasting it (inefficiently: bash script) with this test re-enabled:
#!/usr/bin/env bash
set -ex
while true; do
./gradlew -p lucene/analysis/icu -Dtests.nightly=true -Dtests.multiplier=10 test
done
I'll give it a little time to run. I'm not sure if we should re-enable the test for this issue. Nobody ever debugged to the bottom of why it failed. In the past we have found bugs in ICU with our random tests... an ICU upgrade may have fixed the issue (or dodged it via changes to unicode).
At the same time this component is super-hairy and needs some serious testing :)
Honestly, I didn't understand this charfilter's logic before, but I will give a try to reviewing this PR. For sure, we shouldn't be looking for inert characters.
ASF subversion and git services (migrated from JIRA)
Commit c3482c99ffd9b30acb423e63760ebc7baab9dd26 in lucene's branch refs/heads/main from Michael Gibney https://gitbox.apache.org/repos/asf?p=lucene.git;h=c3482c9
LUCENE-9177: ICUNormalizer2CharFilter streaming no longer depends on presence of normalization-inert characters (#199)
Normalization-inert characters need not be required as boundaries for incremental processing. It is sufficient to check hasBoundaryAfter and hasBoundaryBefore, substantially improving worst-case performance.
Robert Muir (@rmuir) (migrated from JIRA)
Thanks @magibney a lot for taking care of this! I'm backporting this fix to 8.10 due to the performance trap (doing some more testing first).
For the #6657 test, let's discuss that over there. I will open a PR.
ASF subversion and git services (migrated from JIRA)
Commit 4c95d3ef597dd12bbcfa0153f516539fca0a8e69 in lucene-solr's branch refs/heads/branch_8x from Michael Gibney https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4c95d3e
LUCENE-9177: ICUNormalizer2CharFilter streaming no longer depends on presence of normalization-inert characters (#199)
Normalization-inert characters need not be required as boundaries for incremental processing. It is sufficient to check hasBoundaryAfter and hasBoundaryBefore, substantially improving worst-case performance.
Michael Gibney (@magibney) (migrated from JIRA)
Thanks @rmuir!
Timothy Potter (@thelabdude) (migrated from JIRA)
Closing after the 8.10.0 release
ICUNormalizer2CharFilter is fast most of the time, but we've had some reports in Elasticsearch that unrealistic data can slow down the process very significantly. For instance, an input that consists of characters to normalize with no normalization-inert character in between can take up to several seconds to process a few hundred kilobytes on my machine. While the input is not realistic, this worst case can slow down indexing considerably when dealing with uncleaned data.
I attached a small test that reproduces the slow processing using a stream that contains many repetitions of the character ℃ and no normalization-inert character. I am not surprised that the processing is slower than usual, but several seconds seems a lot. Adding a normalization-inert character makes the processing a lot faster, so I wonder if we can improve the process to split the input more eagerly?
Migrated from LUCENE-9177 by Jim Ferenczi (@jimczi), resolved Jul 14 2021
Attachments: lucene.patch, LUCENE-9177_LUCENE-8972.patch, LUCENE-9177-benchmark-test.patch
Pull requests: https://github.com/apache/lucene/pull/199