
CharFilter version of ICUTransformFilter, to better support dictionary-based tokenization [LUCENE-8972] #10015

Open asfimport opened 4 years ago

asfimport commented 4 years ago

The ICU Transliteration API is currently exposed in Lucene only post-tokenizer, via ICUTransformFilter. Some tokenizers (particularly dictionary-based ones) may assume pre-normalized input (e.g., for Chinese characters, there may be an assumption of traditional-only or simplified-only input characters, either across all input or per dictionary-defined token).

The potential usefulness of a CharFilter that exposes the ICU Transliteration API was suggested in a thread on the Solr mailing list, and my hope is that this issue can facilitate more detailed discussion of the proposed addition.
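To make the proposal concrete, here is a minimal placement sketch (not from any patch): the existing post-tokenizer ICUTransformFilter versus a pre-tokenizer CharFilter hooked in via Analyzer#initReader. The ICUTransformCharFilter class name and constructor are hypothetical here, standing in for whatever the new class ends up being:

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.ICUTransformFilter;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import com.ibm.icu.text.Transliterator;

public class TransformPlacementSketch {

  // Current option: transliteration runs after tokenization, so the
  // dictionary-based tokenizer still sees raw, possibly mixed, input.
  static final Analyzer POST_TOKENIZER = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
      Tokenizer tokenizer = new ICUTokenizer();
      TokenStream stream = new ICUTransformFilter(
          tokenizer, Transliterator.getInstance("Traditional-Simplified"));
      return new TokenStreamComponents(tokenizer, stream);
    }
  };

  // Proposed option: transliterate before tokenization, so the tokenizer's
  // dictionary sees normalized input. ICUTransformCharFilter is hypothetical.
  static final Analyzer PRE_TOKENIZER = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
      Tokenizer tokenizer = new ICUTokenizer();
      return new TokenStreamComponents(tokenizer, tokenizer);
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
      return new ICUTransformCharFilter(
          reader, Transliterator.getInstance("Traditional-Simplified"));
    }
  };
}
```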

Concrete examples of simplified, traditional, and mixed traditional/simplified characters that are currently tokenized differently by the ICUTokenizer:

The first two tokens (simplified-only and traditional-only, respectively) are included in the CJ dictionary that backs ICUTokenizer, but the last (a mixture of traditional and simplified characters) is not, and is not recognized as a token. Even if we assume this is an intentional omission from the dictionary, and that the resulting behavior is desirable for some use cases, there are surely other use cases that would benefit from a more permissive dictionary-based tokenization strategy (such as could be supported by pre-tokenizer transliteration).
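Purely for illustration, a small ICU4J sketch of what a pre-tokenizer transliteration step does to such mixed input; the sample string below is hypothetical, not the issue's original example:

```java
import com.ibm.icu.text.Transliterator;

public class MixedScriptDemo {
  public static void main(String[] args) {
    // "Traditional-Simplified" is a stock ICU transform; the input below is a
    // hypothetical mix of a traditional character (發) and a simplified one (现).
    Transliterator t2s = Transliterator.getInstance("Traditional-Simplified");
    String mixed = "發现";
    // Prints the simplified-only form (发现), which a simplified-keyed
    // dictionary could then recognize as a single token.
    System.out.println(t2s.transliterate(mixed));
  }
}
```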


Migrated from LUCENE-8972 by Michael Gibney (@magibney), updated Apr 08 2022

asfimport commented 4 years ago

Michael Gibney (@magibney) (migrated from JIRA)

For consideration, I believe this issue has already been tackled by @cbeer and @mejackreed; the resulting implementation can be found here.

asfimport commented 4 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I agree it's a good idea. A couple of thoughts about the impl you linked to:

asfimport commented 4 years ago

Michael Gibney (@magibney) (migrated from JIRA)

Thanks for the feedback/advice, @rmuir. Along the same lines as what you mention, I think some attention also needs to be paid to the resolution/accuracy of offset correction. I'm going to take a crack at this and hope to have something shortly.
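For anyone following along, the offset-correction machinery in question is BaseCharFilter's cumulative-diff map. A toy example (not the eventual implementation) of a CharFilter that drops characters and records the corrections that correct()/correctOffset() later consult:

```java
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.charfilter.BaseCharFilter;

/**
 * Toy CharFilter that deletes 'x' characters, purely to illustrate the
 * BaseCharFilter offset-correction hook (addOffCorrectMap) that a
 * transliterating CharFilter would also need to drive, at whatever
 * granularity it can achieve.
 */
public final class DropXCharFilter extends BaseCharFilter {
  private int inputPos = 0;   // chars consumed from the wrapped reader
  private int outputPos = 0;  // chars returned to the consumer

  public DropXCharFilter(Reader in) {
    super(in);
  }

  @Override
  public int read(char[] cbuf, int off, int len) throws IOException {
    if (len == 0) {
      return 0;
    }
    int written = 0;
    while (written < len) {
      int c = input.read();
      if (c == -1) {
        break;
      }
      inputPos++;
      if (c == 'x') {
        // Output is now one char shorter than input at this point; record the
        // cumulative difference so correct()/correctOffset() can map the
        // offsets of downstream tokens back onto the original text.
        addOffCorrectMap(outputPos, inputPos - outputPos);
        continue;
      }
      cbuf[off + written++] = (char) c;
      outputPos++;
    }
    return written == 0 ? -1 : written;
  }
}
```

A Tokenizer wrapping this reader gets its start/end offsets remapped via correctOffset(), so highlighting still points at the original text; the finer the granularity of the recorded diffs, the more accurate that mapping, which is exactly the concern for a transliterating filter whose output length can differ arbitrarily from its input.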

asfimport commented 4 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Yes, that would be another thing to address, and a good one for tests. But the whole idea is sound; I think you should be able to make it work!

asfimport commented 4 years ago

Michael Gibney (@magibney) (migrated from JIRA)

I have pushed PR #892, with the proposed new classes, tests, and docs. Initially I've mostly just used modified versions of the tests for ICUTransformFilter* ... (btw, testRandomStrings() is great!).

Most of the code complexity is due to the need to incrementally process one input character at a time in order to keep offset correction as accurate as possible, and to implement "rollback" (following the same approach that the ICU Transliterator code uses internally, but which is not exposed via its public API).
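For context, the incremental ICU API involved looks roughly like the following sketch (plain ICU4J, not the PR code): input is fed a character at a time, Position.start tracks how much output the Transliterator has committed, and anything between start and limit may still change as more input arrives, which is why exact per-character offsets require the "rollback" described below.

```java
import com.ibm.icu.text.ReplaceableString;
import com.ibm.icu.text.Transliterator;

public class IncrementalTransliterateDemo {
  public static void main(String[] args) {
    Transliterator translit = Transliterator.getInstance("Cyrillic-Latin");
    ReplaceableString buffer = new ReplaceableString(new StringBuffer());
    Transliterator.Position pos = new Transliterator.Position();

    for (char c : "Москва".toCharArray()) {
      // Append one character and let the transliterator commit as much output
      // as it safely can; the region between pos.start and pos.limit is the
      // uncommitted tail that may still change as more input arrives.
      translit.transliterate(buffer, pos, String.valueOf(c));
      System.out.println("committed through " + pos.start + ": " + buffer);
    }
    // Flush whatever was held back waiting for more context.
    translit.finishTransliteration(buffer, pos);
    System.out.println("final: " + buffer);
  }
}
```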

The following discusses "rollback" in a little more depth, including some of the performance implications and an idea for future performance improvement:

Regarding "rollback", see comments "To understand the need for rollback" in source code for private method Transliterator#filteredTransliterate(Replaceable, Position, boolean, boolean). CompoundTransliterator's compliance with the extant top-level Transliterator abstraction here induces some serious performance hits (for some not-uncommon cases, like trailing NFC in the "Cyrillic-Latin" transliteration, shifting character blocks around on every incremental character insertion. FWIW, "incremental character insertion and rollback" is essentially how ICU handles this situation in the source code referenced above).

For future consideration (absent a change in the ICU API), I'm thinking it might be possible to reimplement the essence of CompoundTransliterator in external (Lucene) application code, with a separately tracked "position" for each "leaf" Transliterator in the Transliterator tree. This would allow positions that were blocked partway through depth-first traversal of the Transliterator tree to avoid:

  1. being double-processed by (potentially non-idempotent) leading Transliterators, and/or
  2. bypassing trailing Transliterators on account of higher-level filters that block the partially processed character

My sense is that the performance gain could be significant.
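A very rough skeleton of that idea, under the assumption that it could be built on the public Transliterator#getElements() API; all of the actual per-leaf buffering and advancement logic is omitted:

```java
import com.ibm.icu.text.Transliterator;

public class PerLeafPositionSketch {
  public static void main(String[] args) {
    // A compound transform with a trailing NFC step, like the case above.
    Transliterator compound = Transliterator.getInstance("Cyrillic-Latin; NFC");

    // getElements() exposes the compound's components (or the transliterator
    // itself if it is not compound); each leaf would get its own Position.
    Transliterator[] leaves = compound.getElements();
    Transliterator.Position[] positions = new Transliterator.Position[leaves.length];
    for (int i = 0; i < leaves.length; i++) {
      positions[i] = new Transliterator.Position();
      System.out.println("leaf " + i + ": " + leaves[i].getID());
    }

    // The idea: advance each leaf independently via
    // leaves[i].transliterate(buffer, positions[i], insertion), feeding the
    // committed output of leaf i into leaf i + 1, so a character blocked at one
    // stage is neither re-run through earlier stages nor skipped by later ones.
  }
}
```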