Character Mapping [LUCENE-7321]

asfimport commented 8 years ago

One of the challenges in search is recall of an item with a common typing variant. These cases can be as simple as lower/upper case in most languages or accented characters, or they can involve more complex morphological phenomena such as prefix omission or constructing a character with a combining mark. This component addresses the cases that are not covered by the ASCII folding component, or that are more complex to design with other tools. The idea is that a linguist could provide the mappings in a tab-delimited file, which can then be used directly by Solr.

The mappings are maintained in a tab-delimited file, which could simply be copied and pasted from an Excel spreadsheet. This lets linguists create the mappings and developers then include them in the Solr configuration. In a few cases, when the mappings grow complex, some additional debugging may be required. A mapping can take any sequence of characters to any other sequence of characters.
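To make the file format concrete, here is a minimal sketch of reading `source<TAB>target` lines and applying them to a single token. It is not the code from the attached patch: the class name `TabMappingSketch`, the `#` comment convention, and the longest-match-first scan are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the class from LUCENE-7321.patch. */
final class TabMappingSketch {

    /** Parses "source<TAB>target" lines. Blank lines and '#' lines are skipped (an assumed convention). */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> mappings = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            int tab = line.indexOf('\t');
            if (tab <= 0) {
                // No tab, or an empty source sequence: reject rather than guess.
                throw new IllegalArgumentException("Expected source<TAB>target: " + line);
            }
            mappings.put(line.substring(0, tab), line.substring(tab + 1));
        }
        return mappings;
    }

    /** Rewrites a token left to right, preferring the longest matching source at each position. */
    static String apply(Map<String, String> mappings, String token) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < token.length()) {
            String best = null;
            for (String source : mappings.keySet()) {
                if (token.startsWith(source, i) && (best == null || source.length() > best.length())) {
                    best = source;
                }
            }
            if (best == null) {
                out.append(token.charAt(i)); // no mapping starts here; copy the character
                i++;
            } else {
                out.append(mappings.get(best)); // target may be empty, which deletes the source
                i += best.length();
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> mappings = parse(Arrays.asList(
                "# source<TAB>target",
                "\u0451\t\u0435",   // Russian: yo -> ye (common typing substitution)
                "\u0142\tl",        // Polish: l with stroke -> l (transliteration)
                "sch\tsh"));        // multi-character source to multi-character target
        System.out.println(apply(mappings, "\u0451\u043b\u043a\u0430"));
        System.out.println(apply(mappings, "schema"));
    }
}
```

Because both sides of a mapping are plain strings, the same table can express single-character folding, multi-character transliteration, and deletion (an empty target) without any code changes.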

Some of the cases I discuss in the detailed document are handling the voiced vowels for Japanese; common typing substitutions for Korean, Russian, Polish; transliteration for Polish, Arabic; prefix removal for Arabic; suffix folding for Japanese. In the appendix, I give an example of implementing a Russian lightweight stemmer using this component.

Migrated from LUCENE-7321 by Ivan Provalov, 2 votes, updated Jun 03 2021 Attachments: CharacterMappingComponent.pdf, LUCENE-7321.patch

asfimport commented 8 years ago

Ivan Provalov (migrated from JIRA)

Initial patch.

asfimport commented 8 years ago

Ivan Provalov (migrated from JIRA)

Detailed component description.

asfimport commented 8 years ago

Koji Sekiguchi (@kojisekig) (migrated from JIRA)

What is the advantage of this compared to MappingCharFilter?

asfimport commented 8 years ago

Ivan Provalov (migrated from JIRA)

Koji, this one works at the token level, allowing you to do things like prefix/suffix manipulations. The graph generator and collapser also make it user-friendly when dealing with a lot of mappings (please see the attached description file).
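As a rough illustration of why token boundaries matter, the sketch below contrasts an unanchored replacement, which rewrites a sequence wherever it occurs, with rules anchored to the start or end of a token. The `Anchor` enum, the `Rule` class, and the Latin-script example are assumptions for illustration; the patch may encode anchoring quite differently.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of token-anchored rules; not taken from the patch. */
final class AnchoredRuleSketch {

    enum Anchor { PREFIX, SUFFIX }

    static final class Rule {
        final Anchor anchor;
        final String source;
        final String target;

        Rule(Anchor anchor, String source, String target) {
            this.anchor = anchor;
            this.source = source;
            this.target = target;
        }
    }

    /** Applies each rule at most once, in list order, only at its token boundary. */
    static String apply(List<Rule> rules, String token) {
        String result = token;
        for (Rule rule : rules) {
            if (rule.anchor == Anchor.PREFIX && result.startsWith(rule.source)) {
                result = rule.target + result.substring(rule.source.length());
            } else if (rule.anchor == Anchor.SUFFIX && result.endsWith(rule.source)) {
                result = result.substring(0, result.length() - rule.source.length()) + rule.target;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A romanized stand-in for removing a definite-article prefix.
        List<Rule> rules = Arrays.asList(new Rule(Anchor.PREFIX, "al", ""));

        System.out.println(apply(rules, "alkitab"));    // prefix removed at the token start
        System.out.println(apply(rules, "halal"));      // untouched: "al" is not a prefix here
        System.out.println("halal".replace("al", ""));  // unanchored replacement damages the word
    }
}
```

A character-level mapping sees only a stream of characters, so it cannot tell a leading "al" from one in the middle of a word; a token-level rule can.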

asfimport commented 6 years ago

Alexey Ponomarenko (migrated from JIRA)

Hi, is there any plan to integrate it into Lucene/Solr?

asfimport commented 6 years ago

Nick Chervov (migrated from JIRA)

Hi everyone! Is there any chance to get better Russian support in future releases of Solr?

asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

There's a great chance if someone submits a patch and it gets committed. It's only because people step up and volunteer to improve things that language support improves...

asfimport commented 6 years ago

Ivan Provalov (migrated from JIRA)

@erike4711@yahoo.com, any progress on committing this patch?

Thanks,

Ivan

asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Ivan Provalov Ohhh, you would have to skewer me wouldn't you? I have no idea about the merits of this patch, this isn't something I work with.

Does it apply to master? And what does it do?

asfimport commented 6 years ago

Ivan Provalov (migrated from JIRA)

@ErickErickson,

Good questions:

1. I just ran the tests in the patch against master, and they passed.
2. It allows you to configure/modify morphological analysis with externalized mapping files. I attached a description and a reference implementation of the Russian stemmer using this filter.

Thanks,

Ivan

asfimport commented 6 years ago

Alexandre Rafalovitch (@arafalov) (migrated from JIRA)

This feels a little bit like too many use-cases folded into one piece of code. Arabic, Japanese, Korean names special handling, Russian already covered by the stemmer.

I am not sure what the clean use-case is here, especially with, say, PatternReplaceCharFilterFactory being there to cover possible special use-case gaps (at lower performance, perhaps), and with ICU4J possibly covering others.

asfimport commented 6 years ago

Ivan Provalov (migrated from JIRA)

@arafalov, the clean use case for this filter is to externalize the morphological modification rules. Most stemmers have hard-coded rules. With this one, the rules are expressed in flat mapping files and configurations. Originally, it was developed to cover a few cases for the languages listed here and a few other languages, as well as to visualize these rules, which helps the linguists involved in the project understand the modification rules in more complex scenarios. I added the Russian stemmer implementation as a general reference, just to show how one can configure an entire stemmer without hard-coded rules. We have not seen any performance issues with this so far. Hope this helps.
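To sketch what "no hard-coded rules" can look like in practice, the example below drives a very light suffix stripper entirely from tab-delimited text supplied at runtime. It is a simplified stand-in, not the Russian stemmer from the attachment: the class name, the `minStemLength` guard, the longest-suffix choice, and the three sample suffixes are assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical data-driven suffix stripper; not the stemmer attached to this issue. */
final class ExternalSuffixRulesSketch {

    /** Reads "suffix<TAB>replacement" lines; lines without a leading suffix are skipped. */
    static Map<String, String> load(Reader source) {
        Map<String, String> rules = new LinkedHashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab > 0) {
                    rules.put(line.substring(0, tab), line.substring(tab + 1));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rules;
    }

    /** Replaces the longest matching suffix, provided at least minStemLength characters remain. */
    static String stem(Map<String, String> rules, String token, int minStemLength) {
        String best = null;
        for (String suffix : rules.keySet()) {
            boolean longer = best == null || suffix.length() > best.length();
            if (longer && token.endsWith(suffix) && token.length() - suffix.length() >= minStemLength) {
                best = suffix;
            }
        }
        if (best == null) {
            return token;
        }
        return token.substring(0, token.length() - best.length()) + rules.get(best);
    }

    public static void main(String[] args) {
        // In a real setup this text would live in an external file maintained by a linguist.
        String ruleText =
                "\u0430\u043c\u0438\t\n"   // -ami -> (removed)
              + "\u0438\t\n"               // -i   -> (removed)
              + "\u0430\t\n";              // -a   -> (removed)
        Map<String, String> rules = load(new StringReader(ruleText));

        System.out.println(stem(rules, "\u043a\u043d\u0438\u0433\u0430\u043c\u0438", 3)); // knigami
        System.out.println(stem(rules, "\u043a\u043d\u0438\u0433\u0430", 3));             // kniga
    }
}
```

Changing the stemmer's behavior then means editing the rule text rather than recompiling, which is the trade-off being argued for here.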

asfimport commented 3 years ago

Marcus Eagan (@marcussorealheis) (migrated from JIRA)

Hi Ivan Provalov, I'm curious whether you have been maintaining this patch through version 8 for your company. If so, do you want to revive this discussion?

asfimport commented 3 years ago

Ivan Provalov (migrated from JIRA)

@marcussorealheis, I have been maintaining it (bug fixes, etc.), but I have not upgraded it to version 8 yet. I could do that if there is any interest in integrating it.

apache / lucene

Character Mapping [LUCENE-7321] #8375