TheophileDiot opened 3 years ago
Hello @TheophileDiot, sorry for the slow response on this - I didn't realise you guys were actually offering to build this :) I don't have very much experience with using phonetic algorithms in practice, so excuse me if some of the thoughts below are a bit ignorant:
Since we run `query_text` queries in Aleph, the best place to do the transform would be in Elasticsearch's own processing pipeline.
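As a sketch of what that could look like (assuming the optional `analysis-phonetic` Elasticsearch plugin is installed; the index, analyzer, and field names below are made up for illustration, not Aleph's actual ones), the index settings could define a custom analyzer with a phonetic token filter:

```python
# Hypothetical Elasticsearch index settings for a phonetic name field.
# Assumes the optional analysis-phonetic plugin is installed; the analyzer
# and field names here are illustrative, not Aleph's real mapping.
PHONETIC_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "names_phonetic": {
                    "type": "phonetic",      # provided by analysis-phonetic
                    "encoder": "metaphone",  # or "nysiis", "soundex", ...
                    "replace": False,        # keep the original tokens too
                }
            },
            "analyzer": {
                "phonetic_name_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "names_phonetic"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "phonetic_names": {
                "type": "text",
                "analyzer": "phonetic_name_analyzer",
            }
        }
    },
}
```

Because Elasticsearch applies the same analyzer at query time by default, a `match` query against such a field would then compare phonetic codes rather than raw spellings.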
We do something similar already with `fingerprints` (it's a super harsh name normalisation algo). You can find the relevant code here and the query re-write here. The `fingerprints` field is then also analysed with a custom analyzer. I imagine the phonetics could be done similarly.

I'm really curious what kind of impact using phonetic filters is going to have on subjective result quality. It could be a cool way to really broaden out a query when the stricter forms don't return results...
Hi @pudo,

Thanks for those great insights!
1. Using an ES plugin for phonetic tokenization could be a great solution, as we can expect it to be faster than any Python implementation. Would requiring an optional ES plugin be acceptable?
2. Indeed, using a transliteration filter prior to applying the phonetic transformation is the way to go. I haven't dived into the code deeply, but the ICU transformation looks like it is done via ES as well. Am I right?
3. Finally, I don't get the purpose of `fingerprints`. Could you clarify?

Assuming I'm not mistaken regarding points 1 and 2, I guess we could start to play "manually" with them to get acquainted, and then do some informal testing to assess subjective result quality.
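For that "manual" experimentation, one low-effort option might be Elasticsearch's `_analyze` endpoint, which accepts an ad-hoc filter definition without creating an index first. A sketch of such a request payload (again assuming the phonetic plugin is installed; the encoder choice and sample text are arbitrary):

```python
import json

# Ad-hoc payload for POST /_analyze: tokenize some names and run them
# through a phonetic filter defined inline, no index required.
# Assumes the optional analysis-phonetic plugin is installed.
analyze_request = {
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        {"type": "phonetic", "encoder": "metaphone", "replace": True},
    ],
    "text": "Vladimir Vladmir",
}

print(json.dumps(analyze_request, indent=2))
```

Sending this body to `POST /_analyze` returns the emitted tokens, so differently spelled names can be compared by eye before committing to a mapping.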
hey @ang-st! I'll take your points in order:

1. Yes please, see the ICU plugin as a reference.
2. Correct (we also do ICU client-side, but it's really desirable to run queries with the same implementation that was used to index).
So the idea of `fingerprints` is that you can have two records for a company, let's say `Banana Republic Aktiengesellschaft` and `Banana Republic (Panama) AG`, that look very different, but there are conventions for rewriting them. After running the `fingerprints.generate()` method on these two names, they would both be `ag banana republic`. Now, we want to index these normalised forms, but not mix them with the unmodified versions. So we store them in an extra field called `fingerprints`, which gets queried instead of the normal names list whenever we want to do a name-based search that's biased for recall. I merely mentioned this because it's a very, very similar idea to the phonetics. I even imagine you could take the values in `fingerprints` and just add them to a second field that gets analysed accordingly.
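To make the rewrite concrete, here is a toy imitation of the idea described above. This is not the actual `fingerprints` library; the bracket-stripping step and the tiny synonym table are assumptions made for the example:

```python
import re

# Toy sketch of the fingerprints idea: strip bracketed text, lowercase,
# replace known company-type words with a short form, then sort the
# remaining unique tokens. The synonym table is a tiny stand-in, not the
# library's real data.
LEGAL_FORMS = {"aktiengesellschaft": "ag", "limited": "ltd"}

def toy_fingerprint(name: str) -> str:
    name = re.sub(r"\(.*?\)", " ", name.lower())      # drop "(Panama)" etc.
    tokens = re.findall(r"[a-z0-9]+", name)
    tokens = [LEGAL_FORMS.get(t, t) for t in tokens]  # normalise legal forms
    return " ".join(sorted(set(tokens)))

print(toy_fingerprint("Banana Republic Aktiengesellschaft"))  # ag banana republic
print(toy_fingerprint("Banana Republic (Panama) AG"))         # ag banana republic
```

Both spellings collapse to the same normalised string, which is exactly the property that makes the extra recall-biased field useful.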
@ang-st
Why use phonetic search algorithms?
Spelling errors are generally grouped into two classes: typographic and cognitive. Cognitive errors occur when the writer does not know how to spell a word; in these cases, the misspelling often has the same pronunciation as the correct word (for example, writing Vladimir as Vladmir). Typographic errors are mostly keyboard-related, e.g. substitution or transposition of two letters because their keys are close together. Phonetic algorithms are used to reduce cognitive errors.
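As a concrete illustration of how a phonetic algorithm collapses such cognitive misspellings, here is a small self-contained sketch of the standard Soundex algorithm (nothing Aleph-specific):

```python
def soundex(name: str) -> str:
    """Standard American Soundex: first letter plus three digits."""
    codes = {}
    for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in letters:
            codes[ch] = digit
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    encoded = [codes.get(ch, "") for ch in name]  # vowels, H, W, Y -> ""
    result = name[0]
    prev = encoded[0]
    for ch, code in zip(name[1:], encoded[1:]):
        if code and code != prev:   # skip repeats of the same digit
            result += code
        if ch not in "HW":          # H and W are transparent separators
            prev = code
    return (result + "000")[:4]

print(soundex("Vladimir"), soundex("Vladmir"))  # V435 V435
```

The correct spelling and the cognitive misspelling map to the same code, so a search on the encoded field matches both.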
With @ang-st, we analysed the current search method used by Aleph and found that it can be improved.

Phonetic algorithms would let users broaden the search output when there is a typo in the name of the person being searched for, or when the user doesn't know how to spell the name correctly.

These features could be added to the advanced search via checkboxes, a new field, or some other mechanism.

Examples: Soundex, NYSIIS, Metaphone, etc.