TheophileDiot opened 3 years ago
Hello @TheophileDiot, sorry for the slow response on this - I didn't realise you guys were actually offering to build this :) I don't have very much experience with using phonetic algorithms in practice, so excuse me if some of the thoughts below are a bit ignorant:
Since we run `query_text` queries in Aleph, the best place to do the transform would be in Elasticsearch's own processing pipeline.
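As a sketch of what that could look like (assuming the optional `analysis-phonetic` Elasticsearch plugin is installed; the index, analyzer, and field names below are made up for illustration, not Aleph's actual ones), the index settings could define a custom analyzer with a phonetic token filter:

```python
# Hypothetical Elasticsearch index settings for a phonetic name field.
# Assumes the optional analysis-phonetic plugin is installed; the analyzer
# and field names here are illustrative, not Aleph's real mapping.
PHONETIC_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "names_phonetic": {
                    "type": "phonetic",      # provided by analysis-phonetic
                    "encoder": "metaphone",  # or "nysiis", "soundex", ...
                    "replace": False,        # keep the original tokens too
                }
            },
            "analyzer": {
                "phonetic_name_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "names_phonetic"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "phonetic_names": {
                "type": "text",
                "analyzer": "phonetic_name_analyzer",
            }
        }
    },
}
```

Because Elasticsearch applies the same analyzer at query time by default, a `match` query against such a field would then compare phonetic codes rather than raw spellings.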
We do something similar already with `fingerprints` (it's a super harsh name normalisation algo). You can find the relevant code here and the query re-write here. The `fingerprints` field is then also analysed with a custom analyzer. I imagine the phonetics could be done similarly.

I'm really curious what kind of impact using phonetic filters is going to have on subjective result quality. It could be a cool way to really broaden out a query when the stricter forms don't return results...
Hi @pudo,

Thanks for those great insights!
1. Using an ES plugin for phonetic tokenization could be a great solution, as we can expect it to be faster than any Python implementation. Would requiring an optional ES plugin be acceptable?
2. Indeed, using a transliteration filter prior to applying the phonetic transformation is the way to go. I haven't dived into the code deeply, but the ICU transformation looks like it is done via ES as well. Am I right?
3. Finally, I don't get the purpose of `fingerprints`. Could you clarify?

Assuming I'm not mistaken regarding points 1 and 2, I guess we could start to play "manually" with them to get acquainted, and then do some informal testing to assess subjective result quality.
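For that "manual" experimentation, one low-effort option might be Elasticsearch's `_analyze` endpoint, which accepts an ad-hoc filter definition without creating an index first. A sketch of such a request payload (again assuming the phonetic plugin is installed; the encoder choice and sample text are arbitrary):

```python
import json

# Ad-hoc payload for POST /_analyze: tokenize some names and run them
# through a phonetic filter defined inline, no index required.
# Assumes the optional analysis-phonetic plugin is installed.
analyze_request = {
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        {"type": "phonetic", "encoder": "metaphone", "replace": True},
    ],
    "text": "Vladimir Vladmir",
}

print(json.dumps(analyze_request, indent=2))
```

Sending this body to `POST /_analyze` returns the emitted tokens, so differently spelled names can be compared by eye before committing to a mapping.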
hey @ang-st! I'll take your points in order:

1. Yes please, see the ICU plugin as a reference.
2. Correct (we also do ICU client-side, but it's really desirable to run queries with the same implementation that was used to index).
So the idea of `fingerprints` is that you can have two records for a company, let's say `Banana Republic Aktiengesellschaft` and `Banana Republic (Panama) AG`, that look very different, but there are conventions for rewriting them. After running the `fingerprints.generate()` method on these two names, they would both be `ag banana republic`. Now, we want to index these normalised forms, but not mix them with the unmodified versions. So we store them in an extra field called `fingerprints`, which gets queried instead of the normal names list whenever we want to do a name-based search that's biased for recall. I merely mentioned this because it's a very, very similar idea to the phonetics. I even imagine you could take the values in `fingerprints` and just add them to a second field that gets analysed accordingly.
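To make the rewrite concrete, here is a toy imitation of the idea described above. This is not the actual `fingerprints` library; the bracket-stripping step and the tiny synonym table are assumptions made for the example:

```python
import re

# Toy sketch of the fingerprints idea: strip bracketed text, lowercase,
# replace known company-type words with a short form, then sort the
# remaining unique tokens. The synonym table is a tiny stand-in, not the
# library's real data.
LEGAL_FORMS = {"aktiengesellschaft": "ag", "limited": "ltd"}

def toy_fingerprint(name: str) -> str:
    name = re.sub(r"\(.*?\)", " ", name.lower())      # drop "(Panama)" etc.
    tokens = re.findall(r"[a-z0-9]+", name)
    tokens = [LEGAL_FORMS.get(t, t) for t in tokens]  # normalise legal forms
    return " ".join(sorted(set(tokens)))

print(toy_fingerprint("Banana Republic Aktiengesellschaft"))  # ag banana republic
print(toy_fingerprint("Banana Republic (Panama) AG"))         # ag banana republic
```

Both spellings collapse to the same normalised string, which is exactly the property that makes the extra recall-biased field useful.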
@ang-st
Why use phonetic search algorithms?
Spelling errors are generally grouped into two classes: typographic and cognitive. Cognitive errors occur when the writer does not know how to spell a word; in these cases, the misspelling often has the same pronunciation as the correct word (for example, writing Vladimir as Vladmir). Typographic errors are mostly keyboard-related, e.g. substitution or transposition of two letters because their keys are close together. Phonetic algorithms are used to reduce cognitive errors.
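As a concrete illustration of how a phonetic algorithm collapses such cognitive misspellings, here is a small self-contained sketch of the standard Soundex algorithm (nothing Aleph-specific):

```python
def soundex(name: str) -> str:
    """Standard American Soundex: first letter plus three digits."""
    codes = {}
    for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in letters:
            codes[ch] = digit
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    encoded = [codes.get(ch, "") for ch in name]  # vowels, H, W, Y -> ""
    result = name[0]
    prev = encoded[0]
    for ch, code in zip(name[1:], encoded[1:]):
        if code and code != prev:   # skip repeats of the same digit
            result += code
        if ch not in "HW":          # H and W are transparent separators
            prev = code
    return (result + "000")[:4]

print(soundex("Vladimir"), soundex("Vladmir"))  # V435 V435
```

The correct spelling and the cognitive misspelling map to the same code, so a search on the encoded field matches both.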
With @ang-st, we analysed the current search method used by Aleph and found that it can be improved.

Phonetic algorithms would let users broaden the search output when there is a typo in the name of the person being searched for, or when the user doesn't know how to spell the name correctly.

These features could be added to the advanced search via checkboxes, a new field, or some other mechanism.

Examples: Soundex, NYSIIS, Metaphone, etc.