alephdata / aleph

Search and browse documents and data; find the people and companies you look for.
http://docs.aleph.occrp.org
MIT License
2.04k stars 272 forks

Adding phonetic search algorithms to find Persons & LegalEntities #1603

Open TheophileDiot opened 3 years ago

TheophileDiot commented 3 years ago

Why use phonetic search algorithms?

Spelling errors generally fall into two classes: typographic and cognitive. Cognitive errors occur when the writer does not know how to spell a word; the misspelling then often has the same pronunciation as the correct word (for example, writing Vladimir as Vladmir). Typographic errors are mostly keyboard-related, e.g. substituting or transposing two letters because their keys are close together. Phonetic algorithms are used to reduce the impact of cognitive errors.

Together with @ang-st, we analyzed the search method Aleph currently uses and found that it can be improved.

Phonetic algorithms would let users broaden the search results when the name they type contains a typo, or when they don't know how to spell the person's name correctly.

These features could be added to the advanced search as a checkbox, a new field, or something else.

Examples: Soundex, NYSIIS, Metaphone, etc.
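
To make the idea concrete, here is a minimal sketch of classic American Soundex in plain Python. It is only an illustration, not Aleph's code: the function name `soundex` and the table `CODES` are made up for this example, and it only handles ASCII letters, so names in other scripts would need transliteration first.

```python
# Illustrative sketch of American Soundex (not taken from Aleph).
# Consonants are grouped by similar sound and mapped to one digit.
CODES = {}
for letters, digit in [
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
]:
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex key for an ASCII name."""
    # Keep ASCII letters only; anything else is dropped in this sketch.
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    first = letters[0]
    result = first.upper()
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            # h and w are skipped and do not separate equal codes.
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        # Vowels have no code, so they reset `prev` and allow repeats.
        prev = code
        if len(result) == 4:
            break
    # Pad short keys with zeros to the fixed length of 4.
    return result.ljust(4, "0")


if __name__ == "__main__":
    for name in ("Vladimir", "Vladmir", "Robert", "Rupert"):
        print(name, soundex(name))
```

With this sketch, Vladimir and Vladmir get the same key, so a search on one spelling could also return entities indexed under the other. Soundex was designed around English pronunciation, so how well it behaves on the names found in Aleph is something that would need testing.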

pudo commented 3 years ago

Hello @TheophileDiot, sorry for the slow response on this - I didn't realise you guys were actually offering to build this :) I don't have very much experience with using phonetic algorithms in practice, so excuse me if some of the thoughts below are a bit ignorant:

I'm really curious what kind of impact using phonetic filters is going to have on subjective result quality. It could be a cool way to really broaden out a query when the stricter forms don't return results....

ang-st commented 3 years ago

Hi @pudo,

Thanks for those great insights!

Assuming I'm not mistaken regarding points 1 and 2, I guess we could start by playing with them "manually" to get acquainted, and then do some informal testing to assess subjective result quality.

pudo commented 3 years ago

hey @ang-st! I'll take your points in order:

ghost commented 3 years ago

@ang-st