TechForPalestine / palestine-datasets

The human toll of Israel's ongoing genocide in names & numbers. Use the data from our APIs to tell their story.
https://data.techforpalestine.org

Phonetic search for names #107

Closed: jkoshy closed this issue 4 months ago

jkoshy commented 6 months ago

The ability to search for names phonetically (e.g., with Soundex) would make these databases much more accessible to non-Arabic speakers.

I found a mapping of Arabic characters to Soundex codes here: https://www.codeproject.com/Articles/26880/Arabic-Soundex/ Could this perhaps be used as a starting point?

sterlingwes commented 6 months ago

I'm not familiar with Soundex. Are you suggesting someone should be able to say the name into their microphone and have it pull up the closest Arabic match?

sterlingwes commented 6 months ago

If I understand the problem, I think the simplest approach here would be to come up with a dictionary of similar name spellings and have the existing translated English name search tool treat them as aliases.

jkoshy commented 6 months ago

@sterlingwes Phonetic ("fuzzy") searches are text based; there is no audio involved.

Many open-source databases today offer phonetic/fuzzy text searches (e.g., PostgreSQL), although these are designed for Latin text.
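To illustrate that a phonetic search operates purely on text, here is a minimal sketch of classic American Soundex for Latin-script names. It is not the Arabic mapping from the linked article, and the function name `soundex` is only illustrative. Spellings that sound alike collapse to the same four-character code, which is what a search would compare.

```python
# Letter-to-digit table for classic American Soundex (Latin letters only).
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name: str) -> str:
    """Return the 4-character American Soundex code for a Latin-script name."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        # Skip uncoded letters and repeats of the previous code.
        if digit and digit != previous:
            code += digit
        # H and W do not separate letters with the same code; vowels do.
        if ch not in "HW":
            previous = digit
    # Pad with zeros and truncate to one letter plus three digits.
    return (code + "000")[:4]


for spelling in ("Soliman", "Suleiman", "Sleiman"):
    print(spelling, soundex(spelling))  # each prints the code S455
```

A search built on this would index the code for every name and compare it with the code of the query, so the three spellings above would all retrieve the same record.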

sterlingwes commented 6 months ago

Got it, thanks for clarifying. I still think capturing alternate spellings or including the transliterated variant in the existing frontend fuzzy search would be a quicker way to see an improvement here. Maybe not as robust as what you propose but I don't have time to implement a whole backend search feature.

I'll leave this issue open in case anyone is interested in digging into it on their own time.

Our site deployment is entirely static, so we'd need to do work on the deployment side to allow for a search backend that can operate on the names list. Before we even get there, though, we should test a prototype of this kind of search against a few test cases to see if it improves search. We'll need an Arabic speaker to help corroborate that the new search is substantially better before putting in the work of deploying & maintaining this new piece of infra.

If someone were interested in improving search by pursuing the scrappier method I noted above, we'd need a way to look up a list of alternate spellings for a given name translation. For example, this martyr I'm told preferred to go by Soliman, not Suleiman, but another spelling is also Sleiman. Again, we'd need an Arabic speaker to advise on the best approach. An automated alternative could be to bring back the library we used previously, arabic-to-en, not for surfacing any names but to supply another token for the fuzzy search. It has a very basic transliteration mapping fallback that might approximate phonetics (but definitely needs testing).
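As a rough sketch of the alias lookup described here, the snippet below maps every known spelling to the canonical English name it belongs to. The `ALIASES` table and both function names are hypothetical; the only entries are the three spellings mentioned in this comment, and a real list would need review by an Arabic speaker.

```python
from collections import defaultdict

# Hypothetical alias table: canonical English name -> alternate spellings.
# The single entry comes from the example in this thread.
ALIASES = {
    "Suleiman": ["Soliman", "Sleiman"],
}


def build_alias_index(aliases: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map every case-folded spelling to the canonical names it may refer to."""
    index: dict[str, set[str]] = defaultdict(set)
    for canonical, alternates in aliases.items():
        for spelling in [canonical, *alternates]:
            index[spelling.casefold()].add(canonical)
    return index


def canonical_names(query: str, index: dict[str, set[str]]) -> set[str]:
    """Return the canonical names whose known spellings equal the query."""
    return set(index.get(query.strip().casefold(), set()))


index = build_alias_index(ALIASES)
print(canonical_names("sleiman", index))  # {'Suleiman'}
```

This lookup is exact-match only, so it would complement a fuzzy search rather than replace it.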

jkoshy commented 6 months ago

The problem with using alternate lists of Latin spellings is that the number of transliterations needed increases combinatorially.

A back-of-the-envelope computation: sites on the 'net with lists of baby names list about a thousand popular Arabic first names. Each name seems to have 3 to 4 syllables, and each syllable could be encoded in Latin script in 2 to 3 ways: e.g., "Zuleikha" could use "Zu" or "Su"; "lei", "lay", "ley", or "li"; "kha" or "ka"; and so on.

So, just for popular first names, we would have to maintain:

1000 names ✕ 3.5 syllables on average ✕ 2.5 encoding choices per syllable on average ≅ 8750 alternate Latin spellings.

Even with the 8.7✕ increase in maintenance effort, such a list would still be incomplete - it would not cover surnames or the less popular first names.
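To make the growth concrete, this small sketch enumerates the spellings of "Zuleikha" using only the per-syllable options listed above. The function name `spellings` is illustrative. For a single name the count is the product of the choices per syllable, so how the total is tallied across a whole list of names depends on how those choices are counted.

```python
from itertools import product

# Per-syllable Latin encodings for "Zuleikha", taken from the example above.
SYLLABLE_OPTIONS = [
    ["Zu", "Su"],
    ["lei", "lay", "ley", "li"],
    ["kha", "ka"],
]


def spellings(options: list[list[str]]) -> list[str]:
    """Enumerate every spelling formed by picking one encoding per syllable."""
    return ["".join(parts) for parts in product(*options)]


variants = spellings(SYLLABLE_OPTIONS)
print(len(variants))  # 2 * 4 * 2 = 16
print(variants[:3])   # ['Zuleikha', 'Zuleika', 'Zulaykha']
```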

sterlingwes commented 6 months ago

I'm not sure I follow - the library I linked would do transliteration for the original Arabic name, which is a separate tactic from the "alternate spellings" approach. You wouldn't have to transliterate the alternate Latin spellings; they're just another alias for the existing translated English name. If your concern is that someone might not be able to type the alternate spellings properly, I think the existing fuzzy search algo we're using on the frontend will accommodate that reasonably well. Again, these are all just broad strategies; we'd need some prototyping and concrete scenarios to test against.
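One way to prototype this idea is to treat each alias as an extra search token and let an approximate matcher absorb typos. The thread does not name the frontend's fuzzy search algorithm, so the sketch below uses Python's standard `difflib` as a stand-in; the `RECORDS` data, the `fuzzy_search` name, and the `cutoff` value are all assumptions.

```python
from difflib import get_close_matches

# Hypothetical records: a display name plus extra search tokens (aliases).
# Only the display name is ever returned to the user.
RECORDS = [
    {"name": "Suleiman", "tokens": ["Soliman", "Sleiman"]},
    {"name": "Zuleikha", "tokens": ["Zuleika"]},
]


def fuzzy_search(query: str, records: list[dict], cutoff: float = 0.75) -> list[str]:
    """Return display names whose name or alias tokens roughly match the query."""
    token_to_name = {}
    for record in records:
        for token in [record["name"], *record["tokens"]]:
            token_to_name[token.casefold()] = record["name"]
    hits = get_close_matches(query.casefold(), list(token_to_name), n=5, cutoff=cutoff)
    # Preserve match order while dropping duplicate display names.
    return list(dict.fromkeys(token_to_name[hit] for hit in hits))


print(fuzzy_search("Sulieman", RECORDS))  # ['Suleiman']
```

Here the misspelled query "Sulieman" still resolves to the single display name, whichever alias token it happens to match.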

1palestine commented 4 months ago

Could utilize Elasticsearch/OpenSearch. It offers great tokenizers for stuff like this.

https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-phonetic.html

It is what I would go with if you want to have a perfect search. Fuzzy matches, weights, phonetics, synonyms for names, it offers a whole lot. (Example list for names: https://gist.github.com/harlow/68ab803ba044c99190ddb4b7a1ecce72)

Alternatively, you can also just use Lucene Index which may be more cost-effective. (https://stackoverflow.com/questions/38599692/how-to-implement-a-phonetic-search-using-lucene)

Implementing the backend would be a little bit of work, but not impossible. The only concern I'd have with the search is that it could be DDoS'd. Because this site is static, I do not believe it has any mitigation for DDoS attacks in place. That is something we'd also want to consider if we set up a backend for searches.

Also for this bit:

"for example this martyr I'm told preferred to go by Soliman, not Suleiman, but another spelling is also Sleiman."

Elasticsearch can handle stuff like this super nicely
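For reference, a sketch of what index settings for this might look like, written as a Python dict that would be serialized to JSON for the create-index request. It assumes the analysis-phonetic plugin is installed. The filter and analyzer names and the `en_name` field are hypothetical, and the choice of `double_metaphone` is only one of the encoders the plugin supports.

```python
import json

# Sketch of index settings combining a synonym filter with the phonetic
# token filter from the analysis-phonetic plugin. All names are hypothetical.
INDEX_BODY = {
    "settings": {
        "analysis": {
            "filter": {
                "name_synonyms": {
                    "type": "synonym",
                    # Spellings from the quoted example, treated as equivalent.
                    "synonyms": ["soliman, suleiman, sleiman"],
                },
                "name_phonetic": {
                    "type": "phonetic",
                    "encoder": "double_metaphone",
                    # Keep the original token alongside its phonetic code.
                    "replace": False,
                },
            },
            "analyzer": {
                "name_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "name_synonyms", "name_phonetic"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "en_name": {"type": "text", "analyzer": "name_analyzer"}
        }
    },
}

# Example query body against the hypothetical field.
SEARCH_BODY = {"query": {"match": {"en_name": "Soliman"}}}

print(json.dumps(INDEX_BODY, indent=2))
```

Whether the synonym list or the phonetic encoder does most of the work would need testing against real queries.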

1palestine commented 4 months ago

But yeah, phonetic searches cannot be done statically. It would require a backend that allows these various datasets to work together. A raw HTTP request for JSON won't cut it.

sterlingwes commented 4 months ago

Thanks @1palestine, that's good to know! I'm not sure there's a big need for this yet, but if we get that feedback I'd certainly be open to exploring more complicated infra to achieve improvements here.

sterlingwes commented 4 months ago

After adding some tracking to the site, I recently confirmed that the current search tool isn't used much (clicks amount to <0.45% of impressions on the biggest day). This is with a week's worth of tracking, so I'll keep an eye on it and reopen if it seems like something we should revisit & pitch to the org for funding / hosting.