WorldHistoricalGazetteer / whg3

Version 3 beta
BSD 3-Clause "New" or "Revised" License

Phonetic Toponym Matching #299

Open docuracy opened 4 months ago

docuracy commented 4 months ago

Tomer Sagi et al: Utilizing Phonetic Similarity for Cross-source and Cross-language Toponym Matching - a Benchmark and Prototype (Preprint)

docuracy commented 2 months ago

We might use a Dockerised instance of Phonetisaurus to generate phonemes for our existing (and any new) Place records, store them in an additional database table, and add them to our Elastic index. On-the-fly searching would then use the same Docker instance to convert the query to phonemes, relying on Elastic's similarity matching for phonemes, weighted together with the other facets that we already commonly use, such as geographic distance.
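As a rough illustration of the search side of this idea, the sketch below builds an Elasticsearch query body that combines a phoneme match with a geographic distance decay. It is only a sketch under assumptions: the field names `phonemes`, `title`, and `location`, the boost value, and the `50km` scale are hypothetical, and `g2p` is a placeholder for whatever call ends up wrapping the Phonetisaurus container (its real interface is not shown here).

```python
from typing import Callable, Dict, List


def build_phonetic_query(
    name: str,
    lat: float,
    lon: float,
    g2p: Callable[[str], List[str]],
    scale: str = "50km",
) -> Dict:
    """Build an Elasticsearch query body (a plain dict, nothing is sent).

    `g2p` is a hypothetical grapheme-to-phoneme function; in the proposal
    above it would call the Dockerised Phonetisaurus instance.
    """
    # Phonemes are assumed to be indexed as a space-separated string.
    phonemes = " ".join(g2p(name))
    return {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        "should": [
                            # Phoneme match on a hypothetical `phonemes` field.
                            {"match": {"phonemes": {"query": phonemes, "boost": 2.0}}},
                            # Fuzzy match on the written name as a fallback.
                            {"match": {"title": {"query": name, "fuzziness": "AUTO"}}},
                        ],
                        "minimum_should_match": 1,
                    }
                },
                # Scores decay with distance from the query point.
                "functions": [
                    {
                        "gauss": {
                            "location": {
                                "origin": {"lat": lat, "lon": lon},
                                "scale": scale,
                            }
                        }
                    }
                ],
                "boost_mode": "multiply",
            }
        }
    }


def toy_g2p(text: str) -> List[str]:
    """Stand-in for a real G2P model: one 'phoneme' per letter."""
    return [ch for ch in text.lower() if ch.isalpha()]
```

Calling `build_phonetic_query("Lviv", 49.84, 24.03, g2p=toy_g2p)` returns a dict that could be passed as the body of a search request; the weighting between the phoneme match, the name match, and the distance decay would need tuning against real data.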

docuracy commented 2 months ago

Tomer Sagi: That could actually work, although if you are working on a database-based approach, I would recommend considering a vector database and storing the mBERT embedding of the name as another way to generate candidates. In general, you can separate the candidate search from the matching process: collect candidates from phoneme similarity, exact matches, transliterations, etc., and then use a matching system to rank the candidates and throw out obvious non-matches (e.g., places that are 1,000 km from each other).
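The two-stage split described above could be sketched roughly as follows. This is a minimal illustration, not a proposed implementation: the `Place` type, the generator interface, and the 1,000 km default cutoff are assumptions made for the example, and `difflib.SequenceMatcher` is only a stand-in for a real matching system (phonetic or embedding-based).

```python
import math
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Place:
    id: str
    name: str
    lat: float
    lon: float


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres on a spherical Earth."""
    r = 6371.0088  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def collect_candidates(
    query: Place,
    generators: Iterable[Callable[[Place], Iterable[Place]]],
) -> List[Place]:
    """Stage 1: union the output of every candidate generator, de-duplicated by id."""
    seen: Dict[str, Place] = {}
    for generate in generators:
        for place in generate(query):
            seen.setdefault(place.id, place)
    return list(seen.values())


def name_score(a: str, b: str) -> float:
    """Stand-in matcher: plain string similarity between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_candidates(
    query: Place,
    candidates: Iterable[Place],
    max_km: float = 1000.0,
) -> List[Place]:
    """Stage 2: drop candidates farther than max_km, then sort best-first."""
    nearby = [
        c for c in candidates
        if haversine_km(query.lat, query.lon, c.lat, c.lon) <= max_km
    ]
    return sorted(nearby, key=lambda c: name_score(query.name, c.name), reverse=True)
```

Each generator (exact match, phoneme similarity, transliteration, embedding lookup) only needs to return an iterable of `Place` objects, so new candidate sources can be added without touching the ranking step.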