Open docuracy opened 4 months ago
We might use a Dockerised instance of Phonetisaurus to generate phonemes for our existing (and any new) Place records, store them in an additional database table, and add them to our Elastic index. On-the-fly searching would then use the same Docker instance to convert the query to phonemes, and rely on Elastic's similarity matching on those phonemes, weighted together with the other facets we already commonly use, such as geographic distance.
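To make the search side concrete, here is a rough sketch of what such a query body might look like, built as a plain Python dict. It assumes the index has a text field `phonemes` (space-separated phoneme symbols) and a `geo_point` field `location`; both field names, the `50km` scale, and the helper `build_place_query` are hypothetical and not taken from our current mapping. It uses Elasticsearch's `function_score` query with a `gauss` decay so that the phoneme match score is scaled down as distance from the origin grows.

```python
def build_place_query(query_phonemes, lat, lon, scale="50km"):
    """Build an Elasticsearch query body (hypothetical field names).

    query_phonemes: space-separated phoneme string for the search term,
                    assumed to come from the same G2P model used at index time.
    lat, lon:       origin used for the geographic distance decay.
    """
    return {
        "query": {
            "function_score": {
                # Text relevance on the (assumed) phoneme field.
                "query": {"match": {"phonemes": query_phonemes}},
                # Gaussian decay on the (assumed) geo_point field.
                "functions": [
                    {
                        "gauss": {
                            "location": {
                                "origin": {"lat": lat, "lon": lon},
                                "scale": scale,
                            }
                        }
                    }
                ],
                # Multiply the match score by the decay factor.
                "boost_mode": "multiply",
            }
        }
    }


body = build_place_query("L AH N D AH N", 51.5074, -0.1278)
```

The resulting `body` could be passed to whichever Elasticsearch client we already use; other facets would be added as further clauses or functions.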
Tomer Sagi: That could actually work, although if you are working on a database-based approach, I would recommend considering a vector database and storing the mBERT embedding of the name as another way to generate candidates. In general, you can separate the candidate search from the matching process: collect candidates from phoneme similarity, exact matches, transliterations, etc., and then use a matching system to rank the candidates and throw out obvious non-matches (e.g., places that are 1000 km from each other).
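A minimal sketch of that two-stage split, under stated assumptions: each candidate generator is just a function returning place dicts with `id` and `coords` keys (hypothetical shapes), and the "matching system" is stood in for by a simple rule that drops anything beyond a distance threshold and ranks by how many generators proposed the place, then by proximity. A real matcher would presumably be more sophisticated; this only illustrates the separation of concerns.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def collect_candidates(query, generators):
    """Stage 1: union candidates from every generator, keyed by place id.

    generators: dict mapping a source name (e.g. "phoneme", "exact") to a
    function query -> list of place dicts. Names and shapes are assumptions.
    """
    pool = {}
    for name, generate in generators.items():
        for place in generate(query):
            entry = pool.setdefault(place["id"], {"place": place, "sources": set()})
            entry["sources"].add(name)
    return pool


def match(query, pool, max_km=1000.0):
    """Stage 2: drop obvious non-matches, then rank what is left.

    Placeholder ranking: more agreeing sources first, then nearer places.
    """
    kept = []
    for entry in pool.values():
        dist = haversine_km(query["coords"], entry["place"]["coords"])
        if dist > max_km:
            continue  # too far away to be the same place
        kept.append((len(entry["sources"]), -dist, entry["place"]["id"]))
    kept.sort(reverse=True)
    return [place_id for _, _, place_id in kept]
```

An embedding-based generator (for example, nearest neighbours over mBERT vectors) would slot in as one more entry in `generators` without changing the matching stage.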
Tomer Sagi et al: Utilizing Phonetic Similarity for Cross-source and Cross-language Toponym Matching - a Benchmark and Prototype (Preprint)