dijxtra / simplepyged

A simple Python GEDCOM parser
GNU General Public License v3.0

Levenshtein name comparisons #7

Open BrJohan opened 10 years ago

BrJohan commented 10 years ago

I would like to suggest adding the ability to compare persons' names using the Levenshtein distance algorithm. See http://en.wikipedia.org/wiki/Levenshtein_distance

My genealogical 'research' is primarily related to Sweden. Very often a person's name is spelled slightly differently in various source documents.

Example: Kristina - Cristina - Christina - Chrestina - Christine

Using this suggested algorithm and allowing some (fairly small) maximum distance would be most helpful when trying to find duplicate persons in my database.
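
To make the suggestion concrete, here is a minimal sketch of what such a comparison could look like. It is plain Python and does not use any simplepyged API; the function names `levenshtein` and `similar_names` and the default `max_distance=2` are only illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the Levenshtein (edit) distance between two strings."""
    # Keep the shorter string in the inner loop so the row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similar_names(name, candidates, max_distance=2):
    """Hypothetical helper: candidates within max_distance of name.

    The comparison is case-insensitive; the threshold is an assumption
    and would need tuning against real data.
    """
    key = name.casefold()
    return [c for c in candidates
            if levenshtein(key, c.casefold()) <= max_distance]


variants = ["Cristina", "Christina", "Chrestina", "Christine"]
for variant in variants:
    print(variant, levenshtein("Kristina", variant))
```

With the names from the example, "Cristina" is 1 edit away from "Kristina" and "Christina" is 2, while "Chrestina" and "Christine" are 3 each, so the maximum distance would have to be chosen with some care.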

MinchinWeb commented 9 years ago

I think you might be better off with Soundex or something similar. Soundex assigns a code to each word, such that words that are pronounced the same are assigned the same code.

The fuzzy library ( https://pypi.python.org/pypi/Fuzzy ) might be a good place to start.
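
For illustration, here is a rough pure-Python sketch of the classic American Soundex rules. It does not use the Fuzzy library, whose API is not shown in this thread, and the function name `soundex` is just an assumption for this example. Letters outside A-Z (for example å, ä, ö) are simply dropped here, which is a limitation for Swedish names.

```python
# Map each coded consonant to its Soundex digit; vowels, H, W and Y have no code.
_CODES = {}
for _digit, _letters in enumerate(
        ("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for _ch in _letters:
        _CODES[_ch] = str(_digit)


def soundex(name: str) -> str:
    """Return a 4-character American Soundex code, or "" if no A-Z letters."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    result = letters[0]                     # the first letter is kept as-is
    last_code = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                        # H and W do not separate equal codes
        code = _CODES.get(ch, "")
        if code and code != last_code:
            result += code
        last_code = code                    # a vowel resets this to ""
    return (result + "000")[:4]             # pad with zeros, cut to 4 characters


for name in ["Kristina", "Cristina", "Christina", "Chrestina", "Christine"]:
    print(name, soundex(name))
```

One caveat: Soundex keeps the first letter, so with this sketch "Kristina" codes as K623 while the four C spellings all code as C623. Catching the K/C variants would need an extra normalisation step or a different phonetic algorithm.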