FreeUKGen / MyopicVicar

MyopicVicar (short-sighted clergyman!) is an open-source genealogy record database and search engine. It powers the FreeREG database of parish registers, the FreeCEN database of census records, the next version of the FreeBMD database of Civil Registration indexes, and other genealogical applications.

Non-soundex fuzzy matching #608

Open benwbrum opened 8 years ago

benwbrum commented 8 years ago

In the past, we've discussed using alternatives to SOUNDEX like metaphone and double metaphone for fuzzy matching.

I believe that we should consider these for the work on the internals of the search engine.

Sherlock21 commented 8 years ago

I was surprised to discover (from another report) that Soundex only encodes the characters after the first, so the researcher has to get the first letter correct, or know that this is how Soundex is limited. It would be good if you could get around that hidden snag, please.
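
To make the first-letter limitation concrete, here is a minimal Python sketch of standard American Soundex. It is an illustration only, not the project's own encoder; the names `SOUNDEX_DIGITS` and `soundex` are assumptions made for this example.

```python
# Standard American Soundex letter-to-digit table (vowels, H, W, Y have no digit).
SOUNDEX_DIGITS = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name):
    """Return the 4-character Soundex code, or None if there are no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return None
    code = letters[0]  # the first letter is kept verbatim, never encoded
    prev = SOUNDEX_DIGITS.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_DIGITS.get(ch, "")
        if digit and digit != prev:
            code += digit
            if len(code) == 4:
                break
        if ch not in "HW":  # H and W do not separate letters with equal digits
            prev = digit
    return code.ljust(4, "0")


print(soundex("Catherine"))  # C365
print(soundex("Katherine"))  # K365
```

The two spellings share the digits `365` but differ in the leading letter, so a plain Soundex comparison would not match them.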

dougkdev commented 8 years ago

While working on search support for multiple-part surnames, I discovered that some of our Soundex codings on forenames are null in the search records. I see it when a name has multiple parts and the space occurs before all three Soundex digits are filled. "Ann Elizabeth", for example, would be null, but the individual parts after separation, "Ann" and "Elizabeth", are coded OK. I'm not sure how important that is to fix, but we should at least be aware of it.

dougkdev commented 8 years ago

We could either ignore the space and continue coding the sounds of the second name, or we could assume the encoding should stop at the space, in which case we don't need to do anything since the soundex for "Ann" will already be included in the list of search name soundex encodings.
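
The two options can be compared with a small Python sketch. This is only an illustration under stated assumptions: it uses standard American Soundex rather than the project's actual encoder, and `soundex`, `code_ignoring_space`, and `code_stopping_at_space` are hypothetical names.

```python
# Standard American Soundex letter-to-digit table.
SOUNDEX_DIGITS = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name):
    """Return the 4-character Soundex code, or None if there are no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]  # drops spaces too
    if not letters:
        return None
    code = letters[0]
    prev = SOUNDEX_DIGITS.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_DIGITS.get(ch, "")
        if digit and digit != prev:
            code += digit
            if len(code) == 4:
                break
        if ch not in "HW":
            prev = digit
    return code.ljust(4, "0")


def code_ignoring_space(name):
    """Option 1: skip the space and keep coding into the second name."""
    return soundex(name)


def code_stopping_at_space(name):
    """Option 2: stop at the first space, so only the first part is coded."""
    parts = name.split()
    return soundex(parts[0]) if parts else None


print(code_ignoring_space("Ann Elizabeth"))     # A542
print(code_stopping_at_space("Ann Elizabeth"))  # A500
print([soundex(p) for p in "Ann Elizabeth".split()])  # ['A500', 'E421']
```

With the second option the combined name yields the same code as "Ann" alone, which is why no extra work would be needed if the separated parts are already encoded.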

benwbrum commented 8 years ago

I suspect that, if we're handling "Ann" and "Elizabeth" correctly, we just shouldn't add a name pair for (nil, "surname") here at all.

dougkdev commented 8 years ago

I added a check so the Soundex name triple is only added to the search record if the Soundex code isn't nil (for both first and last name). Checked into the wildcard branch.
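
A guard of this kind could look roughly like the Python sketch below. It is not the code on the wildcard branch: the function `add_soundex_name`, the field names, and the `name_type` value are all hypothetical, and `None` stands in for Ruby's `nil`.

```python
def add_soundex_name(names, first_code, last_code, name_type="primary"):
    """Append a (first, last, type) Soundex triple only when both codes exist.

    Returns True if the triple was added and False if it was skipped.
    """
    if first_code is None or last_code is None:
        return False
    names.append(
        {"first_name": first_code, "last_name": last_code, "type": name_type}
    )
    return True


search_soundex = []
add_soundex_name(search_soundex, None, "S530")    # skipped: forename code is nil
add_soundex_name(search_soundex, "A500", "S530")  # kept: "Ann" with the surname
add_soundex_name(search_soundex, "E421", "S530")  # kept: "Elizabeth" with the surname
print(len(search_soundex))  # 2
```

Skipping the nil entry loses nothing as long as the separated parts are added on their own, as in the last two calls.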
