bakape / hydron

media tagger and organizer
GNU Lesser General Public License v3.0

Fuzzy tag match #59

Open DonaldTsang opened 5 years ago

DonaldTsang commented 5 years ago

Match tags with similar pronunciation or spelling, similar to https://gitgud.io/Dizmal/borehole

bakape commented 5 years ago

Accounting for spelling mistakes would lead to too much noise. Use substring matching.

DonaldTsang commented 5 years ago

@bakape I would recommend looking into string metrics (https://en.wikipedia.org/wiki/String_metric); many of these algorithms account for spelling mistakes. A simpler way would be to use phonetic encoding (https://en.wikipedia.org/wiki/Phonetic_encoding), which reduces complexity, assuming you know what most tags look like phonetically.
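To make the two approaches concrete, here is a minimal Python sketch, not code from hydron. It uses Levenshtein distance as one example of a string metric and American Soundex as one example of a phonetic encoding. Both are picked purely for illustration, the function names are hypothetical, and Soundex is tuned for English, so it may not suit every tag vocabulary.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


# Consonant groups used by American Soundex.
_SOUNDEX_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def soundex(word: str) -> str:
    """American Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate same-coded consonants
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code counts again
    return (result + "000")[:4]


if __name__ == "__main__":
    print(levenshtein("smith", "smyth"))      # 1
    print(soundex("smith"), soundex("smyth"))  # S530 S530
```

The trade-off under discussion shows up here: an edit distance has to be computed against each candidate tag at query time, while a phonetic code is computed once per tag and can then be compared as an ordinary string.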

bakape commented 5 years ago

Thanks for the suggestions, but I only intend to use the facilities available in the database system. Whatever I'd pick would also have to be indexed off of the tags in the DB. Substring matching fits this use case.

DonaldTsang commented 5 years ago

@bakape In this case, phonetic-encoded substrings would be useful and would avoid adding string metric functions. All that is required is an extra column in the tag database that holds a phonetic encoding of each tag.
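The extra-column idea could look roughly like the sketch below. It is a sketch under assumptions, not hydron's real schema: the `tags` table, the `phonetic` column, the index name, and the helper functions are all hypothetical, and SQLite and Soundex are used only because they make the example self-contained. The phonetic code is computed in the application when a tag is inserted, so the database only needs an ordinary index and an equality comparison.

```python
import sqlite3

# Consonant groups used by American Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """American Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate same-coded consonants
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical tag table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tags (tag TEXT PRIMARY KEY, phonetic TEXT NOT NULL)")
    # A plain index on the extra column keeps lookups inside the database.
    conn.execute("CREATE INDEX tags_phonetic ON tags (phonetic)")
    return conn


def add_tag(conn: sqlite3.Connection, tag: str) -> None:
    """Store the tag together with its phonetic code, computed once on insert."""
    conn.execute("INSERT OR IGNORE INTO tags (tag, phonetic) VALUES (?, ?)",
                 (tag, soundex(tag)))


def find_similar(conn: sqlite3.Connection, query: str) -> list:
    """Return stored tags whose phonetic code equals that of the query."""
    rows = conn.execute("SELECT tag FROM tags WHERE phonetic = ? ORDER BY tag",
                        (soundex(query),))
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = open_db()
    for tag in ("smith", "smyth", "landscape", "portrait"):
        add_tag(conn, tag)
    print(find_similar(conn, "smithe"))  # ['smith', 'smyth']
```

Whether this produces more noise than plain substring matching would depend on the tag vocabulary, which is the concern raised earlier in the thread.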