derekantrican / MountainProject

A scraper and reddit bot for the website MountainProject.com

[BOT] Fuzzy String Matching #7

Open derekantrican opened 5 years ago

derekantrican commented 5 years ago

For instance, in https://www.reddit.com/r/climbing/comments/715awf/red_rock_season_is_back_cruisin_up_yak_crack_511c/ the user mentioned that they were climbing "Yak Crack". While there are some results with that name, the route they meant is actually "Yaak Crack". We should implement fuzzy query matching so that "Yaak Crack" also comes up in the results list.

derekantrican commented 5 years ago

We could solve this by upgrading the MountainProjectDataSearch.StringMatch function to something similar to:

private static bool StringMatch(string inputString, string targetString, bool caseInsensitive = true)
{
    string input = inputString;
    string target = targetString;

    if (caseInsensitive)
    {
        input = input.ToLower();
        target = target.ToLower();
    }

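    // An exact substring match always counts
    if (target.Contains(input))
        return true;

    // Otherwise accept targets within 3 single-character edits of the input
    // (assumes a Levenshtein helper that returns the edit distance)
    return Levenshtein(target, input) <= 3;
}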

This would check whether 3 or fewer single-character changes are enough to "fix" the string. We can adjust this limit as needed, but not by much: a large limit will start matching unrelated items.
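The snippet above calls a Levenshtein helper that isn't defined anywhere in this thread. A minimal sketch of what it could look like, assuming a plain two-row dynamic-programming implementation (the EditDistance class name is hypothetical, not something in the repo):

```csharp
using System;

public static class EditDistance
{
    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions, or substitutions needed to turn a into b.
    public static int Levenshtein(string a, string b)
    {
        // previous[j] = distance between the first i-1 chars of a and the first j chars of b
        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];

        // Turning an empty string into the first j chars of b takes j insertions
        for (int j = 0; j <= b.Length; j++)
            previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            // Turning the first i chars of a into an empty string takes i deletions
            current[0] = i;

            for (int j = 1; j <= b.Length; j++)
            {
                int substitutionCost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1,        // delete a[i-1]
                             current[j - 1] + 1),    // insert b[j-1]
                    previous[j - 1] + substitutionCost);
            }

            // The row just computed becomes the "previous" row for the next i
            (previous, current) = (current, previous);
        }

        return previous[b.Length];
    }
}
```

With this, EditDistance.Levenshtein("yak crack", "yaak crack") is 1 (one inserted "a"), so the <= 3 check would accept it.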

derekantrican commented 2 years ago

https://github.com/Turnerj/Quickenshtein is a C# Levenshtein implementation that should be pretty quick

derekantrican commented 2 years ago

This will be a bit more complicated than the snippet above. While that works for https://old.reddit.com/r/climbing/comments/uq7ej2/sent_my_first_v8_thin_lizzy_in_joshua_tree, it doesn't work for the "Yak Crack" example. With a Levenshtein distance of <= 3, "Yak Crack" can match just about any "___ Crack" route, which means the number of results can exceed our cutoff of !searchResult.IsEmpty() && searchResult.AllResults.Count < 75 (i.e. 75 is too low).
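To make the over-matching concrete, here is a small sketch. The route list is made up for illustration (apart from "yaak crack" and "thin lizzy", which come from the examples in this thread), and OverMatchDemo and FuzzyMatches are hypothetical names, not part of MountainProjectDataSearch:

```csharp
using System;
using System.Linq;

public static class OverMatchDemo
{
    // Illustrative data only: "fat crack" and "big crack" are invented names
    public static readonly string[] Routes =
        { "yaak crack", "fat crack", "big crack", "thin lizzy" };

    // Returns every route within maxDistance edits of the query
    public static string[] FuzzyMatches(string query, int maxDistance) =>
        Routes.Where(route => Distance(route, query) <= maxDistance).ToArray();

    // Single-row Levenshtein distance (insert, delete, substitute each cost 1)
    public static int Distance(string a, string b)
    {
        var row = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++)
            row[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            int diagonal = row[0]; // previous row's value at j-1
            row[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int above = row[j]; // previous row's value at j
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                row[j] = Math.Min(Math.Min(above + 1, row[j - 1] + 1), diagonal + cost);
                diagonal = above;
            }
        }
        return row[b.Length];
    }
}
```

Here FuzzyMatches("yak crack", 3) returns all three "crack" routes, because swapping out a three-letter first word never costs more than 3 edits, while FuzzyMatches("yak crack", 1) returns only "yaak crack". In a real dataset, every short "___ Crack" name would be pulled in the same way.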