JakeBayer / FuzzySharp

C# .NET fuzzy string matching implementation of SeatGeek's well-known Python FuzzyWuzzy algorithm.
MIT License
645 stars 80 forks

Extract method with `(string query, IEnumerable<T> choices)` signature #46

Open will-molloy opened 1 year ago

will-molloy commented 1 year ago

Currently the Process.Extract... methods have 2 signatures:

1: string query, IEnumerable<string> choices:

  public static IEnumerable<ExtractedResult<string>> ExtractAll(
      string query, 
      IEnumerable<string> choices, 
      Func<string, string> processor = null, 
      IRatioScorer scorer = null,
      int cutoff = 0)

and 2: T query, IEnumerable<T> choices:

  public static IEnumerable<ExtractedResult<T>> ExtractAll<T>(
      T query, 
      IEnumerable<T> choices,
      Func<T, string> processor,
      IRatioScorer scorer = null,
      int cutoff = 0)

In my case the user enters a string to filter a List<T> of objects.

I can use 1 if I convert the objects to strings first, collect the results into a HashSet<string>, and use that to filter the original List<T>:

  public static IEnumerable<Dto> Example1(string query, IEnumerable<Dto> list)
  {
      var set = Process.ExtractAll(query, list.Select(x => x.Name))
          .Select(result => result.Value)
          .ToImmutableHashSet();
      return list.Where(dto => set.Contains(dto.Name));
  }

Or 2 if I create a dummy T query object from the string entered by the user:

  public static IEnumerable<Dto> Example2(string query, IEnumerable<Dto> list)
  {
      var dummy = new Dto(query);
      return Process.ExtractAll(dummy, list, dto => dto.Name)
          .Select(result => result.Value);
  }

The 2nd one isn't that bad... but tbh I struggle to think of a case where you would have a T query? Especially since the Func<T, string> processor is required for this overload.

So I think a signature like this would be useful:

  public static IEnumerable<ExtractedResult<T>> ExtractAll<T>(
      string query, 
      IEnumerable<T> choices,
      Func<T, string> processor,
      IRatioScorer scorer = null,
      int cutoff = 0)

To be used like:

  public static IEnumerable<Dto> Example3(string query, IEnumerable<Dto> list)
  {
      return Process.ExtractAll(query, list, dto => dto.Name)
          .Select(result => result.Value);
  }
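
Until such an overload exists, the same effect can probably be had on top of overload 2 by wrapping the query and each choice in a tuple. The following is only a minimal sketch under stated assumptions: `ProcessExtensions` and `ExtractAllByText` are hypothetical names that are not part of FuzzySharp, `Dto` is assumed to be a simple record with a `Name` property, and the result type is assumed to expose `Score` next to the `Value` used above. It is not how the library or any fork implements this.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using FuzzySharp;

// Hypothetical DTO mirroring the one used in the examples above.
public record Dto(string Name);

public static class ProcessExtensions
{
    // Hypothetical helper, not part of FuzzySharp: string query, IEnumerable<T> choices.
    public static IEnumerable<(T Value, int Score)> ExtractAllByText<T>(
        string query,
        IEnumerable<T> choices,
        Func<T, string> processor,
        int cutoff = 0)
    {
        // Pair each choice with its processed text; the processor runs on choices only.
        var wrapped = choices.Select(c => (Choice: c, Text: processor(c)));

        // Wrap the raw query the same way; no dummy T has to be constructed.
        var wrappedQuery = (Choice: default(T), Text: query);

        // Overload 2 from above, with T bound to the tuple type.
        return Process.ExtractAll(wrappedQuery, wrapped, w => w.Text, cutoff: cutoff)
            .Select(r => (Value: r.Value.Choice, Score: r.Score));
    }
}
```

A call such as `ProcessExtensions.ExtractAllByText(query, list, dto => dto.Name)` then returns the matching `Dto` objects with their scores, without a dummy object and without a second pass over the list.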

will-molloy commented 1 year ago

Another problem with (2) is that you may want to preprocess the dataset but not the user's input.

So you may have to fall back to (1), which needs the unnecessary hash set filter (really bad with a large dataset).

ycherkes commented 3 months ago

@will-molloy If this is still relevant, you can use my fork, where I implemented such a method: https://www.nuget.org/packages/Raffinert.FuzzySharp

I also created a PR to this repository.