MaartenGr / PolyFuzz

Fuzzy string matching, grouping, and evaluation.
https://maartengr.github.io/PolyFuzz/
MIT License
725 stars, 68 forks

Analyse precision recall curve #59

Open KoenLoeffen opened 1 year ago

KoenLoeffen commented 1 year ago

I have two questions:

  1. The precision-recall curve is a trade-off between the min similarity and the percentage matched. So in the ideal case you want both precision and recall to be as high as possible. However, I found in my results that the model with the highest precision and recall isn't always the best. Am I missing something?
  2. How would I set the optimal threshold for the similarity? Is this also based on the precision recall curve?

MaartenGr commented 1 year ago

> The precision-recall curve is a trade-off between the min similarity and the percentage matched. So in the ideal case you want both precision and recall to be as high as possible. However, I found in my results that the model with the highest precision and recall isn't always the best. Am I missing something?

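The precision-recall curve in PolyFuzz is an approximation because no ground truth is available: the minimum similarity stands in for precision and the percentage matched stands in for recall. Higher is still better, but since neither quantity is measured against labeled data, the model with the best-looking curve is not guaranteed to produce the best matches.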
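To make the approximation concrete, here is a minimal sketch of how such a curve could be computed from similarity scores alone. It is not PolyFuzz's internal code; the function name `approximate_pr_curve` and the example scores are hypothetical and only illustrate the idea of using the threshold as a precision proxy and the share of matches kept as a recall proxy.

```python
def approximate_pr_curve(similarities, thresholds):
    """Return (min_similarity, fraction_matched) pairs.

    No ground truth is involved: the threshold acts as a proxy for
    precision and the share of matches scoring at or above it acts
    as a proxy for recall.
    """
    total = len(similarities)
    if total == 0:
        return []
    curve = []
    for threshold in thresholds:
        kept = sum(1 for score in similarities if score >= threshold)
        curve.append((threshold, kept / total))
    return curve


# Hypothetical similarity scores, one per matched string pair.
scores = [0.95, 0.90, 0.72, 0.60, 0.31]
curve = approximate_pr_curve(scores, [0.5, 0.7, 0.9])
```

Nothing in this calculation checks whether a match is actually correct, which is why two models can rank differently on the curve than they do on manually inspected results.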

> How would I set the optimal threshold for the similarity? Is this also based on the precision recall curve?

Yes, that is the main purpose of the precision-recall curve as defined in PolyFuzz. It shows which similarity threshold you would need to keep a given share of the matches, and gives a relative indication of how accurate the results are at that threshold.
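As a rough illustration of reading a threshold off the curve, the sketch below picks the highest minimum similarity that still keeps a target share of the matches. The helper `threshold_for_match_rate` and the scores are hypothetical and not part of PolyFuzz; in practice you would still inspect a sample of matches around the chosen threshold, since the curve is only an approximation.

```python
import math


def threshold_for_match_rate(similarities, target_fraction):
    """Highest min similarity that keeps at least target_fraction of matches."""
    if not similarities:
        raise ValueError("similarities must not be empty")
    ranked = sorted(similarities, reverse=True)
    # Number of matches that must remain above the threshold.
    needed = max(1, math.ceil(target_fraction * len(ranked)))
    return ranked[needed - 1]


# Hypothetical similarity scores, one per matched string pair.
scores = [0.95, 0.90, 0.72, 0.60, 0.31]
threshold = threshold_for_match_rate(scores, 0.5)
kept = [s for s in scores if s >= threshold]
```

Here the target is to keep at least half of the matches, so the threshold lands on the third-highest score and everything below it is dropped.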