gbif / checklistbank

GBIF Checklist Bank
Apache License 2.0

confidence score is 100 when no species are matched. #253

Closed Clara-liu closed 1 year ago

Clara-liu commented 1 year ago

As the title states, when a name search finds no match, the API returns a confidence score of 100 rather than 0. For example:

curl -X GET "https://api.gbif.org/v1/species/match?scientificName=Heteropterinae"

returns

{"confidence":100,"note":"No match because of too little confidence","matchType":"NONE","synonym":false}

This makes it difficult to filter match results by confidence score.

ManonGros commented 1 year ago

@Clara-liu, what would you recommend? Should it be 0 when there is no match?

Would filtering by matchType first help?

Note that in the documentation name is used instead of scientificName: https://api.gbif.org/v1/species/match?verbose=true&name=Heteropterinae
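Filtering by matchType first could be sketched roughly as below. This is only an illustration, not GBIF client code: the function name `is_usable_match`, the threshold of 90, and the `fuzzy` sample record are assumptions made up for the example; only the `no_match` record is the response quoted in this issue.

```python
def is_usable_match(result, min_confidence=90):
    """Return True only for a real match that meets the confidence threshold."""
    # Check matchType first: a NONE result reports confidence 100,
    # so it must be rejected before the number is looked at.
    if result.get("matchType", "NONE") == "NONE":
        return False
    # Only then apply the numeric threshold (90 is an arbitrary example value).
    return result.get("confidence", 0) >= min_confidence


# Response quoted in this issue for Heteropterinae.
no_match = {
    "confidence": 100,
    "note": "No match because of too little confidence",
    "matchType": "NONE",
    "synonym": False,
}
# Hypothetical fuzzy result; the values are made up for illustration.
fuzzy = {"confidence": 85, "matchType": "FUZZY", "synonym": False}

print(is_usable_match(no_match))                  # False
print(is_usable_match(fuzzy))                     # False
print(is_usable_match(fuzzy, min_confidence=80))  # True
```

With this ordering the confidence threshold only ever applies to records that actually matched something.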

Clara-liu commented 1 year ago

@ManonGros Thank you for the swift reply.

Yes, I believe that confidence should be 0 when there is no match.

matchType covers a wide range of match quality. As you can see here, results with a confidence of 18 to 85 are all categorised as FUZZY. I have also encountered examples with a confidence of 96 that are categorised as FUZZY.

I will use name from now on, thank you!

MortenHofft commented 1 year ago

Personally, I find that "100% confident that there is no match" makes sense. Either way, we probably shouldn't change this. It is a breaking change that might break other people's scripts if they use the API. Just like you, they might have logic that uses that number. Any change would have to go with a change in API version.

I will move the suggestion to the checklistbank repository.

Clara-liu commented 1 year ago

@MortenHofft I originally thought the same but later realised it contradicts the note saying “no match because of too little confidence”.

I’ll have to add some logic to circumvent this for now. Thanks!
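One possible shape for such client-side logic, shown only as a sketch: the helper name `effective_confidence` is hypothetical, and treating a NONE match as 0 is the convention requested in this issue, not something the API does.

```python
def effective_confidence(result):
    """Confidence as the client wants to see it: 0 when nothing matched."""
    # The API reports 100 for matchType NONE ("certain there is no match");
    # map that case to 0 so a single numeric filter behaves as expected.
    if result.get("matchType") == "NONE":
        return 0
    return result.get("confidence", 0)


# Response quoted at the top of this issue.
heteropterinae = {
    "confidence": 100,
    "note": "No match because of too little confidence",
    "matchType": "NONE",
    "synonym": False,
}
print(effective_confidence(heteropterinae))  # 0
```

The raw response is left untouched, so the original confidence stays available if the API's meaning is needed later.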

mdoering commented 1 year ago

As @MortenHofft said, it is intended to mean that we are 100% sure we have no match. We implement a cutoff, so you never see matches with very low confidence. The Oenanthe example you gave above lists discarded alternative options with low confidence because you asked for a verbose match. Checking the match type is clearly the first and most important thing a client should do. The confidence is a score based on various parameters, not just how close the name is. Its constituents are the similarity of name, rank, author, status and classification, plus the presence or absence of at least one other close match. You can see them in the notes of a verbose match:

http://api.gbif.org/v1/species/match?name=Abies%20alba&verbose=true

http://api.gbif.org/v1/species/match?name=Abies%20albas&verbose=true

http://api.gbif.org/v1/species/match?name=Abies%20albas&family=Pinaceae&verbose=true
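For anyone scripting such comparisons, the query URLs can be assembled with the Python standard library. This is a minimal sketch: `build_match_url` is a hypothetical helper, and it uses only the `name`, `family` and `verbose` parameters that appear in this thread. It builds the URL and does not send a request.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.gbif.org/v1/species/match"


def build_match_url(name, verbose=False, **classification):
    """Build a species/match URL; extra keywords (e.g. family) become query parameters."""
    params = {"name": name}
    params.update(classification)  # e.g. family="Pinaceae"
    if verbose:
        # verbose=true asks the API to include its matching notes.
        params["verbose"] = "true"
    # quote_via=quote encodes spaces as %20 rather than "+".
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


for url in (
    build_match_url("Abies alba", verbose=True),
    build_match_url("Abies albas", verbose=True),
    build_match_url("Abies albas", verbose=True, family="Pinaceae"),
):
    print(url)
```

Comparing the notes returned for the misspelled name with and without the family hint shows how the classification contributes to the score.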