CatalogueOfLife / backend

Complete backend of COL ChecklistBank
Apache License 2.0

Inferred rank from name parser causing differences with rankSimilarity #1316

Open djtfmartin opened 5 months ago

djtfmartin commented 5 months ago

The rank inferred by the name parser is causing differences in the rankSimilarity measure, which is used to calculate the numeric confidence value in the ported API.

The net result of this is differences in the confidence value associated with matches, which in turn creates differences between the ported API (matching-ws) and the current GBIF API.

As an example, Holothuroidea (a class) is inferred to be a superfamily from the structure of the name, presumably because of its -oidea ending.

Current GBIF API:

{
  "usageKey": 222,
  "scientificName": "Holothuroidea",
  "canonicalName": "Holothuroidea",
  "rank": "CLASS",
  "status": "ACCEPTED",
  "confidence": 85,
  "note": "Similarity: name=100; authorship=0; classification=-2; rank=-19; status=1; nextMatch=5",
  "matchType": "EXACT"
}

Ported API, which infers superfamily and doesn't match because the confidence value of 69 is below the threshold of 80:

"diagnostics": {
  "matchType": "EXACT",
  "confidence": 69,
  "status": "ACCEPTED",
  "note": "Similarity: name=100; authorship=0; classification=-2; rank=-35; status=1; score=64; nextMatch=5"
}
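
To see which component drives the difference, the two `note` strings above can be compared key by key. The following is only a rough sketch: `parse_similarity_note` is a hypothetical helper written for this comparison, not part of matching-ws or the GBIF API, and it assumes the note always has the `key=value; ...` shape shown in the two responses.

```python
def parse_similarity_note(note: str) -> dict:
    """Parse a 'Similarity: key=value; ...' note into a dict of ints.

    Hypothetical helper; assumes the format seen in the responses above.
    """
    prefix = "Similarity:"
    body = note[len(prefix):] if note.startswith(prefix) else note
    scores = {}
    for part in body.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        scores[key.strip()] = int(value)
    return scores


# Notes copied from the two responses in this issue.
gbif_note = ("Similarity: name=100; authorship=0; classification=-2; "
             "rank=-19; status=1; nextMatch=5")
ported_note = ("Similarity: name=100; authorship=0; classification=-2; "
               "rank=-35; status=1; score=64; nextMatch=5")

gbif = parse_similarity_note(gbif_note)
ported = parse_similarity_note(ported_note)

# Keep only the components present in both notes whose values differ.
differences = {
    key: (gbif[key], ported[key])
    for key in gbif
    if key in ported and gbif[key] != ported[key]
}
print(differences)  # {'rank': (-19, -35)}
```

For this example, the rank component is the only shared one that differs between the two APIs.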

djtfmartin commented 5 months ago

For now, I've changed to only use inferred ranks from the parsed name for binomials and trinomials. This has fixed a number of matches (over 200) in the test set.

For the example in this issue, the response now looks like this:

{
  "usageKey": 222,
  "scientificName": "Holothuroidea",
  "canonicalName": "Holothuroidea",
  "rank": "CLASS",
  "status": "ACCEPTED",
  "confidence": 94,
  "note": "Similarity: name=100; authorship=0; classification=-2; rank=0; status=1; score=99; nextMatch=5"
}
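
The change described in this comment, using the inferred rank only for binomials and trinomials, could be sketched roughly as follows. This is an illustration, not the actual matching-ws code: the suffix table, the rank names returned for binomials and trinomials, and the function names `inferred_rank` and `rank_for_matching` are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical suffix rules; the real name parser has its own, larger rule set.
SUFFIX_RANKS = {
    "oidea": "SUPERFAMILY",
    "idae": "FAMILY",
    "inae": "SUBFAMILY",
}


def inferred_rank(name: str) -> Optional[str]:
    """Rank a parser might infer purely from the structure of the name."""
    parts = name.split()
    if len(parts) == 1:
        lowered = parts[0].lower()
        for suffix, rank in SUFFIX_RANKS.items():
            if lowered.endswith(suffix):
                return rank
        return None
    if len(parts) == 2:
        return "SPECIES"  # assumed rank for a binomial
    if len(parts) == 3:
        return "INFRASPECIFIC_NAME"  # assumed rank for a trinomial
    return None


def rank_for_matching(name: str) -> Optional[str]:
    """Use the inferred rank only for binomials and trinomials."""
    if len(name.split()) in (2, 3):
        return inferred_rank(name)
    # Uninomials such as "Holothuroidea" no longer contribute an inferred
    # rank, so a suffix-based guess cannot penalise the rank similarity.
    return None


print(inferred_rank("Holothuroidea"))      # SUPERFAMILY
print(rank_for_matching("Holothuroidea"))  # None
print(rank_for_matching("Puma concolor"))  # SPECIES
```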

djtfmartin commented 5 months ago

Many of the changes in rankSimilarity are due to the GBIF nameparser Rank enum having more entries (116 vs 75). Because the enum ordinal is used to calculate the difference between two ranks, the larger enum changes the rankSimilarity scores.
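
A small sketch of why the enum size matters when the ordinal is used as a distance. The two rank lists below are shortened, made-up stand-ins for the 75-entry and 116-entry enums, and `ordinal_distance` is a hypothetical function, not the real rankSimilarity formula. It only shows that inserting extra ranks between two entries increases their ordinal difference.

```python
# Made-up, shortened rank lists standing in for the two real enums.
SHORT_RANKS = [
    "KINGDOM", "PHYLUM", "CLASS", "ORDER",
    "SUPERFAMILY", "FAMILY", "GENUS", "SPECIES",
]
LONG_RANKS = [
    "KINGDOM", "SUBKINGDOM", "PHYLUM", "SUBPHYLUM",
    "CLASS", "SUBCLASS", "INFRACLASS", "SUPERORDER",
    "ORDER", "SUBORDER", "INFRAORDER", "SUPERFAMILY",
    "FAMILY", "SUBFAMILY", "GENUS", "SPECIES",
]


def ordinal_distance(ranks: list, a: str, b: str) -> int:
    """Absolute difference between the positions (ordinals) of two ranks."""
    return abs(ranks.index(a) - ranks.index(b))


# The same pair of ranks is further apart in the longer list.
print(ordinal_distance(SHORT_RANKS, "CLASS", "SUPERFAMILY"))  # 2
print(ordinal_distance(LONG_RANKS, "CLASS", "SUPERFAMILY"))   # 7
```

Any score derived from such a distance will differ between the two enums even when the compared ranks are identical, which is consistent with the rank=-19 vs rank=-35 difference reported above.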