gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Diagnostic: Missing names on record #973

Closed by timrobertson100 11 months ago

timrobertson100 commented 11 months ago

This record shows as incertae sedis but the lookup should find the species.

I'll investigate, cc @mdoering

timrobertson100 commented 11 months ago

The lookup cache contains:

hbase(main):001:0> scan 'name_usage_kv', { FILTER => "RowFilter(=, 'substring:Lissotarsus reticulata')" }
ROW                                                       COLUMN+CELL                                                                                                                                                             
 6|||||||||Lissotarsus reticulata Chaudoir, 1842|||||     column=v:j, timestamp=1696041217943, value={"synonym":true,"usage":{"key":9355155,"name":"Lissotarsus reticulatus Chaudoir, 1842","rank":"SPECIES"},"acceptedUsage":{"ke
                                                          y":7811407,"name":"Platyderus reticulatus (Chaudoir, 1842)","rank":"SPECIES"},"classification":[{"key":1,"name":"Animalia","rank":"KINGDOM"},{"key":54,"name":"Arthropod
                                                          a","rank":"PHYLUM"},{"key":216,"name":"Insecta","rank":"CLASS"},{"key":1470,"name":"Coleoptera","rank":"ORDER"},{"key":3792,"name":"Carabidae","rank":"FAMILY"},{"key":3
                                                          260555,"name":"Platyderus","rank":"GENUS"},{"key":7811407,"name":"Platyderus reticulatus","rank":"SPECIES"}],"diagnostics":{"matchType":"FUZZY","confidence":99,"status"
                                                          :"SYNONYM","lineage":[],"alternatives":[]},"iucnRedListCategory":{"category":"NOT_EVALUATED","code":"NE","scientificName":"Lissotarsus reticulatus Chaudoir, 1842","taxo
                                                          nomicStatus":"SYNONYM","acceptedName":"Platyderus reticulatus (Chaudoir, 1842)"},"issues":[]}                                                                           
1 row(s) in 45.7350 seconds

Formatted for readability:

The cell timestamp (1696041217943) is Saturday, September 30, 2023 2:33:37.943 AM UTC.

{
  "synonym":true,
  "usage":{
    "key":9355155,
    "name":"Lissotarsus reticulatus Chaudoir, 1842",
    "rank":"SPECIES"
  },
  "acceptedUsage":{
    "key":7811407,
    "name":"Platyderus reticulatus (Chaudoir, 1842)",
    "rank":"SPECIES"
  },
  "classification":[
    {
      "key":1,
      "name":"Animalia",
      "rank":"KINGDOM"
    },
    {
      "key":54,
      "name":"Arthropoda",
      "rank":"PHYLUM"
    },
    {
      "key":216,
      "name":"Insecta",
      "rank":"CLASS"
    },
    {
      "key":1470,
      "name":"Coleoptera",
      "rank":"ORDER"
    },
    {
      "key":3792,
      "name":"Carabidae",
      "rank":"FAMILY"
    },
    {
      "key":3260555,
      "name":"Platyderus",
      "rank":"GENUS"
    },
    {
      "key":7811407,
      "name":"Platyderus reticulatus",
      "rank":"SPECIES"
    }
  ],
  "diagnostics":{
    "matchType":"FUZZY",
    "confidence":99,
    "status":"SYNONYM",
    "lineage":[],
    "alternatives":[]
  },
  "iucnRedListCategory":{
    "category":"NOT_EVALUATED",
    "code":"NE",
    "scientificName":"Lissotarsus reticulatus Chaudoir, 1842",
    "taxonomicStatus":"SYNONYM",
    "acceptedName":"Platyderus reticulatus (Chaudoir, 1842)"
  },
  "issues":[]
}
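The date quoted above can be checked against the cell timestamp in the scan output. A minimal sketch, assuming the HBase cell timestamp is the usual epoch-milliseconds value; the class and method names are illustrative only:

```java
import java.time.Instant;

/** Illustrative helper (hypothetical name): converts an HBase cell timestamp to a UTC instant. */
final class CellTimestamp {

  /** Assumes the timestamp is milliseconds since the Unix epoch. */
  static Instant toInstant(long epochMillis) {
    return Instant.ofEpochMilli(epochMillis);
  }

  public static void main(String[] args) {
    // Timestamp copied from the scan output above.
    System.out.println(toInstant(1696041217943L)); // prints 2023-09-30T02:33:37.943Z
  }
}
```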

The lookup appears to have worked and the result was cached as expected, but it wasn't included in the interpreted record. Reprocessing yields the same result.

timrobertson100 commented 11 months ago

With @muttcg's help, we have diagnosed this, and it's behaving as intended, @mdoering.

It's dropping into this branch:

      if (usageMatch == null || isEmpty(usageMatch) || checkFuzzy(usageMatch, identification)) {
        // "NO_MATCHING_RESULTS". This
        // happens when we get an empty response from the WS
        addIssue(tr, TAXON_MATCH_NONE);
        tr.setUsage(INCERTAE_SEDIS);
        tr.setClassification(Collections.singletonList(INCERTAE_SEDIS));
      }

The web service is returning a fuzzy match (reticulata vs reticulatus). As we described in this issue, if there are no higher taxa on the record (there aren't in this case), we don't assume a fuzzy match is correct, because accepting such matches produced too many mistakes. This record needs a higher taxon added in order to match.
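The rule described above can be sketched roughly as follows. This is not the actual `checkFuzzy` implementation from the pipelines codebase; the class, enum, and method names are hypothetical, and the record's higher taxa are modelled as a plain collection of strings for illustration.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;

/** Hypothetical sketch of the "distrust fuzzy matches without higher taxa" rule. */
final class FuzzyMatchGuard {

  /** Local stand-in for the match type reported by the name lookup. */
  enum MatchType { EXACT, FUZZY, HIGHERRANK, NONE }

  /** True when the record supplies at least one non-blank higher taxon (kingdom, phylum, ...). */
  static boolean hasHigherTaxon(Collection<String> higherTaxa) {
    return higherTaxa != null
        && higherTaxa.stream().anyMatch(t -> t != null && !t.trim().isEmpty());
  }

  /** A fuzzy match is rejected only when nothing on the record corroborates it. */
  static boolean rejectFuzzy(MatchType matchType, Collection<String> higherTaxa) {
    return matchType == MatchType.FUZZY && !hasHigherTaxon(higherTaxa);
  }

  public static void main(String[] args) {
    // The record in this issue: fuzzy match, no higher taxa supplied.
    System.out.println(rejectFuzzy(MatchType.FUZZY, Collections.emptyList()));
    // The same match once a kingdom is present on the record.
    System.out.println(rejectFuzzy(MatchType.FUZZY, Arrays.asList("Animalia")));
  }
}
```

Under these assumptions the first call reports a rejection (the record falls back to incertae sedis) and the second does not, which is why adding a higher taxon is enough for the record to match.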

I don't think we want to change this behavior - agree?

timrobertson100 commented 11 months ago

As it happens, this is a narrowly scoped dataset (titled "Coleoptera...") so we could add a default of kingdom = Animalia in the registry which would at least improve this dataset.

mdoering commented 11 months ago

Ah, that makes sense. From a user perspective it would be great to be able to understand why this happened, but yes, we should keep the behavior. And we should certainly add a default classification to the dataset. I see this is done already.

timrobertson100 commented 11 months ago

We could add more, but I'll start conservatively.

timrobertson100 commented 11 months ago

Animalia was enough for this example, but there were records being interpreted as Fungi as well, so I added Animalia / Arthropoda / Insecta, and that has put this into better shape.
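For readers unfamiliar with dataset defaults, the effect can be sketched as below. This only illustrates the idea and is not how the registry or the pipelines apply defaults; the class and method names are hypothetical, and it assumes a default fills in a term only when the record leaves that term missing or blank.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: fill missing higher taxa on a record from dataset-level defaults. */
final class DefaultClassification {

  /** Returns a copy of the record where missing or blank terms take the dataset default. */
  static Map<String, String> withDefaults(Map<String, String> record, Map<String, String> defaults) {
    Map<String, String> out = new LinkedHashMap<>(record);
    defaults.forEach((term, value) -> {
      String current = out.get(term);
      if (current == null || current.trim().isEmpty()) {
        out.put(term, value); // assumption: values supplied on the record are never overwritten
      }
    });
    return out;
  }

  public static void main(String[] args) {
    Map<String, String> defaults = new LinkedHashMap<>();
    defaults.put("kingdom", "Animalia");
    defaults.put("phylum", "Arthropoda");
    defaults.put("class", "Insecta");

    Map<String, String> record = new LinkedHashMap<>();
    record.put("scientificName", "Lissotarsus reticulata Chaudoir, 1842");

    // The record now carries higher taxa, so a fuzzy name match can be corroborated.
    System.out.println(withDefaults(record, defaults));
  }
}
```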