gbif / backbone-feedback


Many marine Siphonophora occurrences matched to millipedes #192

Open gbif-portal opened 1 year ago

gbif-portal commented 1 year ago

Many marine Siphonophora occurrences matched to millipedes

Many marine Siphonophora occurrences are given only as genus Siphonophora, which is a homonym whose accepted name is a millipede genus. Of 11142 records, over 10k are marine and therefore not millipedes.


System: Safari 15.4.0 / Mac OS X 10.15.7
Referer: https://www.gbif.org/occurrence/taxonomy?taxon_key=1011357
Window size: width 1276 - height 837
System health at time of feedback: OPERATIONAL

mdoering commented 1 year ago

See https://mobile.twitter.com/rdmpage/status/1584121572045905920 and this helpdesk mail:

GBIF Diplopoda went crazy: https://myriatrix.myspecies.info/myriatrix/diplopoda The problem occurs at the level of genus Siphonophora Brandt, 1837: https://myriatrix.myspecies.info/myriatrix/siphonophora Please, isolate any marine "Siphonophora" occurrences away from Diplopoda. Check for potential higher rank homonyms, e.g., "order Siphonophora". If that is the exact problem, prune the marine occurrences and reconnect them under order Siphonophorae (Animalia: Cnidaria: Hydrozoa): https://www.marinespecies.org/aphia.php?p=taxdetails&id=1371

I am not concerned that much about the effect in Diplopoda downloads, as the error will be noticed and cleaned. I am more worried about marine data and Cnidaria data downloads, where the error may go unnoticed. The issue was reported through a Myriapodology mailing list and I would like to let them know as soon as it is fixed.

mdoering commented 1 year ago

ALA and GBIF have independently (different taxonomic lookups) confused Siphonophora (millipede genus) with Siphonophora (colonial jellyfish class).

image

image

mdoering commented 1 year ago

many marine datasets involved:

image

There are actually 3 Siphonophora genera in the Backbone: https://www.gbif.org/species/search?q=Siphonophora&rank=GENUS&qField=SCIENTIFIC

image

These are the accepted Diplopoda genus, a Diptera synonym and a doubtful Hemiptera name.

An example marine record has nothing but scientificName=Siphonophora, with no further classification to disambiguate.

We can add Siphonophora as a synonym to the backbone (see WoRMS) to potentially catch those records, but @ahahn-gbif it needs at least some indication which Siphonophora is meant. The marine ones should have rank=order at least to disambiguate them from the other genera!

mdoering commented 1 year ago

It would really be desirable to make name matching environment-aware, i.e. you could supply an optional parameter (marine, freshwater, terrestrial) to restrict matching to taxa of those environments. COL has started to capture this, but it is not complete. It seems that just keeping sea vs land distinct would give us a very valuable additional hint for the matching.

https://github.com/CatalogueOfLife/backend/issues/1169
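A minimal sketch of what such an environment hint could do, not how any GBIF or COL service works. The candidate list, the `environments` field and the function name are assumptions made up for illustration, and the marine candidate is hypothetical since the Backbone has no such name yet.

```javascript
// Hypothetical candidates for the name "Siphonophora".
// The environment labels are illustrative assumptions, not Backbone data.
const envCandidates = [
  { name: "Siphonophora", group: "Diplopoda", environments: ["terrestrial"] },
  { name: "Siphonophora", group: "Cnidaria", environments: ["marine"] },
];

// Keep only candidates compatible with the optional environment hint.
// Candidates without any recorded environment are kept, because the
// environment information is known to be incomplete.
function filterByEnvironment(candidates, environment) {
  if (!environment) return candidates;
  return candidates.filter(
    (c) => c.environments.length === 0 || c.environments.includes(environment)
  );
}

const marineMatches = filterByEnvironment(envCandidates, "marine");
console.log(marineMatches.map((c) => c.group).join(", ")); // prints Cnidaria
```

Without a hint the function returns every candidate, so existing callers would see no change in behaviour.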

mdoering commented 1 year ago

https://www.gbif.org/species/203130334

mdoering commented 1 year ago

@ahahn-gbif @timrobertson100 The plankton datasets involved here contain all kinds of taxonomic groups, so it is impossible to provide dataset wide configurations that set the classification and give matching hints:

image

Assuming the dataset publishers also do not know more we are pretty much stuck in this situation with current tooling.

What if we had negative dataset-wide configurations for taxonomic coverage, so that we could declare that this dataset never contains any Diplopoda? The name matching could then receive a new exclusion filter parameter that would allow it to snap to the right Siphonophora. Such a config would likely help in a lot of cases when we receive bad matching reports and should not be terribly difficult to implement.
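A rough sketch of such an exclusion filter on the matching side, assuming each candidate carries the keys of its higher taxa. Only 1011357 (the taxon key in the feedback URL) and 361 (the key used to exclude millipedes later in this thread) come from the thread; the other keys, the field names and the function name are placeholders.

```javascript
// Candidates for "Siphonophora", each with the keys of its higher taxa.
// Keys starting with 9000 are made-up placeholders.
const siphonophoraCandidates = [
  { key: 1011357, label: "millipede genus", higherKeys: [361] },
  { key: 900001, label: "Diptera synonym", higherKeys: [900010] },
  { key: 900002, label: "doubtful Hemiptera name", higherKeys: [900020] },
];

// Drop every candidate that is itself excluded or sits below an excluded taxon.
function applyExclusions(candidates, excludedKeys) {
  const excluded = new Set(excludedKeys);
  return candidates.filter(
    (c) => !excluded.has(c.key) && !c.higherKeys.some((k) => excluded.has(k))
  );
}

const kept = applyExclusions(siphonophoraCandidates, [361]);
console.log(kept.map((c) => c.label).join("; "));
// prints Diptera synonym; doubtful Hemiptera name
```

The filter only removes wrong candidates. Whether a good match remains depends on what else the backbone holds under that name.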

timrobertson100 commented 1 year ago

Makes sense to implement, although here I suggest we start by asking the publishers to provide some higher taxa if they can.

With the forthcoming backbone, this will at least be an ambiguous lookup and snap to Animalia I presume, right?

ahahn-gbif commented 1 year ago

One of the cases where a more prominent / higher severity flag for doubtful taxon matching would be very helpful. Finding concerned datasets, contacting publishers and explaining the need for adding higher taxonomy is certainly a possible solution, but it is also a moving target that is very hard to get/stay on top of. If we could keep track of cases where existing homonymy has a potential of causing particularly bad problems in occurrence record interpretation (cross kingdom or cross other higher groups), and where occurrences come with insufficient (<- to be defined...) higher taxon information, should we raise that to some issue flag of "concern"? Any false positive could still be removed by adding higher taxon information anyway, and it would be much more visible as an issue otherwise.

Archilegt commented 1 year ago

See also how one of four Siphonophora treatments displays no data. https://www.gbif.org/species/1011357/treatments Is that a Plazi treatment? How many broken treatments like that are there in GBIF?

image

mdoering commented 1 year ago

"With the forthcoming backbone, this will at least be an ambiguous lookup and snap to Animalia I presume, right?"

That would be good, but I am not certain it will. There is just one properly accepted name out of the 4 Siphonophora names. I think the accepted one will be preferred in such cases. We could change that, but it would badly affect lots of good matches. And the more rare synonyms we collect in the future, the bigger this problem becomes. So I still think snapping to the single accepted name is best.
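The preference described here can be sketched as follows. This only illustrates the rule as stated in this comment, not the actual matching code. The status labels, the placeholder keys and the function name are assumptions.

```javascript
// Homonym candidates for one name. Only key 1011357 comes from this thread.
const homonyms = [
  { key: 1011357, status: "ACCEPTED" },
  { key: 900001, status: "SYNONYM" },
  { key: 900002, status: "DOUBTFUL" },
];

// Prefer the single accepted name among the homonyms. Return null when
// there is no unique accepted name, so the caller treats it as ambiguous.
function pickPreferred(candidates) {
  const accepted = candidates.filter((c) => c.status === "ACCEPTED");
  return accepted.length === 1 ? accepted[0] : null;
}

console.log(pickPreferred(homonyms).key); // prints 1011357
```

Under this rule every rare synonym added later leaves the result unchanged, which is the stability argued for above. The cost is that a bare "Siphonophora" keeps snapping to the millipede genus.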

I honestly see only 3 options right now:

@ahahn-gbif @timrobertson100 my favorite would be either of the latter 2 - should we do this? I could get this out for testing quickly.

timrobertson100 commented 1 year ago

I'm in favor of 2 as well, and the machine tags are ideally suited to this.

I'm not sure we expect the matching service to have any knowledge of datasets, nor do we want the performance hit of pulling dataset metadata. Perhaps the match should allow the caller to pass in an exclusionScope as a parameter, and the pipelines (which do know about the dataset being processed) pull the tags and add it? (you may have meant that)

mdoering commented 1 year ago

Yes, that was exactly what I was thinking. Pipelines would consult the registry to find exclusions (or default classifications) that are then passed to the matching
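A sketch of that division of labour, under several assumptions. The tag store is a plain object standing in for registry machine tags, the dataset key is invented, and the parameter is called `exclude` (the name the thread settles on further down). Passing several keys as a repeated parameter is also an assumption.

```javascript
// Hypothetical per-dataset configuration, standing in for machine tags
// that pipelines would read from the registry.
const datasetTags = {
  "plankton-dataset-1": { excludeTaxonKeys: [361] },
};

// Build the query for the matching service. The matcher never sees the
// dataset; it only receives the exclusions the caller passes in.
function buildMatchQuery(datasetKey, scientificName) {
  const params = new URLSearchParams({ verbose: "true", name: scientificName });
  const tags = datasetTags[datasetKey];
  for (const key of (tags && tags.excludeTaxonKeys) || []) {
    params.append("exclude", String(key));
  }
  return "/species/match2?" + params.toString();
}

console.log(buildMatchQuery("plankton-dataset-1", "Siphonophora"));
// prints /species/match2?verbose=true&name=Siphonophora&exclude=361
```

A dataset without tags produces the same query as today, so the matching service stays free of any dataset lookups.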

timrobertson100 commented 1 year ago

That seems like a reasonable approach to me. Let's go for that.

One alternative I can think of is the ability to inject custom interpretation. Imagine a machine tag allowing the registration of a function (e.g. JS using nashorn):

// Fill in the phylum only when the record does not supply one
if (occurrence.scientificName === 'Siphonophora' && occurrence.phylum == null) {
  occurrence.phylum = 'Cnidaria';
}

This would open the door to selectively fixing data and is something we've pondered before but discounted as a step too far and could be dangerous.

Archilegt commented 1 year ago

@timrobertson100, you would then need an additional tag along the lines of occurrence.phylum.value='inferred'

mdoering commented 1 year ago

The v2 matching now includes an exclude parameter that takes (higher) nub keys. Any potential matches against those will be removed from the results.

mdoering commented 1 year ago

@fmendezh @muttcg would it be simple to support the new exclude parameter in pipelines and use a new dataset tag to configure them per dataset?

mdoering commented 1 year ago

There is a first test version running on the new backbone:

This yields millipedes as usual if nothing but the name Siphonophora is given: http://backbonebuild-vh.gbif.org:9000/species/match2?verbose=true&name=Siphonophora

But you can exclude millipedes now: http://backbonebuild-vh.gbif.org:9000/species/match2?verbose=true&name=Siphonophora&exclude=361

... which unfortunately only snaps to Animalia, as there are still many other Siphonophora names in the nub :(
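One plausible way to picture the snap to Animalia: if several candidates survive the exclusion, the match can only be as specific as the lowest taxon they all share. The sketch below assumes that behaviour and uses illustrative classifications; it is not taken from the matching code.

```javascript
// Illustrative classifications (root first) of names left after exclusion.
const remaining = [
  ["Animalia", "Arthropoda", "Insecta", "Diptera"],
  ["Animalia", "Arthropoda", "Insecta", "Hemiptera"],
  ["Animalia", "Cnidaria", "Hydrozoa"],
];

// Return the lowest taxon shared by all classifications, or null if none.
function sharedHigherTaxon(classifications) {
  if (classifications.length === 0) return null;
  let shared = classifications[0];
  for (const path of classifications.slice(1)) {
    let i = 0;
    while (i < shared.length && i < path.length && shared[i] === path[i]) i++;
    shared = shared.slice(0, i); // keep only the common prefix
  }
  return shared.length > 0 ? shared[shared.length - 1] : null;
}

console.log(sharedHigherTaxon(remaining)); // prints Animalia
```

With the illustrative data, excluding the two insect orders as well would leave a single path and the result would become Hydrozoa.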