gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Taxonomic matching #934

Closed derek-mba closed 1 year ago

derek-mba commented 1 year ago

When a record is submitted via the IPT that contains a valid scientificNameID and a scientificName, the scientificNameID should be considered authoritative.

See https://discourse.gbif.org/t/millipedes-in-the-ocean/3991

The core of the problem here is that GBIF is using the scientificName instead of the scientificNameID (in this case an AphiaID). The latter should be definitive, and is correct on the MBA records. scientificName should only be used if scientificNameID is not present. It's true that, for some reason, our scientificName didn't match the scientificNameID, but OBIS harvests these same records and gets the classifications right. (I am a little surprised that EurOBIS, which has very stringent checking of taxonomy, had not rejected these records because the scientificName hadn't matched the AphiaID, but I can hardly blame them for our bad data!)
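The precedence requested here can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this comment, not GBIF's or OBIS's actual matching code: the `Record` class, `choose_match_key`, and the `id_is_resolvable` callback are hypothetical names, and falling back to the name when an identifier is present but cannot be resolved is an assumption the comment does not cover.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Record:
    """Hypothetical stand-in for the two Darwin Core terms under discussion."""
    scientific_name: Optional[str] = None
    scientific_name_id: Optional[str] = None


def choose_match_key(
    record: Record,
    id_is_resolvable: Callable[[str], bool],
) -> Optional[Tuple[str, str]]:
    """Return (term, value) naming the field to use for taxon matching.

    A resolvable scientificNameID is treated as authoritative.
    scientificName is used only when no usable identifier is present.
    """
    name_id = (record.scientific_name_id or "").strip()
    if name_id and id_is_resolvable(name_id):
        return ("scientificNameID", name_id)

    # Assumption: an unresolvable identifier falls back to the name.
    name = (record.scientific_name or "").strip()
    if name:
        return ("scientificName", name)

    # Neither field is usable, so there is nothing to match on.
    return None
```

Passing the resolver in as a callback keeps the sketch independent of any particular checklist service; a real pipeline would presumably consult the authority named in the identifier.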

As for GBIF "fixing" the data, please don't. We're always ready to fix our own once we know there's an issue. Perhaps some data providers do ignore flags, but if this had been brought to our attention earlier, we'd have fixed it (and have done now, though I'm not sure how soon the data will be republished).

rubenpp7 commented 1 year ago

Hi,

Thanks for highlighting this, Derek. In EurOBIS we have an internal check (soon to be part of our public QC tool http://rshiny.lifewatch.be/BioCheck/) that compares the AphiaID under scientificNameID with the value under scientificName. So we do consider it relevant to use scientificName, which carries the original identification, together with scientificNameID for this cross-check.
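A cross-check of this kind might look roughly like the Python sketch below. It is not the EurOBIS or BioCheck implementation: the `crosscheck` function, its result labels, and the `names_by_id` lookup table are hypothetical, and the comparison is a plain case-insensitive, whitespace-normalized string match. A real check would also need to handle authorship strings and synonyms.

```python
from typing import Mapping, Optional


def _normalize(name: str) -> str:
    """Collapse internal whitespace and ignore case for comparison."""
    return " ".join(name.split()).casefold()


def crosscheck(
    scientific_name: Optional[str],
    scientific_name_id: Optional[str],
    names_by_id: Mapping[str, str],
) -> str:
    """Compare a record's scientificName with the name registered for its ID.

    names_by_id is a hypothetical lookup from identifier to accepted name;
    in practice it would be backed by the taxonomic authority.
    Returns one of "incomplete", "unknown_id", "match", "mismatch".
    """
    if not scientific_name or not scientific_name_id:
        return "incomplete"

    expected = names_by_id.get(scientific_name_id.strip())
    if expected is None:
        return "unknown_id"

    if _normalize(expected) == _normalize(scientific_name):
        return "match"
    return "mismatch"
```

A "mismatch" result corresponds to the situation described in this thread, where the identifier and the name on the same record point at different taxa.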

This issue has, however, given me an idea for improving that taxonomy check by also including the higher classification.

Thank you!

bart-v commented 1 year ago

FYI: GBIF not using the scientificNameID is a known issue (https://github.com/gbif/pipelines/issues/217). And it's a shame: why are we using PIDs at all, then...

ymgan commented 1 year ago

@bart-v I agree. Please see #895 for a different concern that arises when scientificNameID is not interpreted. It can be confusing to data users, and our data provider was confused about why this happens when they have done their best to provide data with the utmost clarity.

timrobertson100 commented 1 year ago

I'll close this and link to the original issue, which already captures this: https://github.com/gbif/pipelines/issues/217

derek-mba commented 1 year ago

Please don't close issues as "completed", when they're not. This should have been "merged" into #217.

timrobertson100 commented 1 year ago

Sorry @derek-mba

GitHub doesn't have a merge option for issues, so I linked them and closed this only to try and keep the discussion together on the original issue. The alternative was to close this using the "won't fix" option.

I'll reopen this

timrobertson100 commented 1 year ago

With #217 closed with an implementation, I'll also close this again, as I don't think there is anything here that isn't covered in that thread. Please comment if I am mistaken.