gbif / portal-feedback

User feedback for the GBIF API, website and published data. You can ask questions here. 🗨❓

help needed to understand "Taxon match scientific name ID ignored" #5239

Closed ymgan closed 2 months ago

ymgan commented 3 months ago

Hello,

I am trying to understand why many records from this dataset got the flag Taxon match scientific name ID ignored.

Dataset: https://www.gbif.org/dataset/2945946f-8a97-41e4-952d-f0a9438b0f2e

Example occurrence with the flag Taxon match scientific name ID ignored: https://www.gbif.org/occurrence/4527121761

The techdocs explain Taxon match scientific name ID ignored as follows (quoted lines are from the techdocs, my comments follow each one):

> The scientificNameID was not used when mapping the record to the GBIF backbone. This may indicate one of:

> The ID uses a pattern not configured for use by GBIF

The scientificNameID is in the form of an LSID, which should be used by GBIF now (please see https://github.com/gbif/pipelines/issues/217).

> The ID did not uniquely identify a concept in the checklist

If I understand this correctly, it did uniquely identify a concept in the WoRMS dataset: https://api.gbif.org/v1/species?datasetKey=2d59e5db-57ad-41ff-97d6-11f5fb264527&sourceId=urn:lsid:marinespecies.org:taxname:231413

> The ID found a concept in the checklist that did not map to the backbone

It should be mapped:

WoRMS: https://www.gbif.org/species/155304034
GBIF backbone: https://www.gbif.org/species/2434814

> A different ID was used, or the record names were used, as no ID lookup successfully linked to the backbone.

No other ID is provided.

I thought that the records (such as the example) fulfill every requirement listed. Can you please help me understand why the records are still being flagged this way? Thank you so much!

CecSve commented 3 months ago

@muttcg or @fmendezh - I am not sure what is going on here. I cannot see why this record would get flagged. We have a similar issue on the help desk right now with the "scientific name ID not found" flag. I will create a separate issue and link to it here.

CecSve commented 3 months ago

Maybe a similar issue with WoRMS? https://github.com/gbif/portal-feedback/issues/5241

CecSve commented 3 months ago

There seems to be an issue with the WoRMS checklist published in GBIF (and CoL). @mdoering is investigating.

mdoering commented 3 months ago

The WoRMS DwC archive is not valid. I have fixed its meta.xml file locally and uploaded the change to ChecklistBank until WoRMS pushes a fix from their side.

CecSve commented 3 months ago

The bug has been fixed in the next WoRMS release (April 1st 2024).

CecSve commented 2 months ago

This issue is due to copyright restrictions in source datasets for WoRMS. It may not be the case for all records, but AlgaeBase is restricted and cannot be shared outside of WoRMS and this is why the IDs are not interpreted. I'll close the issue for now but let me know if it requires further follow up.

ymgan commented 2 months ago

Thanks for investigating @CecSve ! I am not sure I follow: the ID in this example is for a southern elephant seal, and the record on WoRMS has license = http://creativecommons.org/licenses/by/4.0/

From what I understood, the flag remains because the GBIF taxonomic backbone is not yet updated according to this comment.

Will southern elephant seal records continue to be affected, or will this only affect taxa that cannot be matched because of AlgaeBase in the future?

mdoering commented 2 months ago

The license is not the problem. We only have a "license" issue with AlgaeBase. Its names are included in WoRMS if you search their website, but WoRMS is not allowed to share them, so GBIF never sees these algae names.

As long as you can find the LSID in GBIF's copy of the WoRMS checklist it should get interpreted just fine: https://api.gbif.org/v1/species?datasetKey=2d59e5db-57ad-41ff-97d6-11f5fb264527&sourceId=urn:lsid:marinespecies.org:taxname:231413
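
For anyone who wants to repeat this check from a script, here is a minimal sketch of the same lookup. The helper names (`build_lookup_url`, `backbone_keys`, `fetch_lookup`) are made up for illustration, and it assumes the species API response is a paged object with a `results` list whose entries carry a `nubKey` when they are linked to the backbone.

```python
import json
from typing import Any, Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen

SPECIES_API = "https://api.gbif.org/v1/species"
# Dataset key of GBIF's copy of the WoRMS checklist, taken from the URL above.
WORMS_DATASET_KEY = "2d59e5db-57ad-41ff-97d6-11f5fb264527"


def build_lookup_url(source_id: str, dataset_key: str = WORMS_DATASET_KEY) -> str:
    """Build the species lookup URL for one scientificNameID (hypothetical helper)."""
    # safe=":" keeps the colons of the LSID readable, as in the URL above.
    query = urlencode({"datasetKey": dataset_key, "sourceId": source_id}, safe=":")
    return f"{SPECIES_API}?{query}"


def backbone_keys(response: Dict[str, Any]) -> List[int]:
    """Collect backbone keys from a parsed response.

    Assumption: matches are listed under "results" and a match that is
    linked to the backbone has a "nubKey" field.
    """
    return [r["nubKey"] for r in response.get("results", []) if "nubKey" in r]


def fetch_lookup(source_id: str) -> Dict[str, Any]:
    """Call the API and parse the JSON body (needs network access)."""
    with urlopen(build_lookup_url(source_id), timeout=30) as resp:
        return json.load(resp)
```

Calling `backbone_keys(fetch_lookup("urn:lsid:marinespecies.org:taxname:231413"))` and getting a non-empty list would suggest that the ID resolves in the checklist and links to the backbone. An empty list would point to one of the causes listed in the techdocs.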

The only explanation I can think of is that this is an outdated interpretation from the time when WoRMS had a broken DwC archive. But even that cannot be the reason: the dataset is from the 8th of April and we indexed WoRMS on the 3rd of April.

@fmendezh @muttcg species matching cache maybe?

timrobertson100 commented 2 months ago

I am looking at the cache - like @mdoering, I suspect it wasn't included in the previous WoRMS and we've cached a copy of the lookup response. I'll try and test that theory before flushing the cache.
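
To illustrate the theory, here is a toy sketch of how a cached miss can outlive a fix to the underlying checklist. It is not GBIF's actual cache: `LookupCache` and the in-memory `checklist` dict are hypothetical stand-ins, and the only assumption is a read-through cache that also remembers "not found" answers.

```python
from typing import Callable, Dict, Optional


class LookupCache:
    """Toy read-through cache that stores every answer, including misses."""

    def __init__(self, lookup: Callable[[str], Optional[int]]) -> None:
        self._lookup = lookup
        self._store: Dict[str, Optional[int]] = {}

    def get(self, name_id: str) -> Optional[int]:
        # Only ask the underlying lookup the first time an ID is seen.
        if name_id not in self._store:
            self._store[name_id] = self._lookup(name_id)
        return self._store[name_id]

    def flush(self) -> None:
        """Forget all cached answers so the next call hits the lookup again."""
        self._store.clear()


# Stand-in for a checklist index that does not yet contain the ID.
checklist: Dict[str, int] = {}
cache = LookupCache(checklist.get)
lsid = "urn:lsid:marinespecies.org:taxname:231413"

first = cache.get(lsid)    # miss: None is cached
checklist[lsid] = 2434814  # checklist re-indexed, the ID is now present
stale = cache.get(lsid)    # still None, served from the cache
cache.flush()
fresh = cache.get(lsid)    # 2434814 after the flush
```

In this sketch the record would keep its flag until the cache is flushed, even though the checklist already contains the ID.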

CecSve commented 2 months ago

> Thanks for investigating @CecSve ! I am not sure I follow: the ID in this example is for a southern elephant seal, and the record on WoRMS has license = http://creativecommons.org/licenses/by/4.0/
>
> From what I understood, the flag remains because the GBIF taxonomic backbone is not yet updated according to this comment.
>
> Will southern elephant seal records continue to be affected, or will this only affect taxa that cannot be matched because of AlgaeBase in the future?

Yes, sorry. I thought the two WoRMS issues were more related than they actually were. Please disregard my previous comment and hopefully the more techy people can find out what is going on.

ymgan commented 2 months ago

Thank you so much for the clarification and looking into this matter, I appreciate it!

I just wanted to make sure that the following is not an oversight:

The registry showed that the ingestion was finished: https://registry.gbif.org/dataset/2945946f-8a97-41e4-952d-f0a9438b0f2e/ingestion-history

but none of the dataset's occurrences seem to have been ingested 🤔 It shows 0 occurrences now: https://www.gbif.org/dataset/2945946f-8a97-41e4-952d-f0a9438b0f2e

Please ignore me if this is just a temporary state because of the ongoing work and thanks again for looking into this! Have a great weekend!

timrobertson100 commented 2 months ago

Thanks @ymgan

I can confirm that an older lookup was indeed the cause.

We currently have a separate issue with the Elastic index, which is being investigated. The 0 records will be addressed shortly.

timrobertson100 commented 2 months ago

I think we've diagnosed the cause and fixed the dataset so will close this.

The issue opened above is what we need to do to prevent this from reoccurring.

Thanks @ymgan - have a great weekend too