DwC field scientificNameID is not used at all

bart-v commented 6 years ago

This issue was moved from portal-feedback to pipelines

Example https://www.gbif-uat.org/occurrence/search?dataset_key=740cf4e0-37ca-4389-ba8f-4e1bc5177893&taxon_key=5401803

Lists the records as Oligochaeta and appends the authority "K.Koch" just like that. That makes these marine occurrences terrestrial plants...

While a scientificNameID urn:lsid:marinespecies.org:taxname:2036 is provided, that can be resolved to the animal class Oligochaeta.

This is a missed chance to fix homonyms in an easy way...

mdoering commented 1 year ago

> I agree that we should have a "scientific name and identifier mismatch". In case of conflict, it would probably make sense to privilege the scientific name over the ID because it would make it more transparent to users. A scientific name is human readable and doesn't require checking an external source.

I would argue the other way around and prefer the identifier, as this is less of an interpretation from the GBIF side and entirely in the hands of the publisher.

Mesibov commented 1 year ago

@MattBlissett , please look at https://www.gbif.org/occurrence/4399021367 (from a dataset I happened to be checking today).

The publisher gives "Sterna anaethetus" and ID "urn:lsid:marinespecies.org:taxname:212605". The accepted name is "Onychoprion anaethetus" with ID "urn:lsid:marinespecies.org:taxname:567792". GBIF has interpreted (added) the genus as "Onychoprion" and interpreted (added) the taxonomicStatus as "Synonym", but has not changed either the original scientificName or the original scientificNameID.

(1) How would this record look if GBIF preferenced ID over name? (2) What would happen if the publisher had "corrected" the ID but not the name, i.e. gave "Sterna anaethetus" and "urn:lsid:marinespecies.org:taxname:567792"?

timrobertson100 commented 1 year ago

Thank you all for guiding this work.

> are you also proposing that GBIF track all the scientificNameID sources and compare the one selected for a record with the name, or just WoRMS APHIA IDs?

Not quite. I suggest we use configuration to enable certain patterns e.g. urn:lsid:marinespecies.org:* mapped to WoRMS. I think it would be wise to do an impact assessment like we did above, before enabling others. I anticipate IPNI, DynTaxa, IndexFungorum would be good candidates though.
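A minimal sketch of the pattern-based configuration suggested here, assuming a simple glob syntax. The `ID_PATTERNS` mapping and the `checklist_for` helper are hypothetical names for illustration, not GBIF's actual configuration format; only the WoRMS pattern comes from this thread.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Hypothetical configuration: identifier pattern -> checklist label.
# Only the WoRMS pattern is taken from the discussion; others would be
# added after an impact assessment.
ID_PATTERNS = {
    "urn:lsid:marinespecies.org:*": "WoRMS",
}


def checklist_for(identifier: str) -> Optional[str]:
    """Return the configured checklist for an identifier, or None if no pattern matches."""
    cleaned = identifier.strip()
    for pattern, checklist in ID_PATTERNS.items():
        if fnmatchcase(cleaned, pattern):
            return checklist
    return None


print(checklist_for("urn:lsid:marinespecies.org:taxname:2036"))  # WoRMS
print(checklist_for("urn:lsid:ipni.org:names:12345-1"))  # None (pattern not enabled)
```

An identifier that matches no configured pattern would fall back to name-based matching, which is the case the proposed "ignored" flag is meant to expose.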

> what will happen to the original scientificName and scientificNameID entries in a record where GBIF detects a conflict? Will both continue to appear in the interpreted record?

Like @mdoering I think I would lean towards the values found using the scientificNameID since it's less ambiguous and puts the responsibility and control on the publisher - especially if we provide sensible flags that make it easy to detect and fix. The verbatim values are always available of course.

Taking into consideration everyone's comments, I think we would need the following flags to be transparent and to make it easy to locate problematic records (edited to accommodate suggestions from @MattBlissett and @ymgan below).

| Issue | Description |
| --- | --- |
| `SCIENTIFIC_NAME_ID_IGNORED` | The `scientificNameID` uses a pattern that is not configured in GBIF. The backbone lookup was performed using the names on the record and the `scientificNameID` is nullified in the interpreted record. |
| `TAXON_CONCEPT_ID_IGNORED` | The `taxonConceptID` uses a pattern that is not configured in GBIF. The backbone lookup was performed using the names on the record and the `taxonConceptID` is nullified in the interpreted record. |
| `SCIENTIFIC_NAME_ID_NOT_FOUND` | The `scientificNameID` matched a known pattern, but it was not found in the associated checklist. The backbone lookup was performed using the names on the record ignoring the ID and the `scientificNameID` is nullified in the interpreted record. This may indicate a poorly formatted identifier or may be caused by a newly created ID that isn't yet known in the version of the published checklist. |
| `TAXON_CONCEPT_ID_NOT_FOUND` | The `taxonConceptID` matched a known pattern, but it was not found in the associated checklist. The backbone lookup was performed using the names on the record ignoring the ID and the `taxonConceptID` is nullified in the interpreted record. This may indicate a poorly formatted identifier or may be caused by a newly created ID that isn't yet known in the version of the published checklist. |
| `SCIENTIFIC_NAME_AND_ID_INCONSISTENT` | The `scientificName` provided in the occurrence record does not precisely match the name in the registered checklist when using the `scientificNameID` or `taxonConceptID` to look it up. Publishers are advised to check the ID is correct, or update the formatting of the names on their records. |
| `TAXON_MATCH_NAME_AND_ID_AMBIGUOUS` | The GBIF Backbone concept was found using the `scientificNameID` or `taxonConceptID` but it differs from what would have been found if the classification names on the record were used. This may indicate a gap in the GBIF backbone, a poor mapping between the checklist and the backbone, or a mismatch between the classification names and the declared IDs (`scientificNameID` or `taxonConceptID`) on the occurrence record itself. |
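The decision order implied by these flag descriptions can be sketched roughly as follows, for `scientificNameID` only. This is an illustration under stated assumptions, not the pipelines implementation: `flags_for_scientific_name_id`, `is_configured` and `checklist_name` are hypothetical names, the toy checklist holds only the two WoRMS entries quoted earlier in this thread, and `TAXON_MATCH_NAME_AND_ID_AMBIGUOUS` is left out because it needs a backbone lookup.

```python
from typing import Callable, Optional, Set


def flags_for_scientific_name_id(
    name: str,
    name_id: str,
    is_configured: Callable[[str], bool],
    checklist_name: Callable[[str], Optional[str]],
) -> Set[str]:
    """Return the issue flags for one record, following the order in the table."""
    if not is_configured(name_id):
        # Unknown pattern: the ID is not used at all.
        return {"SCIENTIFIC_NAME_ID_IGNORED"}
    found = checklist_name(name_id)
    if found is None:
        # Known pattern, but the checklist has no such ID.
        return {"SCIENTIFIC_NAME_ID_NOT_FOUND"}
    if found != name:
        # Exact string comparison, as in "does not precisely match".
        return {"SCIENTIFIC_NAME_AND_ID_INCONSISTENT"}
    return set()


# Toy stand-ins for the configuration and the checklist (assumptions).
TOY_CHECKLIST = {
    "urn:lsid:marinespecies.org:taxname:212605": "Sterna anaethetus",
    "urn:lsid:marinespecies.org:taxname:567792": "Onychoprion anaethetus",
}


def toy_is_configured(name_id: str) -> bool:
    return name_id.startswith("urn:lsid:marinespecies.org:")


flags = flags_for_scientific_name_id(
    "Sterna anaethetus",
    "urn:lsid:marinespecies.org:taxname:567792",
    toy_is_configured,
    TOY_CHECKLIST.get,
)
print(flags)  # {'SCIENTIFIC_NAME_AND_ID_INCONSISTENT'}
```

The example is the "corrected ID but not the name" case raised earlier: the ID resolves, but to a different name than the one on the record.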

Please keep the feedback coming - especially if you disagree. Thank you.

Mesibov commented 1 year ago

@timrobertson100, so the answer to my first question is "No different"?

mdoering commented 1 year ago

We would look up the WoRMS ID 212605 in the checklist published to GBIF: https://www.gbif.org/species/155305680/verbatim

This checklist (like all others) is matched to the backbone and we can take that mapping to find the matching backbone entry for that name, which would be the nubKey property in this API response: https://api.gbif.org/v1/species/155305680

In this case the mapping is missing for unknown reasons. I will investigate, because the backbone does list the name Sterna anaethetus Scopoli, 1786, sourced from ITIS and regarded as a synonym of Onychoprion anaethetus subsp. anaethetus (Scopoli, 1786).

The point is that we will only link via name and use the name as it is classified in the backbone, not in the source.

If the publisher changed the identifier to that of the accepted name, the occurrence would be listed as Onychoprion anaethetus instead, and if the given name remained Sterna anaethetus, a mismatch flag would be raised.

So regardless of whether the occurrence identification is given as name strings or identifiers, we must still match it to our backbone. The only exception would be if the name/taxon identifier were given as a backbone key directly; then we would not have to do any matching at all. We do regard the checklist matching as a little safer than plain name matching, though.
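The checklist-to-backbone step described in this comment can be sketched as below, assuming the species API returns a JSON object whose optional `nubKey` property holds the backbone key, as stated above. The helper names are hypothetical, and the sample dictionaries are trimmed, made-up stand-ins rather than real API responses.

```python
import json
from typing import Optional
from urllib.request import urlopen

SPECIES_API = "https://api.gbif.org/v1/species/{key}"


def fetch_usage(key: int) -> dict:
    """Fetch a checklist name usage from the GBIF species API (needs network access)."""
    with urlopen(SPECIES_API.format(key=key)) as response:
        return json.load(response)


def backbone_key(usage: dict) -> Optional[int]:
    """Return the backbone key (nubKey) of a name usage, or None if it is unmatched."""
    return usage.get("nubKey")


# Made-up examples: one usage mapped to the backbone, one without a mapping.
mapped = {"key": 1, "scientificName": "Aus bus", "nubKey": 42}
unmapped = {"key": 2, "scientificName": "Aus cus"}

print(backbone_key(mapped))    # 42
print(backbone_key(unmapped))  # None, so fall back to matching by name
```

When `backbone_key` returns `None`, as reported for the Sterna anaethetus usage, the ID cannot be used and the record would have to be matched on its names instead.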

MattBlissett commented 1 year ago

> The point is that we will only link via name and use the name as it is classified in the backbone, not in the source.

Or in other words, the "Interpreted" name is the name according to the GBIF backbone, whether we found it by matching strings or looking up an identifier.

Bob's other point:

> GBIF has interpreted (added) the genus as "Onychoprion" and interpreted (added) the taxonomicStatus as "Synonym", but has not changed either the original scientificName or the original scientificNameID.

Should we be blanking the interpreted scientificNameID in the situations where it is not used? i.e. for SCIENTIFIC_NAME_ID_IGNORED and SCIENTIFIC_NAME_ID_NOT_FOUND? It remains, of course, in the "Original"/verbatim data.

Edited to add (Tim Robertson): These suggestions have been included in the issue descriptions above

Mesibov commented 1 year ago

@timrobertson100 and @mdoering , leaving aside the interesting question of why WoRMS should get priority in this proposed change, please note that the OBIS manual (https://manual.obis.org/darwin_core.html#taxonomy-and-identification) says

"scientificName (required term) should always contain the originally recorded scientific name, even if the name is currently a synomym [sic]. This is necessary to be able to track back records to the original dataset... A WoRMS LSID should be added in scientificNameID (required term), OBIS will use this identifier to pull the taxonomic information from the World Register of Marine Species (WoRMS) into OBIS, such as the taxonomic classification and the accepted name in case of invalid names or synonyms."

I read that as saying that OBIS contributors can use the original scientificName and the accepted scientificNameID, which guarantees mismatches. Maybe @albenson-usgs could comment?

mdoering commented 1 year ago

> I read that as saying that OBIS contributors can use the original scientificName and the accepted scientificNameID, which guarantees mismatches. Maybe @albenson-usgs could comment?

I read it rather as: stick with the original name, even if it is now considered a synonym, and do the same for the identifier:

> "OBIS will use this identifier to pull the taxonomic information from WoRMS into OBIS, such as the accepted name in case of invalid names or synonyms."

Mesibov commented 1 year ago

@mdoering, OK, so an OBIS user enters the original name and the LSID for that original name, and if it's a synonym, then OBIS replaces the original name and ID with the accepted name and ID? Which of these goes into the OBIS records shared with GBIF: the original name + its ID, or the accepted name + its ID?

ymgan commented 1 year ago

> I read that as saying that OBIS contributors can use the original scientificName and the accepted scientificNameID, which guarantees mismatches.

Thank you Bob! OBIS nodes use the scientificNameID of the original scientificName if the name provided in the record is a synonym. For example:

scientificName: Diogenichthye
scientificNameID: urn:lsid:marinespecies.org:taxname:397410

We are specifically asked not to change the scientificName to its accepted name and/or use the LSID of the accepted name, as shown below:

scientificName: Diogenichthys
scientificNameID: urn:lsid:marinespecies.org:taxname:125820

Mesibov commented 1 year ago

@mdoering, does the GBIF backbone perfectly follow the checklists (allowing for update delays) in all cases, so that matching is done with names and classifications in checklists, and the checklists amount to subsets of the backbone?

Mesibov commented 1 year ago

@ymgan, which of these goes into the OBIS records shared with GBIF: the original name + its ID, or the accepted name + its ID?

ymgan commented 1 year ago

> which of these goes into the OBIS records shared with GBIF - the original name + its ID, or the accepted name + its ID?

The former goes to GBIF and OBIS: original name + its ID

scientificName: Diogenichthye
scientificNameID: urn:lsid:marinespecies.org:taxname:397410

mdoering commented 1 year ago

> @mdoering, does the GBIF backbone perfectly follow the checklists (allowing for update delays) in all cases, so that matching is done with names and classifications in checklists, and the checklists amount to subsets of the backbone?

The backbone is currently only updated twice per year using most of the important lists, but not everything there is out there. WoRMS, ITIS, IPNI & ZooBank are included for example. That means there clearly are names not included in the backbone, especially all newly described ones.

The lists that are candidates for name identifiers are updated at various frequencies defined by their publishers. Whenever a new version of a list is imported, we match it to the backbone during the import process.

Mesibov commented 1 year ago

Thank you, @ymgan. Does OBIS share with GBIF like this?

scientificName = original name
scientificNameID = original name's ID
acceptedNameUsage = accepted name
acceptedNameUsageID = accepted name's ID

That seems very clear to me. I'm finding the proposed changes to GBIF's interpretation protocol confusing, unless GBIF would do

scientificName = interpreted name
scientificNameID = interpreted name's ID
verbatimScientificName (not DwC) = original name
verbatimScientificNameID (also not DwC) = original name's ID

Mesibov commented 1 year ago

Thank you, @mdoering.

ymgan commented 1 year ago

Thank you @Mesibov for these interesting questions!

> Does OBIS share with GBIF like this?
>
> scientificName = original name
> scientificNameID = original name's ID
> acceptedNameUsage = accepted name
> acceptedNameUsageID = accepted name's ID

According to my understanding, no - because the accepted name and accepted name's ID may become unaccepted in the future. So we are trained/asked to provide scientificName, scientificNameID along with scientificNameAuthorship, kingdom, taxonRank, taxonRemarks of the original name.

Mesibov commented 1 year ago

Thank you, @ymgan. So the looking-up of the original name's ID in WoRMS is done twice, independently. OBIS does it (according to the Manual) for its own purposes but doesn't share the result. GBIF does it and may change what it receives from OBIS (in the case of unaccepted names) but does not make clear what's happened (see the "Sterna anaethetus" example, above).

timrobertson100 commented 1 year ago

> and may change what it receives from OBIS

It may just be the phrasing, but there may be a misunderstanding to clarify. A data publisher publishes a dataset, which is registered in the GBIF registry and linked to the OBIS network. Both GBIF and OBIS ingest the same dataset from the source (generally a GBIF IPT) and process it for the services each infrastructure provides.

Mesibov commented 1 year ago

@timrobertson100, thank you for that clarification. I misinterpreted the source of the record I cited above. The publisher (https://www.gbif.org/publisher/5fa89f68-9af0-4a0d-8998-ea39695c1db9) is "CSIRO NCMI IDC / OBIS Australia", so I assumed that "OBIS Australia", a node in the OBIS network, was co-publisher.

Mesibov commented 1 year ago

OBIS provides useful information on the fields it provides for its own downloads: https://obis.org/data/access/.

scientificName is derived from the provided scientificNameID, but there is also an originalScientificName field for the name as provided. There is an APHIA ID field with "the valid name based on the scientificNameID or derived by matching the provided scientificName with WoRMS".

I'm not sure I follow the bit after "or". I think it might apply in cases where (contrary to the requirement) only a name was provided and not an ID. There is no originalScientificNameID field.

OBIS also checks name and ID fields for quality: https://github.com/iobis/obis-qc. OBIS checks

- to ensure that scientificNameID is a valid WoRMS ID
- to ensure that scientificName and scientificNameID match in WoRMS

So yes, the same tests would be done twice, independently, for datasets published through OBIS and through GBIF, if GBIF goes ahead with preferencing ID to name in its lookups.
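The first of the two checks listed above (is the scientificNameID a well-formed WoRMS identifier?) can be approximated with a format test like the one below. This is a sketch based only on the LSID shape seen in this thread; it is not taken from obis-qc, and the helper name `aphia_id_from_lsid` is hypothetical. A well-formed LSID can still point at the wrong taxon or at nothing, so the second check needs a lookup in WoRMS itself.

```python
import re
from typing import Optional

# LSID shape as it appears in this thread: urn:lsid:marinespecies.org:taxname:<digits>
WORMS_LSID = re.compile(r"urn:lsid:marinespecies\.org:taxname:(\d+)")


def aphia_id_from_lsid(scientific_name_id: str) -> Optional[int]:
    """Return the numeric part of a well-formed WoRMS LSID, or None otherwise."""
    match = WORMS_LSID.fullmatch(scientific_name_id.strip())
    return int(match.group(1)) if match else None


print(aphia_id_from_lsid("urn:lsid:marinespecies.org:taxname:212605"))  # 212605
print(aphia_id_from_lsid("marinespecies.org:212605"))  # None
```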

ManonGros commented 1 year ago

I think the proposed flags https://github.com/gbif/pipelines/issues/217#issuecomment-1696930580 are sensible. I can see the point of prioritising the IDs for interpretation (in the cases where they don't match the name). If we get too many confused users, we should consider changing the behaviour or rolling back on the changes.

ymgan commented 1 year ago

Thank you so much!! I agree with @ManonGros

I think the proposed flags https://github.com/gbif/pipelines/issues/217#issuecomment-1696930580 are sensible. Maybe it would become clear when we see how these flags apply to the examples encountered.

Just throwing out an idea: I am wondering if it makes sense for the flags to follow the vocab from BDQ TG2 https://github.com/tdwg/bdq/issues/152#issue-354943638 ? For example:

SCIENTIFIC_NAME_AND_ID_INCONSISTENT
TAXON_MATCH_NAME_AND_ID_AMBIGUOUS

Do these capture the same meaning?

On the other hand, if there are things worth mentioning in the OBIS manual that could help to prevent certain issues identified in the test run (for future data), it would be great if someone could create an issue at https://github.com/iobis/manual (I hope this doesn't sidetrack the conversation).

Thank you so much again!

timrobertson100 commented 1 year ago

Thanks @ymgan - That makes good sense. I'll adjust the labels above accordingly.

albenson-usgs commented 1 year ago

Morning all! Just catching up 😊

> Like @mdoering I think I would lean towards the values found using the scientificNameID since it's less ambiguous and puts the responsibility and control on the publisher - especially if we provide sensible flags that make it easy to detect and fix. The verbatim values are always available of course.

Yes I agree with this completely. It sounds like that is the conclusion that was reached but wanted to voice my support. It is not up to the aggregators to correct issues that might arise but instead to flag them for publishers to address.

I also agree that the proposed flags https://github.com/gbif/pipelines/issues/217#issuecomment-1696930580 are sensible. I don't have any changes to suggest or ones to add at this time.

I do think it's worth considering that while what Ming states is the current best practice ("OBIS nodes use the scientificNameID of the original scientificName if the name provided in the record is a synonym") what Bob provided was the original instruction ("scientificName (required term) should always contain the originally recorded scientific name, even if the name is currently a synomym [sic]. This is necessary to be able to track back records to the original dataset... A WoRMS LSID should be added in scientificNameID (required term)") and therefore there will be datasets following that practice. Note that verbatimIdentification is a relatively new term in Darwin Core and so the process Bob mentioned was the only way to keep the original name as it was provided when datasets were matched to WoRMS during OBIS processing. Now we have improved options but there will be datasets that had to follow that original instruction and it may not be possible to update them easily. I think the SCIENTIFIC_NAME_AND_ID_INCONSISTENT flag will be appropriately applied so I think there is nothing we need to change for this. I just want us to be aware.

Going back to this issue (https://github.com/gbif/pipelines/issues/217#issuecomment-1680684346) Tim identified at the beginning, I would understand this would get a SCIENTIFIC_NAME_AND_ID_INCONSISTENT flag. I think what might be difficult for nodes, and I'm not sure what we could do to help them, is that flag won't help them know that the fuzzy match has led to an unexpected result. Perhaps this is where the data providers must come in to check these.

Finally I do make use of verbatimIdentification so I wouldn't advise that being used as Bob has described in https://github.com/gbif/pipelines/issues/217#issuecomment-1697021413. As an example here is a species lookup table I've been working with recently where I needed to make use of verbatimIdentification. sciname_crosswalk.csv

Mesibov commented 1 year ago

@albenson-usgs, many thanks for your comments.

Please note that verbatimIdentification is not the same as "verbatimScientificName". vI (https://dwc.tdwg.org/terms/#dwc:verbatimIdentification) is a place for informal names, guesses, vernacular names etc as well as formal scientific names and "is meant to be used in addition to dwc:scientificName (and dwc:identificationQualifier etc.), not instead of it".

vI is a handy field for data checkers because it allows us to say to compilers: "coral sp. 1a is not appropriate for scientificName. Please put coral sp. 1a in verbatimIdentification and put a formal scientific name for the coral taxon in scientificName".

A record could therefore have

- an original identification (in verbatimIdentification)
- an original scientificName
- an original scientificNameID
- an interpreted scientificName
- an interpreted scientificNameID

I suggest that to avoid confusion, all 5 fields should be present in the record made available to data end-users.

timrobertson100 commented 1 year ago

Thank you all for contributing so actively to this thread.

I'll now implement those flags, prepare a configuration that handles the WoRMS LSIDs, and process all datasets using them. I am sure we'll refine this again, but getting GBIF and OBIS more closely aligned should be a good start.

I have a remaining question: how strict should the scientificName comparison be when flagging differences?

Consider e.g. a record with Aus bus in the name, and an LSID that returns Aus bus L. 1771. Should I flag anything not exactly the same, or should I parse both sides and compare the resulting canonical? I'm tempted to suggest the latter as a start to avoid too many nuisance flags, perhaps making it stricter in the future. Thoughts?

Mesibov commented 1 year ago

@timrobertson100, please report back here or in the GBIF forum/data blog with the flag tallies you find for records with WoRMS IDs.

DwC scientificName's recommendation is "with authorship and date information if known". Authorship and authorship/date are missing in a large proportion of datasets I see or are put in scientificNameAuthorship. Please parse to canonical. I'm assuming your parser will deal properly with "subsp/subsp./ssp/ssp...." etc.
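To make the "parse both sides and compare the canonical" option concrete, here is a deliberately naive sketch. It is not the GBIF name parser: `naive_canonical` is a hypothetical helper that keeps the genus and any following lowercase epithets, skips a few rank markers, and stops at the first token that looks like authorship. It will mishandle cases such as lowercase author particles ("de", "van") and hybrids.

```python
import re

# A few infraspecific rank markers, with and without the trailing dot.
RANK_MARKERS = {"subsp.", "subsp", "ssp.", "ssp", "var.", "var", "f.", "forma"}


def naive_canonical(name: str) -> str:
    """Strip authorship, year and rank markers using simple token rules."""
    tokens = name.split()
    if not tokens:
        return ""
    parts = [tokens[0]]  # genus or uninomial
    for token in tokens[1:]:
        if token.lower() in RANK_MARKERS:
            continue
        if re.fullmatch(r"[a-z][a-z-]*", token):
            parts.append(token)  # species or infraspecific epithet
        else:
            break  # first token that looks like authorship or a year
    return " ".join(parts)


def same_canonical(provided: str, from_checklist: str) -> bool:
    """Compare two names after reducing both to their canonical form."""
    return naive_canonical(provided) == naive_canonical(from_checklist)


print(same_canonical("Aus bus", "Aus bus L. 1771"))  # True
print(same_canonical("Coscinoderma matthewsi", "Coscinoderma mathewsi (Lendenfeld, 1886)"))  # False
```

Under this rule a record would only be flagged when the names still differ after authorship and rank markers are removed, which should reduce nuisance flags for datasets that omit authorship.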

timrobertson100 commented 1 year ago

Thanks, @Mesibov - that was my intuition too.

Mesibov commented 1 year ago

@mdoering, don't you think that comparing interpreted values will risk piling one error on top of another? Interpreted values are sometimes wrong, and in any case are not the responsibility of the data publisher.

pieterprovoost commented 1 year ago

Sorry, I have been away for a while.

> So the looking-up of the original name's ID in WoRMS is done twice, independently. OBIS does it (according to the Manual) for its own purposes but doesn't share the result.

The results are shared. OBIS currently performs lookup of the provided scientificNameID in WoRMS (usually a WoRMS LSID but can also be a BOLD or NCBI ID), or matches the scientific name using the WoRMS API in case the ID is missing or invalid. We then replace the full taxonomy with the taxonomy of the accepted name. So the user sees:

- AphiaID: interpreted WoRMS ID for the accepted name
- scientificName and higher ranks: accepted names according to WoRMS
- originalScientificName: scientificName as provided
- scientificNameID: scientificNameID as provided

For consistency we should probably rename AphiaID to scientificNameID and scientificNameID to originalScientificNameID (or some other form indicating verbatim value).
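As a rough illustration of the four fields described above, the sketch below builds the user-facing view from a provided record and the result of a WoRMS lookup. The function name `obis_style_view` and the shape of the `accepted` dictionary are assumptions for illustration, not OBIS code; higher ranks are omitted. The values are the Sterna anaethetus example from earlier in the thread.

```python
from typing import Dict, Union

Value = Union[str, int]


def obis_style_view(provided: Dict[str, str], accepted: Dict[str, Value]) -> Dict[str, Value]:
    """Combine provided values with the accepted name found in WoRMS."""
    return {
        "AphiaID": accepted["AphiaID"],                       # interpreted ID of the accepted name
        "scientificName": accepted["scientificName"],         # accepted name according to WoRMS
        "originalScientificName": provided["scientificName"], # name as provided
        "scientificNameID": provided["scientificNameID"],     # ID as provided
    }


view = obis_style_view(
    {
        "scientificName": "Sterna anaethetus",
        "scientificNameID": "urn:lsid:marinespecies.org:taxname:212605",
    },
    {"AphiaID": 567792, "scientificName": "Onychoprion anaethetus"},
)
print(view["scientificName"], "<-", view["originalScientificName"])
```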

> I read that as saying that OBIS contributors can use the original scientificName and the accepted scientificNameID, which guarantees mismatches.

No, if the name is a synonym we recommend providing the ID for the synonym.

> Consider e.g. a record with Aus bus in the name, and an LSID that returns Aus bus L. 1771. Should I flag anything not exactly the same, or should I parse both sides and compare the resulting canonical? I'm tempted to suggest the latter as a start to avoid too many nuisance flags, perhaps making it stricter in the future. Thoughts?

Yes, please use the canonical. The OBIS recommendation is still to not provide the authorship (despite the DwC definition).

https://github.com/gbif/pipelines/issues/217#issuecomment-1696930580

The flags make sense, I'll see if we can implement them as well.

Mesibov commented 1 year ago

@pieterprovoost, many thanks for participating, and I apologise for my use of "sharing". At the time I thought OBIS shared its processing results with GBIF. It doesn't. Publishers independently publish to OBIS and to GBIF, so the processing of scientificNameID happens in parallel. This comment deals with that.

@timrobertson100 generated a file that has taxonomic details for records with scientificNameID populated with WoRMS LSIDs. These are the ca 19000 cases in which there would be a disagreement in names if GBIF looked up the scientificNameID in WoRMS and compared it to the provided (original) scientificName.

The file contains numerous instances in which well-formed, formal taxonomic names have more than one scientificNameID. Here's one example. It's "Coscinoderma matthewsi", a misspelling of Coscinoderma mathewsi (Lendenfeld, 1886), urn:lsid:marinespecies.org:taxname:165090.

This looks to me like an incremental fill-down error in a spreadsheet. The data compiler entered the correct LSID ("...165090"), then filled down incrementally over the next 17 records.
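A suspected fill-down error of this kind can be screened for mechanically: look for consecutive rows that keep the same scientificName while the numeric tail of scientificNameID rises by exactly one. The sketch below is a hypothetical check, not an existing GBIF or OBIS test; `fill_down_runs` and `min_length` are illustrative names. In the sample rows, only the first two IDs appear in this thread and the rest are an assumed continuation.

```python
import re
from typing import List, Optional, Tuple


def trailing_number(identifier: str) -> Optional[int]:
    """Return the digits at the end of an identifier, if any."""
    match = re.search(r"(\d+)$", identifier)
    return int(match.group(1)) if match else None


def fill_down_runs(rows: List[Tuple[str, str]], min_length: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) row indexes of runs with one name and IDs rising by 1."""
    runs = []
    start = 0
    for i in range(1, len(rows) + 1):
        continues = False
        if i < len(rows) and rows[i][0] == rows[i - 1][0]:
            prev_id = trailing_number(rows[i - 1][1])
            this_id = trailing_number(rows[i][1])
            continues = prev_id is not None and this_id == prev_id + 1
        if not continues:
            if i - start >= min_length:
                runs.append((start, i - 1))
            start = i
    return runs


prefix = "urn:lsid:marinespecies.org:taxname:"
rows = [("Coscinoderma matthewsi", prefix + str(165090 + k)) for k in range(4)]
rows.append(("Aus bus", prefix + "1"))
print(fill_down_runs(rows))  # [(0, 3)]
```

A run found this way is only a hint for the publisher to review, since genuinely different taxa can have adjacent IDs; requiring an identical name across the run keeps false alarms down.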

Because @timrobertson100 did not include a record ID I can't track the records back to GBIF or OBIS. However, the OBIS page for this sponge species says there were no invalid scientificNameIDs and no dropped records. Can you explain what is likely to have happened when the above records with incorrect scientificNameIDs were processed by OBIS? Did OBIS process each record according to its scientificNameID, thus incorrectly assigning the record to the wrong species?

Note that GBIF processed the records according to scientificName and correctly assigned all 17 records to Coscinoderma mathewsi (Lendenfeld, 1886).

Mesibov commented 1 year ago

@pieterprovoost, I found the dataset and the "Coscinoderma matthewsi" occurrences in GBIF. The dataset is Vulnerable marine ecosystems in the South Pacific Ocean region and was published by New Zealand's NIWA. Example occurrence here. The OBIS link.

Mesibov commented 1 year ago

@pieterprovoost, I've answered my own question: OBIS processed "Coscinoderma matthewsi" as Coscinoderma pesleonis based on the incorrect scientificNameID "...165091". The OBIS record id is d47e860b-eff5-42e2-9534-6d24a4810767 from the dataset 6c813a6c-86f7-4d45-beb5-33eebd8de938. How will OBIS correct existing errors of this kind and how will they be treated in future?

pieterprovoost commented 1 year ago

@Mesibov I'll have to run the check and report back to our data providers to get this fixed. I'll also add the flags in our QC procedures to prevent this in the future.

timrobertson100 commented 1 year ago

Thanks to all for guidance on this.

GBIF.org now processes scientificNameID, taxonID and taxonConceptID for configured identifier schemes. The following flags have been added to help publishers and consumers understand how the identifiers have been used and where ambiguities may be detected.

| Issue | Description |
| --- | --- |
| `TAXON_MATCH_SCIENTIFIC_NAME_ID_IGNORED`, `TAXON_MATCH_TAXON_CONCEPT_ID_IGNORED`, `TAXON_MATCH_TAXON_ID_IGNORED` | The …ID was not used when mapping the record to the GBIF backbone. This may indicate one of: (1) the ID uses a pattern not configured for use by GBIF; (2) the ID did not uniquely(!) identify a concept in the checklist; (3) the ID found a concept in the checklist that did not map to the backbone; (4) a different ID was used, or the record names were used, as no ID lookup successfully linked to the backbone. |
| `SCIENTIFIC_NAME_ID_NOT_FOUND`, `TAXON_CONCEPT_ID_NOT_FOUND`, `TAXON_ID_NOT_FOUND` | The …ID matched a known pattern, but it was not found in the associated checklist. The backbone lookup was performed using either the names or a different ID field from the record. This may indicate a poorly formatted identifier or may be caused by a newly created ID that isn't yet known in the version of the published checklist. |
| `SCIENTIFIC_NAME_AND_ID_INCONSISTENT` | The scientificName provided in the occurrence record does not precisely match the name in the registered checklist when using the scientificNameID, taxonID or taxonConceptID to look it up. Publishers are advised to check the IDs are correct, or update the formatting of the names on their records. |
| `TAXON_MATCH_NAME_AND_ID_AMBIGUOUS` | The GBIF Backbone concept was found using the scientificNameID, taxonID or taxonConceptID, but it differs from what would have been found if the classification names on the record were used. This may indicate a gap in the GBIF backbone, a poor mapping between the checklist and the backbone, or a mismatch between the classification names and the declared IDs (scientificNameID or taxonConceptID) on the occurrence record itself. |

The GBIF.org site and API allow search using issues, such as this example.
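For the API side, a search by issue could be built as in the sketch below. It assumes the occurrence search endpoint accepts `issue`, `datasetKey` and `limit` as query parameters, which is my understanding of the public API rather than something stated in this thread; `issue_search_url` is a hypothetical helper, and the dataset key is the one from the first comment, used purely as an example value.

```python
from urllib.parse import urlencode

OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def issue_search_url(issue: str, dataset_key: str = "", limit: int = 20) -> str:
    """Build an occurrence search URL filtered by an interpretation issue."""
    params = {"issue": issue, "limit": limit}
    if dataset_key:
        params["datasetKey"] = dataset_key  # optional: restrict to one dataset
    return f"{OCCURRENCE_SEARCH}?{urlencode(params)}"


print(issue_search_url("SCIENTIFIC_NAME_AND_ID_INCONSISTENT"))
print(
    issue_search_url(
        "TAXON_MATCH_NAME_AND_ID_AMBIGUOUS",
        dataset_key="740cf4e0-37ca-4389-ba8f-4e1bc5177893",
        limit=5,
    )
)
```

Publishers could use such a URL to list the records in their own dataset that carry a given flag before correcting the source data.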

Initially the WoRMS LSIDs have been enabled to bring consistency to GBIF and OBIS processing and to work through any teething issues. Other candidates for future use could be the International Plant Name Index LSIDs, Swedish Dyntaxa LSIDs, Catalogue of Life Identifiers, Zoobank LSIDs and Index Fungorum LSIDs.

Because this issue has become very long, and the original request from @bart-v and OBIS team is implemented, this seems like a suitable point to try and close this issue. Please do open issues for specific improvement requests, bugs etc or in https://discourse.gbif.org/

bart-v commented 1 year ago

Excellent progress & impressive work. Much appreciated. Thanks a lot, @timrobertson100.

gbif / pipelines

DwC field scientificNameID is not used at all #217