gbif-norway / helpdesk

Please submit your helpdesk request here (or send an email to helpdesk@gbif.no). We will also use this repo for documentation of node helpdesk cases.
GNU General Public License v3.0

BIN becomes species name – and it’s not even a clear BIN? #148

Closed kkongshavn closed 9 months ago

kkongshavn commented 1 year ago

I got a question from the entomologists: there are six records showing up in Artskart of Anthocoris minki, which is not found in Norway. Following the breadcrumbs, it’s these six records in GBIF, where the scientific name is given as “BOLD:ACX9534 (cf. Anthocoris minki)”. This is coming from iBOL, and the BIN is in BOLD like this, with three different species names being used – so it’s by no means certain that A. minki is the correct one. What is going on here?

Is this to do with the taxonomic backbone of GBIF? Interpreting a BIN as a species based on one of the (in this case three) species names in a BIN seems like a very suboptimal solution 😬 https://www.gbif.org/species/10005321

@knutanders, Steffen Roth has been in touch with you about this - any thoughts?

edit to tag @rukayaj

rukayaj commented 1 year ago

I think it is happening in the mapping of the name to the backbone of GBIF. I'm not sure about this, but we can make an issue on their github and ask. Any idea why A. minki is in orange and the others are in yellow?

kkongshavn commented 1 year ago

> Any idea why A. minki is in orange and the others are in yellow?

No, I don't know why it is like that, sorry.

MichalTorma commented 1 year ago

I think the colors are just a representation of the likelihood that the given species is actually present... Therefore it means that A. minki is the least likely candidate for what's in the BIN. Though, it has to be said, the cf. in front of the name means it's an uncertain identification. But I agree this seems like an issue with GBIF's interpretation of BOLD, because if they want to show some name, it should at least be the most likely one, maybe?

ymgan commented 1 year ago

Oh, that confuses me too! Take this occurrence for example: https://www.gbif.org/occurrence/4038820847

If we go to the specimen record using its occurrenceID (http://bins.boldsystems.org/index.php/Public_RecordView?processid=ZMBN1592-19), the species name displayed under taxonomy is Anthocoris confusus, which is under the verbatimIdentification of the occurrence record from iBOL (https://www.gbif.org/occurrence/4038820847). Is this the name given by the taxonomist who identified the specimen?

MichalTorma commented 1 year ago

> Oh, that confuses me too! Take this occurrence for example: https://www.gbif.org/occurrence/4038820847
>
> If we go to the specimen record using its occurrenceID (http://bins.boldsystems.org/index.php/Public_RecordView?processid=ZMBN1592-19), the species name displayed under taxonomy is Anthocoris confusus, which is under the verbatimIdentification of the occurrence record from iBOL (https://www.gbif.org/occurrence/4038820847). Is this the name given by the taxonomist who identified the specimen?

I think so, but apparently GBIF gives precedence to some automated species name extraction from BOLD. I don't think that's the correct way, especially when there is a verbatimIdentification: that should be more important than a cf. identification based on the BIN.

ymgan commented 1 year ago

I see what you are saying @MichalTorma and I agree with you.

I feel that the way the data is presented in the iBOL dataset suggests that it is a DNA-derived occurrence (genetic material is the sole evidence of the occurrence) rather than an enriched occurrence, which feels a little misleading.

dagendresen commented 1 year ago

... without having a full overview of this thread (sorry if I'm jumping in at the deep end)! I believe that the actual material sample from which the DNA sequence in iBOL was derived would be (should have been) stored in a biobank/DNA bank at the museum in Bergen, and that the occurrence record published from iBOL (as in the example) correctly has only the DNA sequence as the actual evidence? That is, as seen from iBOL's point of view?

occurrenceKey 4038820847 references the identifier ZMBN1592-19, which I assume might be the catalogNumber (museum ID?) for the actual material sample, and this sample might be (should have been) stored in Bergen?

And the data record from iBOL (with the sequence as evidence, and with the species name inferred from the DNA sequence) should have been more strongly linked to the actual physical material sample at the museum in Bergen, with the species name as identified by Steffen Roth.

kkongshavn commented 1 year ago

An aside from the question about using BINs as taxonomy (which I believe is a really, really bad idea! BINs change all the time, and people send specimens that are halfway identified and see where they end up - so there is a lot of nonsense), about BOLD identifiers:

BOLD operates with a few different ways of identifying a record. The main ones are SampleID (assigned by the user) and ProcessID (assigned by BOLD in the format ProjectcodeRunningNumbers-year). People use them in different ways (yay). Steffen has a project named "ZMBN", which is also what we-the-marine-invertebrates use as our catalogue numbers, so this is a bit confusing.

In the case of these insects, they have been given a SampleID in the format HetNorNNN, e.g. HetNor196, whilst the ProcessID is ZMBN1644-20. Then I think the idea is to add museum catalogue numbers later; I will have to ask Steffen what they do there.

Typically for the marine collections, the SampleID would be ZMBN_123456 (ProcessID HABFA2141-22), and the catalogue number would then be ZMBN 123456. For the marine inverts we have started adding both SampleID and ProcessID in the "Voucher" field in musit, in hopes that this will (someday?) make it easier to link records in GBIF from us and from BOLD.

The DNA extract is (per today, and in most cases) stored at the iBOL facilities in Canada, the voucher specimen that the tissue-sample was taken from is in the museum collection.
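To make the identifier formats above a bit more concrete, here is a minimal Python sketch that tells a ProcessID apart from the other identifiers mentioned. The pattern (uppercase project code, running number, hyphen, two-digit year) is an assumption based only on the examples in this comment, and `parse_process_id` is a hypothetical helper, not anything provided by BOLD.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, inferred from examples such as ZMBN1644-20 and HABFA2141-22:
# uppercase project code, running number, hyphen, two-digit year.
PROCESS_ID = re.compile(r"^(?P<project>[A-Z]+)(?P<number>\d+)-(?P<year>\d{2})$")


class ProcessId(NamedTuple):
    project: str
    number: int
    year: int  # two-digit year exactly as written in the ID


def parse_process_id(value: str) -> Optional[ProcessId]:
    """Return the parts of a ProcessID, or None if the value does not fit the assumed pattern."""
    match = PROCESS_ID.match(value.strip())
    if match is None:
        return None
    return ProcessId(match["project"], int(match["number"]), int(match["year"]))


# ProcessIDs parse; the SampleID-style values (HetNor196, ZMBN_123456) do not.
for candidate in ["ZMBN1644-20", "HABFA2141-22", "HetNor196", "ZMBN_123456"]:
    print(candidate, "->", parse_process_id(candidate))
```

A check like this could help when deciding which of the values in a "Voucher" field is the ProcessID, but real project codes may not all follow the assumed pattern.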

knutanders commented 1 year ago

Sorry for the late reply @kkongshavn! It's been a very busy week.

iBOL publishes this record with Anthocoris minki as value for the DwC term 'species' and with no value for the term 'identificationQualifier'. The term 'scientificName' has a reference to an ID from BOLD but, as of today, Artskart is not able to interpret this ID.

This is something that may change in the future, but it would require a fair bit of development and testing to ensure that errors are not introduced.

Is it possible that the Bergen University Museum can contact iBOL/BOLD and ask them to change the data they publish?
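For anyone who wants to look at these terms for a single record, here is a rough Python sketch. The endpoint `https://api.gbif.org/v1/occurrence/{key}` is the public GBIF occurrence API; the assumption (not checked in this thread) is that the JSON keys mirror the DwC term names. The sample record is hypothetical and only reuses the values described above.

```python
import json
from typing import Dict, Optional
from urllib.request import urlopen

NAME_TERMS = ("scientificName", "species", "verbatimIdentification", "identificationQualifier")


def fetch_occurrence(key: int) -> Dict[str, object]:
    """Fetch one occurrence record from the GBIF API (needs network access)."""
    with urlopen(f"https://api.gbif.org/v1/occurrence/{key}") as response:
        return json.load(response)


def name_fields(record: Dict[str, object]) -> Dict[str, Optional[object]]:
    """Pick out the name-related terms; missing or empty values become None."""
    return {term: (record.get(term) or None) for term in NAME_TERMS}


def species_without_qualifier(record: Dict[str, object]) -> bool:
    """True when a species name is given but nothing signals uncertainty."""
    fields = name_fields(record)
    return fields["species"] is not None and fields["identificationQualifier"] is None


# Hypothetical record built from the values mentioned in this thread.
record = {
    "scientificName": "BOLD:ACX9534",
    "species": "Anthocoris minki",
    "verbatimIdentification": "Anthocoris confusus",
    "identificationQualifier": "",
}
print(name_fields(record))
print(species_without_qualifier(record))
```

Calling `fetch_occurrence(4038820847)` and passing the result to `name_fields` would show what is currently published for the example record, assuming the keys are named as above.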

kkongshavn commented 1 year ago

> Sorry for the late reply @kkongshavn! It's been a very busy week.
>
> iBOL publishes this record with Anthocoris minki as value for the DwC term 'species' and with no value for the term 'identificationQualifier'. The term 'scientificName' has a reference to an ID from BOLD but, as of today, Artskart is not able to interpret this ID.
>
> This is something that may change in the future, but it would require a fair bit of development and testing to ensure that errors are not introduced.
>
> Is it possible that the Bergen University Museum can contact iBOL/BOLD and ask them to change the data they publish?

I think the problem is in how GBIF reads the names from BINs? To me it makes no sense that specimens in a museum collection that are identified by experts should have another name imposed on them because they are genetically similar ("identical-ish") to other specimens identified by someone else to another species. And in this case, there was a "cf." in that second ID that has then disappeared.

Why is the name on the specimens from us not the same in GBIF as what we uploaded them as in BOLD?

rukayaj commented 1 year ago

So @kkongshavn are you saying you think e.g. for this record https://www.gbif.org/occurrence/4038820847 we should publish the verbatim identification "Anthocoris confusus" as the scientificName?

dagendresen commented 1 year ago

NB! this Occurrence https://www.gbif.org/occurrence/4038820847 is published by iBOL. Is the same material sample maybe also published from UiB? (from the UiB insect collection?). Notice that there is no Anthocoris confusus in any of the UiB datasets.

kkongshavn commented 1 year ago

> So @kkongshavn are you saying you think e.g. for this record https://www.gbif.org/occurrence/4038820847 we should publish the verbatim identification "Anthocoris confusus" as the scientificName?

What Steffen has identified this particular specimen to is Anthocoris confusus. It has (interim) museum number HetNor170, BOLD ProcessID ZMBN1592-19, and has, based on the DNA barcode, been assigned to the BIN "BOLD:ACX9534". In this BIN there are other names as well, but that does not mean that our specimen should be forced into one of those names (which, according to Steffen, are wrong). That may not always be the case: it could be that COI is not good for telling apart species in this case, it could be a species complex, or there could be other things going on. We cannot tell that for sure from looking only at the barcode and a photo.

Letting a name from a BIN overshadow the name assigned to the specimen by the expert who has identified it is a very poor solution.

EDIT: sorry, it has museum number HetNor113, and now I'm confused, because that doesn't make sense. I'll ask Steffen.

kkongshavn commented 1 year ago

> NB! this Occurrence https://www.gbif.org/occurrence/4038820847 is published by iBOL. Is the same material sample maybe also published from UiB? (from the UiB insect collection?). Notice that there is no Anthocoris confusus in any of the UiB datasets.

It should be/come in the UiB entomology collection - I suspect there may be a lag in registering/uploading? I'll ask.

knutanders commented 1 year ago

I still think the problem here is that iBOL publishes these records with Anthocoris minki as value for the DwC term 'species' and with no value for the term 'identificationQualifier'. I am not totally sure what is going on with the BIN that is given as value for the term acceptedScientificName, but Artskart cannot interpret this ID as of today. The core of the problem is, however, the way the data are published by iBOL.

It can perhaps be part of a solution to use verbatimIdentification as scientific name, but we cannot do that without thoroughly testing the solution first. Also, I am not sure I understand why it should be better to use the name from 'verbatimIdentification' rather than from 'species'. In this case the former is correct and the latter is incorrect, but how do we know that more generally?

I think all the observations in Norway erroneously named Anthocoris minki by iBOL are related to a project funded by Artsdatabanken and led by Karl Thunes at NIBIO. In this project many of the specimens were identified by Steffen Roth, but these are not necessarily the same material samples as published by iBOL. Information about the project can be found here: https://www.artsdatabanken.no/Pages/223784/Virvelloese_dyr_i_askekroner_br__small_38-16__small

The data is published by Artsdatabanken (with scientific names identified by Steffen Roth) in this dataset: https://ipt.artsdatabanken.no/resource?r=askekroner_38-16
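On the more general question of which name to trust, one cautious option is not to pick either, but simply to flag records where the two terms disagree so that someone can look at them. A minimal Python sketch of that idea follows; the record layout, the `id` key and the `EXAMPLE-…` records are hypothetical, and this is not how Artskart or GBIF actually process names.

```python
from typing import Dict, List


def normalise(name: str) -> str:
    """Collapse whitespace and case so trivial differences are not flagged."""
    return " ".join(name.split()).lower()


def flag_conflicts(records: List[Dict[str, str]]) -> List[str]:
    """Return the IDs of records whose 'species' differs from 'verbatimIdentification'."""
    flagged = []
    for record in records:
        species = record.get("species", "")
        verbatim = record.get("verbatimIdentification", "")
        # Only compare when both values are present; nothing is overwritten.
        if species and verbatim and normalise(species) != normalise(verbatim):
            flagged.append(record["id"])
    return flagged


# Hypothetical records; only the first one reuses names from this thread.
records = [
    {"id": "ZMBN1592-19", "species": "Anthocoris minki", "verbatimIdentification": "Anthocoris confusus"},
    {"id": "EXAMPLE-1", "species": "Anthocoris confusus", "verbatimIdentification": "Anthocoris  confusus"},
    {"id": "EXAMPLE-2", "species": "Anthocoris minki", "verbatimIdentification": ""},
]
print(flag_conflicts(records))
```

This does not say which name is right; it only surfaces the cases where the published 'species' and the original identification point in different directions.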

kkongshavn commented 1 year ago

> I still think the problem here is that iBOL publishes these records with Anthocoris minki as value for the DwC term 'species' and with no value for the term 'identificationQualifier'. I am not totally sure what is going on with the BIN that is given as value for the term acceptedScientificName, but Artskart cannot interpret this ID as of today. The core of the problem is, however, the way the data are published by iBOL.
>
> It can perhaps be part of a solution to use verbatimIdentification as scientific name, but we cannot do that without thoroughly testing the solution first. Also, I am not sure I understand why it should be better to use the name from 'verbatimIdentification' rather than from 'species'. In this case the former is correct and the latter is incorrect, but how do we know that more generally?

I think we agree(?) that the issue comes from how iBOL/GBIF talks about BINs and their identification. So if a name in a BIN "becomes" the name of all specimens found in that BIN when it comes into GBIF (and from there into Artskart, right? It doesn't go straight from iBOL to Artskart?) regardless of the original identification, that creates problems. If this isn't just a very strange single case, I assume this is a systemic issue and something GBIF & iBOL need to decide how to resolve? @dagendresen ?

> I think all the observations in Norway erroneously named Anthocoris minki by iBOL are related to a project funded by Artsdatabanken and led by Karl Thunes at NIBIO. In this project many of the specimens were identified by Steffen Roth, but these are not necessarily the same material samples as published by iBOL. Information about the project can be found here: https://www.artsdatabanken.no/Pages/223784/Virvelloese_dyr_i_askekroner_br__small_38-16__small
>
> The data is published by Artsdatabanken (with scientific names identified by Steffen Roth) in this dataset: https://ipt.artsdatabanken.no/resource?r=askekroner_38-16

Yes, it is these records that are misnamed in this case. Also: these entries have not been uploaded to musit from UMB yet, which is why they are not found in our collection now, but yes, they have been reported/published as part of the Artsprosjekt.

dagendresen commented 1 year ago

Thanks Knut Anders for identifying the dataset https://doi.org/10.15468/tsskag and thus the corresponding data records https://www.gbif.org/occurrence/2863694460 = https://www.gbif.org/occurrence/4038820847

> The core to the problem is, however, the way the data are published by iBOL.

Yes, this is how the data is published from iBOL. GBIF will not modify data from a data publishing institution.

> Also, I am not sure I understand why it should be better to use the name from 'verbatimIdentification' rather than from 'species'. In this case the former one is correct, and the latter is incorrect but how do we know that more generally?

Yes, I think it is a very bad idea to use http://rs.tdwg.org/dwc/terms/verbatimIdentification as a replacement for what the data publishing institution (here iBOL) has published as http://rs.tdwg.org/dwc/terms/scientificName (here BOLD:ACX9534) and taxonID (here http://www.boldsystems.org/index.php/Public_BarcodeCluster?clusteruri=BOLD:ACX9534)

I believe the approach should rather be to try to understand better why iBOL makes this mistake in their reference library ... and try to help iBOL discriminate the reference sequences better for matching to the correct species.

knutanders commented 1 year ago

@kkongshavn Artskart reads data directly from iBOL. In theory the data could also go via GBIF but that is not usually what happens :0)


kkongshavn commented 1 year ago

Hi all, I can try to reach out to BOLD support about this after summer. I might need some help formulating the question though.

kkongshavn commented 10 months ago

Hi all, trying to pick this up again. If I write to BOLD support, what should I ask them? An attempt is below (does it make sense?):

Six records with the following BOLD "SampleId [identification]" ZMBN1592-19[Anthocoris confusus], ZMBN1644-20[Anthocoris confusus], ZMBN1599-19[Anthocoris confusus], ZMBN1669-20[Anthocoris simulans], ZMBN1671-20[Anthocoris simulans], ZMBN1591-19[Anthocoris confusus] are all showing up in the national species portal for Norway, Artskart (https://artskart.artsdatabanken.no/) as Anthocoris minki, which is not found in Norway.

Following the breadcrumbs, it’s this is coming from how iBOL publishes the data. The BIN in BOLD BOLD:ACX9534, has three different species names being used – so it’s by no means certain that A. minki is the correct one. Yet iBol publishes this record with Anthocoris minki as value for the DwC term 'species' and with no value for the term 'identificationQualifier'. The term 'scientificName' has a reference to an ID from BOLD but, as of today, Artskart is not able to interpret this ID. Are you able to change how you publish the value for species, or do you have other suggestions to how this might be solved?

rukayaj commented 10 months ago

I think that's ok @kkongshavn. One sentence needs slight rephrasing:

"Following the breadcrumbs, it looks like this is coming from how iBOL publishes the data. "

But yes, I think this is what the problem is, right @knutanders ?

knutanders commented 10 months ago

Hi @kkongshavn @rukayaj

Perhaps the e-mail to iBOL can be written so that the problem is formulated as a more general problem? If it is posed as a question related to Artskart, they are perhaps less likely to look into it (?).

iBOL publishes these observations with an incorrect name for the species term and they provide no information to indicate uncertainty.

kkongshavn commented 10 months ago

> iBOL publishes these observations with an incorrect name for the species term and they provide no information to indicate uncertainty.

In GBIF they are also given with an incorrect name, though there is at least a cf. on it: BOLD:ACX9534 (cf. Anthocoris minki). In Artskart this uncertainty is missing completely.

But yes, I agree that I should try to phrase it in more general terms.
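To illustrate what gets lost along the way, here is a small Python sketch that splits a name string of the form shown in GBIF, "BOLD:ACX9534 (cf. Anthocoris minki)", into the BIN, the qualifier and the name. The pattern is an assumption based on this single example, and `parse_bin_name` is a hypothetical helper rather than anything GBIF or Artskart provides.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: "BOLD:<code>" optionally followed by "(<qualifier> <name>)".
BIN_NAME = re.compile(
    r"^(?P<bin>BOLD:[A-Z0-9]+)"
    r"(?:\s*\((?:(?P<qualifier>cf\.|aff\.)\s+)?(?P<name>[^)]+)\))?$"
)


class BinName(NamedTuple):
    bin_uri: str
    qualifier: Optional[str]  # e.g. "cf."; None when no qualifier is present
    name: Optional[str]       # the species name given in parentheses, if any


def parse_bin_name(value: str) -> Optional[BinName]:
    """Split a BIN-based name string; return None if it does not fit the assumed layout."""
    match = BIN_NAME.match(value.strip())
    if match is None:
        return None
    return BinName(match["bin"], match["qualifier"], match["name"])


print(parse_bin_name("BOLD:ACX9534 (cf. Anthocoris minki)"))
print(parse_bin_name("BOLD:ACX9534"))
print(parse_bin_name("Anthocoris minki"))
```

Keeping the qualifier as its own field is the point here: a consumer that only takes the name part ends up with a plain "Anthocoris minki", which is what seems to happen in Artskart.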

kkongshavn commented 9 months ago

New draft: Hello, I am contacting you on behalf of GBIF Norway to ask about a puzzling case we are seeing in how taxonomy of records in BOLD is communicated further. A small case study: six records with the following BOLD "SampleId [identification]" ZMBN1592-19[Anthocoris confusus], ZMBN1644-20[Anthocoris confusus], ZMBN1599-19[Anthocoris confusus], ZMBN1669-20[Anthocoris simulans], ZMBN1671-20[Anthocoris simulans], ZMBN1591-19[Anthocoris confusus] are all showing up in the national species portal for Norway, Artskart (https://artskart.artsdatabanken.no/) as Anthocoris minki, which is not found in Norway. This is coming from how iBOL publishes the data. The BIN in BOLD BOLD:ACX9534, has three different species names being used – so it’s by no means certain that A. minki is the correct one. Yet iBOL publishes this record with Anthocoris minki as value for the DwC term 'species' and with no value for the term 'identificationQualifier'. Are you able to change how you publish the value for species, or do you have other suggestions as to how this might be solved?

kkongshavn commented 9 months ago

Sent this: Hello, I am contacting you on behalf of GBIF Norway to ask about a puzzling case we are seeing in how taxonomy of records submitted to BOLD is communicated outwards.

I must admit I find this a bit hard to explain, so I will give a small case study: these six records from Norway are in BIN BOLD:ACX9534. (format here: "ProcessID [identification]") ZMBN1592-19[Anthocoris confusus], ZMBN1644-20[Anthocoris confusus], ZMBN1599-19[Anthocoris confusus], ZMBN1669-20[Anthocoris simulans], ZMBN1671-20[Anthocoris simulans], ZMBN1591-19[Anthocoris confusus]

But they all show up in the Norwegian national species portal "Artskart" (https://artskart.artsdatabanken.no/) as Anthocoris minki, a species which is not found in Norway.

This seems to come from how iBOL publishes the data. The BIN in BOLD BOLD:ACX9534, has three different species names being used – so it’s by no means certain that A. minki is the correct one. Yet iBOL publishes this record with Anthocoris minki as value for the DwC term 'species' and with no value for the term 'identificationQualifier'.

Or at least that's what we think is happening - I'd like to hear what you think!

If this is the case, are you able to change how you publish the value for species, or do you have other suggestions as to how this might be solved?

Kind regards, Katrine Kongshavn

kkongshavn commented 9 months ago

🤔 Did it resolve itself..? I first got a reply from general support who just said that they don't control how others ingest their data, but then a follow up came (from BOLD in bold):

**Thank you for your email, I did some digging through BOLD's API, our Darwin core output, GBIF export along with external data releases, and for all of the list cases, the records listed in your email are associated with "Anthocoris confucius" and not "Anthocoris minki" (even though the BIN BOLD:ACX9534 does contain both species name). I have not worked with the Norwegian national species portal "Artskart" (https://artskart.artsdatabanken.no/) so I'm unfamiliar with their data ingestion process. Would you be able to shed some light on where Artskart is pulling this data from? Thanks!**

And then me: _thank you for looking into this! I will need some help from the people in Artskart to answer your question, as I am not very familiar with the data ingestion process there. However, I'm starting to wonder if some update has been run (in one of the ends), because today I cannot find the six troublesome ones anywhere - not in GBIF, and not in Artskart..! It used to be that if I searched for A. minki & Norway in GBIF, the six would appear (but as cf. minki) - now there are nil (link). How very odd - I will ask if anyone knows of changes being made. Either way; case solved? 😅 I will update you if I make any headway!_

Does anyone know what might have happened? I tried downloading the data for A. confusus from Artskart to make sure the six in question are still there, but it stalls and stops.
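For re-checking the GBIF side without going through the web portal, the search mentioned above (A. minki & Norway) can be expressed as a query against the public GBIF occurrence search API. A minimal Python sketch that only builds the URL is below; `scientificName`, `country` and `limit` are existing search parameters, while `search_url` is just a hypothetical helper name.

```python
from urllib.parse import urlencode

GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def search_url(scientific_name: str, country: str, limit: int = 20) -> str:
    """Build an occurrence search URL for a name and a two-letter country code."""
    query = urlencode({"scientificName": scientific_name, "country": country, "limit": limit})
    return f"{GBIF_SEARCH}?{query}"


# The same search as described above: A. minki in Norway.
print(search_url("Anthocoris minki", "NO"))
# The name the specimens were identified as, for comparison.
print(search_url("Anthocoris confusus", "NO"))
```

Opening the resulting URLs in a browser (or fetching them with any HTTP client) returns JSON with a `count` field, which might be a quicker way to see whether the six records have moved from one name to the other.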