gbif / portal-feedback

User feedback for the GBIF API, website and published data. You can ask questions here. 🗨❓

Are IBOL and EMBL-EBI records really occurrence records? #4126

Open Mesibov opened 2 years ago

Mesibov commented 2 years ago

GBIF has more than 9.3 million "occurrence records" from the International Barcode of Life project (IBOL). Here I trace an IBOL record in GBIF for the centipede Craterostigmus tasmanianus, with unknown date and unknown location: https://www.gbif.org/occurrence/2249233625

The occurrenceID for this record is a URL (http://bins.boldsystems.org/index.php/Public_RecordView?processid=GBA30229-19). Going to that webpage, I find no occurrence details at all. The barcode was mined by IBOL from GenBank HQ453435. Going to the GenBank webpage (https://www.ncbi.nlm.nih.gov/nuccore/HQ453435), I again find no occurrence details, but the voucher from which the barcode was obtained is referenced as "isolate DNA103594" from the Museum of Comparative Zoology. Going to MCZ's online database, MCZBASE (https://mczbase.mcz.harvard.edu/), I enter "DNA103594" as a DNA reference code and learn that the DNA came from the MCZ registered sample "MCZ:IZ:132147". Going to the MCZ webpage for that sample (https://mczbase.mcz.harvard.edu/guid/MCZ:IZ:132147), I learn that the IBOL record derives from my own collection on Kerrisons Road (Tasmania), 41°12'11"S 146°42'21"E, on 2008-08-15.

To get that information I went from GBIF to IBOL to GenBank to MCZBASE. Further, the MCZ sample (MCZ:IZ:132147) is already in GBIF with its own occurrence record: https://www.gbif.org/occurrence/728924393
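For readers who want to repeat this kind of trace, here is a minimal Python sketch of the chain of links followed above. It only pulls the BOLD process ID out of the occurrenceID and assembles the URLs quoted in this comment; it does not fetch anything. The function names (`bold_process_id`, `trace_links`) are hypothetical, and the GBIF API URL pattern `https://api.gbif.org/v1/occurrence/{key}` is assumed to be the public single-occurrence endpoint. The GenBank accession and the MCZ GUID still have to be read off the intermediate pages by hand, which is exactly the problem described.

```python
from urllib.parse import urlparse, parse_qs

# Assumed URL patterns; the last three mirror the links quoted in the comment above.
GBIF_API = "https://api.gbif.org/v1/occurrence/{key}"
GENBANK = "https://www.ncbi.nlm.nih.gov/nuccore/{accession}"
MCZBASE = "https://mczbase.mcz.harvard.edu/guid/{guid}"


def bold_process_id(occurrence_id):
    """Return the 'processid' query parameter of a BOLD record URL, or None."""
    query = parse_qs(urlparse(occurrence_id).query)
    values = query.get("processid")
    return values[0] if values else None


def trace_links(gbif_key, occurrence_id, genbank_accession, mcz_guid):
    """List the hops GBIF -> BOLD -> GenBank -> MCZBASE as (label, url) pairs."""
    return [
        ("GBIF", GBIF_API.format(key=gbif_key)),
        ("BOLD", occurrence_id),
        ("GenBank", GENBANK.format(accession=genbank_accession)),
        ("MCZBASE", MCZBASE.format(guid=mcz_guid)),
    ]


occurrence_id = (
    "http://bins.boldsystems.org/index.php/Public_RecordView?processid=GBA30229-19"
)
for label, url in trace_links(2249233625, occurrence_id, "HQ453435", "MCZ:IZ:132147"):
    print(f"{label}: {url}")
```

Three of the four hops need an identifier that is not present on the GBIF record itself, so nothing here can be automated from the GBIF record alone.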

Similarly, GBIF has more than 6.2 million records from the European Nucleotide Archive (EMBL-EBI). Tracing this one https://www.gbif.org/occurrence/3349999256 I again see an unknown location and date, but this time the occurrenceID is a GenBank code (HQ026642) and the catalogNumber is an MCZ sample code (MCZ:IZ:132153). Again, the sample from which the sequence was derived has its own GBIF record: https://www.gbif.org/occurrence/728924377

There are, in fact, 13 samples in MCZ from that one Kerrisons Road collection (MCZ:IZ:132147 - MCZ:IZ:132159). Each has its own GBIF occurrence record and at least one IBOL and/or EMBL-EBI record without collection details.
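As a quick check on that count, the sketch below enumerates the sample codes in the range and builds one GBIF occurrence-search URL per code. It is an illustration only: `mcz_codes` is a hypothetical helper, and it assumes that the `catalogNumber` filter of the GBIF occurrence search API matches these codes exactly as written, which is not verified here.

```python
from urllib.parse import urlencode

# Assumed endpoint: the public GBIF occurrence search.
SEARCH_API = "https://api.gbif.org/v1/occurrence/search"


def mcz_codes(first, last, prefix="MCZ:IZ:"):
    """Return the MCZ sample codes from first to last, inclusive."""
    return [f"{prefix}{number}" for number in range(first, last + 1)]


codes = mcz_codes(132147, 132159)
print(len(codes), "samples")

for code in codes:
    # urlencode percent-escapes the colons in the sample code.
    print(SEARCH_API + "?" + urlencode({"catalogNumber": code}))
```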

In what sense are the IBOL and EMBL-EBI records in GBIF useful occurrence records? How many of these 15 million records would require tracing of digital breadcrumbs to get the original occurrence data?

MortenHofft commented 2 years ago

This doesn't answer the question, and I'm not the one to do so, but here are some numbers. As of today, 792,187 EMBL records are in a cluster (out of 4.7 million) and 938,965 iBOL records are in a cluster (out of 9.3 million).
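To put those counts in proportion, here is a small Python calculation of the clustered share for each dataset. It uses only the figures quoted in this comment; the totals are the rounded "out of" values, so the percentages are approximate.

```python
# (records in a cluster, approximate total records), as quoted in this comment
counts = {
    "EMBL-EBI": (792_187, 4_700_000),
    "iBOL": (938_965, 9_300_000),
}


def clustered_share(in_cluster, total):
    """Fraction of a dataset's records that belong to a cluster."""
    return in_cluster / total


for name, (in_cluster, total) in counts.items():
    print(f"{name}: {clustered_share(in_cluster, total):.1%} of records clustered")
```

On these figures, roughly 17% of the EMBL-EBI records and 10% of the iBOL records are currently linked to another record.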

E.g. https://www.gbif.org/occurrence/863247726/cluster from Museum of Comparative Zoology, Harvard University

But no doubt many links are missed, often because identifiers are missing from the records. You have just proven that with two nice examples.