measurementOrFact versus GGBN

rdmpage commented 6 years ago

@dagendresen Great to see this being done. I'm curious as to the relative merits of measurementOrFact being used to store the link and sequence versus, say, the GGBN extensions. In one example DNA barcode dataset I uploaded I used GGBN, so the sequences look like this https://api.gbif.org/v1/occurrence/1502684137/fragment:

    "extensions": {
        "http://data.ggbn.org/schemas/ggbn/terms/Amplification": [
            {
                "consensusSequence": "CCTTTATCTAGTATTTGGTGCTTGAGCTGGAATAGTAGGCACAGCCTTAAGCCTTCTCATTCGAGCAGAACTAAGCCAACCTGGCGCACTCTTAGGAGACGACCAAATCTATAATGTTATTGTTACTGCACATGCCTTCGTAATGATTTTCTTTATAGTAATGCCAATTCTAATCGGGGGGTTTGGAAACTGATTAGTTCCTCTCATGCTTGGAGCCCCTGATATGGCATTCCCTCGTATGAACAACATAAGCTTCTGATTACTCCCTCCGTCATTCCTCCTTTTACTAGCTTCTTCCGGAGTTGAGGCCGGAGCCGGGACAGGTTGAACTGTCTACCCCCCACTGTCTGGTAATCTAGCCCATGCGGGAGCATCAGTAGATTTAACCATCTTCTCCCTGCACCTGGCAGGTATTTCATCAATCCTAGGAGCAATCAACTTTATCACTACCATCATCAACATAAAACCCCCCGCTATCTCTCAATACCAAACTCCTTTATTTGTTTGGGCTGTTCTAATTACTGCCGTTCTTCTACTCCTATCTCTCCCAGTCCTAGCTGCTGGCATTACTATGCTCCTGACCGACCGAAATCTTAATACTACCTTCTTCGATCCCGCAGGAGGAGGAGACCCAATTCTTTACCAACACCTC",
                "geneticAccessionNumber": "KP194104",
                "marker": "COI-5P"
            }
        ]
    },

I don't know whether GGBN will be widely adopted, nor how much data like this GBIF is likely to get. It is also rather hidden in the current GBIF portal: it's not displayed in the HTML view, so you have to go through the API.
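For anyone wanting to pull these sequences out through the API, here is a minimal Python sketch. It assumes the fragment response is shaped like the JSON above; the function name `amplification_sequences` is made up for illustration, and the sequence in the sample is truncated for brevity.

```python
GGBN_AMPLIFICATION = "http://data.ggbn.org/schemas/ggbn/terms/Amplification"


def amplification_sequences(fragment):
    """Return (accession, marker, sequence) tuples from an occurrence fragment.

    Assumes `fragment` is a dict shaped like the JSON above. Records
    without the GGBN Amplification extension yield an empty list.
    """
    rows = fragment.get("extensions", {}).get(GGBN_AMPLIFICATION, [])
    return [
        (
            row.get("geneticAccessionNumber"),
            row.get("marker"),
            row.get("consensusSequence"),
        )
        for row in rows
    ]


# Sample mirroring the fragment above (sequence truncated for brevity).
sample = {
    "extensions": {
        GGBN_AMPLIFICATION: [
            {
                "consensusSequence": "CCTTTATCTAGTATTTGGTGCTTGAGCTGG",
                "geneticAccessionNumber": "KP194104",
                "marker": "COI-5P",
            }
        ]
    }
}

for accession, marker, sequence in amplification_sequences(sample):
    print(accession, marker, len(sequence))
```

To try it on the live record, load the JSON from the fragment URL above (for example with `urllib.request` and `json`) and pass the resulting dict to the function.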

I guess measurementOrFact has the advantage that the portal supports it already, so people can actually see the sequences (this opens up all sorts of interesting possibilities, such as GBIF analysing sequence data).

The other issue is duplication. As GBIF ingests more and more BOLD sequences, existing records will be duplicated. What if we linked those duplicates? In other words, not only say that this GBIF occurrence from a museum has this DNA barcode, but that DNA barcode is also in GBIF as occurrence xxx?

dagendresen commented 6 years ago

Many thanks for your example on using the GGBN extensions! Depending on further adoption of the GGBN extensions, this is certainly interesting!

Regarding duplication - yes, linking the Museum specimen (dwc:Occurrence) and the BOLD sequence (dwc:Occurrence) is important! However, I would not consider the two things to be direct duplicate dwc:Occurrence resources. The Museum specimen and the BOLD sequence each provide different types of "evidence" (dwc:Occurrence definition) for the same "species occurrence", justifying separate GBIF occurrence records. But, yes, linking the two occurrence resources is very important!

rdmpage commented 6 years ago

@dagendresen Then there are additional questions:

In what direction should the links go? Should museums link to BOLD, or BOLD to museums, or both?
Can we have third-party occurrence datasets that consist entirely of links between BOLD (or other sequence databases) and GBIF? For example, what if neither BOLD nor a particular museum has the resources to make the links, but a third party (such as myself) does? How do we add those links? (This speaks to the wider issue of being able to augment existing data.)
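To make the "dataset of links" idea concrete, here is a rough Python sketch of what such a file might contain. The column names are terms from the Darwin Core ResourceRelationship class; everything else (the function names, the relationship label, the identifiers) is a placeholder, and whether GBIF would accept a dataset like this is exactly the open question.

```python
import csv
import io

# Darwin Core ResourceRelationship terms used as column names.
FIELDS = [
    "resourceID",
    "relationshipOfResource",
    "relatedResourceID",
    "relationshipAccordingTo",
]


def link_rows(pairs, according_to):
    """Build one row per (museum occurrence, BOLD occurrence) pair."""
    return [
        {
            "resourceID": museum_id,
            # Hypothetical label, not a controlled vocabulary value.
            "relationshipOfResource": "same species occurrence as",
            "relatedResourceID": bold_id,
            "relationshipAccordingTo": according_to,
        }
        for museum_id, bold_id in pairs
    ]


def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder identifiers, not real records.
pairs = [("museum-occurrence-1", "bold-occurrence-1")]
print(to_csv(link_rows(pairs, "third-party linker")))
```

Recording who asserted each link (`relationshipAccordingTo`) seems worth keeping in any such scheme, since the links come from neither data publisher.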

dagendresen commented 6 years ago

I do like the idea of supporting/allowing in GBIF datasets consisting entirely of links between resources in other datasets!

At GBIF.no we were planning to use data annotations to link the museum specimens to the BOLD sequences (etc.) and parse these into the Darwin Core archives (for the museum datasets) in the IPT before publishing in GBIF. In addition, the Norwegian Museum database system (MUSIT) has recently introduced a module for "measurement data" where the BOLD Process ID identifiers can be reported by the herbarium collection managers/curators - and parsed (by GBIF.no) into the DwCA (together with the Process IDs from the data annotation system) before publishing in GBIF.

PS: In the Report of the Task Group on GBIF Data Fitness for Use in Agrobiodiversity (https://www.gbif.org/document/82283/) I tried, on page 27 (recommendations 6.9.1, 6.9.2 and 6.9.3), to describe what I mean by the need for an "occurrence-level backbone" of distinct "species occurrences", that is, distinct "species occurrences" for which more than one dwc:Occurrence record exists in GBIF.

rdmpage commented 6 years ago

> PS: In the Report of the Task Group on GBIF Data Fitness for Use in Agrobiodiversity (https://www.gbif.org/document/82283/) I tried, on page 27 (recommendations 6.9.1, 6.9.2 and 6.9.3), to describe what I mean by the need for an "occurrence-level backbone" of distinct "species occurrences", that is, distinct "species occurrences" for which more than one dwc:Occurrence record exists in GBIF.

+1 I'm hoping one day to convince @timrobertson100 that adding one more field to the GBIF index (essentially a "cluster_id") to link duplicate records would be a very useful thing to have. In other words, by default every occurrence has a cluster_id that is the same as its occurrence id (the GBIF integer at the end of https://gbif.org/occurrence/). If we know or discover that two or more occurrences are the same, they all get the same cluster_id (e.g., take the smallest integer id of the occurrences being clustered). It would be pretty easy to implement...
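A minimal sketch of that cluster_id rule in Python, using a union-find structure. This is only an illustration of the idea, not how GBIF indexes anything; the function name and the integer ids are made up, and it assumes every id in a duplicate pair also appears in the list of occurrence ids.

```python
def assign_cluster_ids(occurrence_ids, duplicate_pairs):
    """Map each occurrence id to a cluster_id (smallest id in its cluster).

    Occurrences with no known duplicates keep their own id as cluster_id.
    """
    parent = {occ: occ for occ in occurrence_ids}

    def find(occ):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[occ] != occ:
            parent[occ] = parent[parent[occ]]
            occ = parent[occ]
        return occ

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as root, so the root is always the cluster_id.
            low, high = sorted((root_a, root_b))
            parent[high] = low

    return {occ: find(occ) for occ in occurrence_ids}


# Made-up integer ids for illustration.
ids = [101, 102, 103, 104]
pairs = [(103, 101), (104, 103)]
print(assign_cluster_ids(ids, pairs))
```

Here 101, 103 and 104 end up sharing cluster_id 101, while 102 keeps its own id. Duplicate pairs can arrive in any order and from any source, which would suit third-party link datasets.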

gdadade commented 6 years ago

Thanks for starting this discussion. I’m speaking here on behalf of GGBN. The GGBN Data Standard was created to fill a known gap in Darwin Core and ABCD: DNA- and tissue-specific terms plus sequencing metadata. It is meant to be used together with Darwin Core or ABCD. We are aware of the other existing standards such as MIxS, and of course we have included some of their terms.

Of course crosslinking in all directions is our goal, and we are already in contact with EMBL, NCBI, DDBJ, BOLD and GBIF to solve this issue. If you have a closer look at the GGBN Data Portal you will see that our data are usually fed by at least three sources: DNA sample, tissue sample, voucher specimen. All of them are associated/related to each other.

GGBN also provides a webservice, so third parties can check how many DNA or tissue samples are available for a certain specimen or taxon at GGBN.

So far we (and GBIF) recommend not providing DNA and tissue records to GBIF, but only the underlying specimens, to avoid duplicates. One can add sequence information, but more important for us is not to forget all the in-between steps such as the physical tissue and DNA sample. So far GBIF does not support associated records. Maybe in the future they will.

MeasurementOrFacts is a very good thing (by the way invented by the ABCD team, not by GBIF), but should only be used if no other terms are available.

We encourage everyone to implement our standard; Specify, for example, has done so recently. We also hope that others will follow our example and implement our loan and permit vocabulary to fulfill the Nagoya requirements. This is also an important topic we are currently working on with INSDC, BOLD and GBIF. Luckily there are multiple ways of providing data. If you just want to add the BOLD number to your specimen you don’t even have to use the MeasurementOrFacts part; you can simply use the sequencing term Darwin Core already provides.

GBIF-Europe / bold_sequence
