sformel-usgs opened 5 days ago
Very nice to have a simple, but realistically structured dataset to work with.
In relation to bullet 3: this touches on an aspect related to what @pragermh brings up in https://github.com/gbif/rs.gbif.org/issues/136#issuecomment-2136912091.
If you look at eDNA data from the OTU-table perspective, there are two dimensions with metadata: metadata that relates to the OTUs and to the samples, respectively. All sample metadata is in reality connected to the Event (or a parentEvent), and the `DNA_sequence` is in practice the only part of the OTU dimension that sits in the DNA extension. All the other terms/values relating to the OTU, e.g. the taxonomy-related fields, are currently accommodated in the Occurrence Core. NB: the only value in a metabarcoding dataset that is truly occurrence-specific is the read_count (the read abundance of OTU X in sample Y), currently accommodated in `organismQuantity` of the Occurrence Core.
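The relationship between the two views described above can be sketched as a reshape: the OTU contingency table flattens into one occurrence row per non-zero read count. A minimal illustration with pandas, assuming hypothetical OTU/sample names and column labels:

```python
import pandas as pd

# Hypothetical OTU contingency table: rows = OTUs, columns = samples (events)
otu = pd.DataFrame(
    {"sample1": [120, 0, 33], "sample2": [5, 87, 0]},
    index=["otu_A", "otu_B", "otu_C"],
)

# Flatten to one row per occurrence; read_count maps to organismQuantity
occ = (
    otu.rename_axis("otuID")
    .reset_index()
    .melt(id_vars="otuID", var_name="eventID", value_name="organismQuantity")
)
occ = occ[occ["organismQuantity"] > 0]  # zero counts are absences, not occurrences
print(occ)
```

Everything else attached to each row (sample metadata, taxonomy) is redundant copies of the two marginal dimensions, which is where the redundancy discussed below comes from.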
These facts could be used to reduce the redundancy of the DwC-A for such datasets in several ways. Some ideas building on the above:
`organismQuantity` is the only occurrence-specific value, so one idea would be to keep only that per occurrence and somehow make the contingency-table structure a more central part of the model.

@sformel-usgs: Do you have the underlying fictional OTU tables, so we may use these in this exploration?
@tobiasgf I revised the DwC-A slightly based on your comments and updated the link above with the new DwC-A (v1.5). To your last point, I understand the concern over generally recommending this structure, and I agree that it shouldn't be presented as the optimal structure for all datasets. But it is an important option for complicated projects that are otherwise struggling to flatten their data for DwC.
Some observations in relation to the toy dataset structure, MDT, etc.
The structure is clear and logical seen from a "sampling design perspective". In a realistic case, the samples and lab processing would be 1:1 with the chart.
The sequencing data would realistically come in two pools, and be prepared separately in CO1 and 16S dedicated bioinformatic pipelines to produce two separate OTU tables (or similar, e.g. BIOM) AND analysed "ecologically" as two separate datasets, maybe combined with other CO1 and 16S data, respectively.
Re-users of the data – as metabarcoding data (and not just dots on a map) – would normally want these marker-gene-separated datasets (CO1 and 16S) in an OTU-table-like structure, e.g. retrieved via `rgbif` and/or `robis`. Calculating the `sampleSizeValue` (total read_count per sample) needs to be done per marker.
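The per-marker `sampleSizeValue` calculation mentioned above amounts to a group-by over sample and marker. A sketch with pandas, using hypothetical column names:

```python
import pandas as pd

# Hypothetical occurrence rows carrying a marker column and read counts
occ = pd.DataFrame({
    "eventID": ["s1", "s1", "s1", "s2"],
    "marker": ["16S", "16S", "COI", "COI"],
    "organismQuantity": [100, 50, 30, 70],
})

# sampleSizeValue must be summed per sample AND per marker,
# not per sample alone, or the totals mix sequencing pools
sample_size = (
    occ.groupby(["eventID", "marker"])["organismQuantity"]
    .sum()
    .rename("sampleSizeValue")
    .reset_index()
)
print(sample_size)
```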
My worries for generally recommending a data structure like this (as opposed to separate datasets): 1) it may be error-prone to prepare "by hand", and 2) if prepared and published without using the MDT, the OTU tables (and marker-gene-specific datasets) are difficult to reconstruct for people who want to re-use this as metabarcoding data.
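For what the reconstruction in point 2 involves: recovering a per-marker OTU table from a flattened occurrence file is essentially a filter plus a pivot. A sketch, assuming hypothetical column names for the marker and OTU identifiers:

```python
import pandas as pd

# Hypothetical flattened occurrence records from a multi-marker DwC-A
occ = pd.DataFrame({
    "eventID": ["s1", "s1", "s2", "s2"],
    "otuID": ["otu_A", "otu_B", "otu_A", "otu_C"],
    "marker": ["16S", "16S", "16S", "COI"],
    "organismQuantity": [120, 5, 33, 9],
})

# One contingency table per marker; missing OTU/sample combinations become 0
otu_16s = (
    occ[occ["marker"] == "16S"]
    .pivot(index="otuID", columns="eventID", values="organismQuantity")
    .fillna(0)
    .astype(int)
)
print(otu_16s)
```

The catch is that this only works if a marker field (or something equivalent) is reliably present on every row, which is exactly what gets lost when markers are mixed without the MDT.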
Hi all, excited to see this discussion going on. We are hoping to publish a number of datasets to GBIF/OBIS in the next few months, and we would ideally want to leverage the new data model and implement it in our edna2obis pipeline so our data can be mobilized in an automated way. As was mentioned in the meeting today, there are really two types of data re-users, and we at NOAA want to make sure our data and metadata are connected in a way that serves both.
1) People who want all the data from a specific cruise. Ideally, they can go to the Dataset landing page for a cruise, and see some nice summary stats about the occurrences and markers used (I think OBIS does this particularly well).
The user should ideally be able to download the data in separate OTU and taxonomy tables for each marker from the website and from the mapper or API, or at the very least download the DWC-A and use code to get those separate OTU tables. Any user (even one who isn't already familiar with the dataset) should be able to associate occurrences from different markers that derive from the same water sample.
2) People who access the occurrences as "dots on a map". When downloading the data, these users should get all of the metadata associated with the occurrence (the event sample metadata, the DNA-derived data, taxonomy, the project metadata like who collected it, the total read_count per marker per sample, etc.).
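On the point above about associating occurrences from different markers that derive from the same water sample: with hierarchical events this is a self-join on the shared `parentEventID`. A sketch with hypothetical IDs:

```python
import pandas as pd

# Hypothetical occurrences: marker-level events nested under a water sample
occ = pd.DataFrame({
    "occurrenceID": ["o1", "o2", "o3"],
    "eventID": ["s1_16S", "s1_COI", "s2_16S"],  # marker-level events
    "parentEventID": ["s1", "s1", "s2"],        # the shared water sample
    "marker": ["16S", "COI", "16S"],
})

# Self-join on parentEventID pairs up occurrences from the same water sample
pairs = occ.merge(occ, on="parentEventID", suffixes=("_a", "_b"))
cross_marker = pairs[pairs["marker_a"] != pairs["marker_b"]]
print(cross_marker[["occurrenceID_a", "occurrenceID_b", "parentEventID"]])
```

This only works for users who are not already familiar with the dataset if the event hierarchy (which level is the water sample) is documented or inferable.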
I know some of this functionality, like including OTU tables, is in the future, but if there was guidance on how to format our data files now so that they can be appropriately linked in the future for both of these user needs that would be great. Thanks for your work on this!!
...and almost forgot, here are the OTU tables and other files. GitHub won't let me attach the fasta file, but I can email it. Happy to revise any of these if needed:
- dna_metadata.csv
- samp_metadata.csv
- tax_table.csv
- 16S_otu_table.csv
- COI_otu_table.csv
TL;DR I've created a toy dataset to help model publishing of DNA with hierarchical events. I wasn't sure of the best place to communicate this, so I chose this repo. Happy to share it through any other channels that are useful.
`occurrenceID` was recently added to the DNA extension (https://github.com/gbif/rs.gbif.org/issues/136). @pieterprovoost and I spoke about the need for an example dataset to model this in OBIS. I'm not sure if GBIF already has other examples they are working with, but perhaps this example dataset will also be useful for the MDT (@tobiasgf) and DwC2 (@timrobertson100). We also need to update the Publishing DNA guide to demonstrate this utility.

I've created a toy dataset with IPT 3.1.0 that uses Event Core + Occurrence Extension + DNA Extension:
dwca-test_addition_occurrenceid_dna_extension-v1.5.zip
A couple of notes:
- There are terms, e.g. `fieldNotes`, `samplingProtocol`, and `footprintWKT`, that only apply to a few parent events.
- `DNA_sequence` keeps us from realizing any reduction in the number of rows of the DNA extension, because it is 1:1 with `occurrenceID`. However, I think most (if not all) other terms can be linked to parent events (e.g. water sample, library prep) rather than the occurrence. Maybe there is some way to separate these chunks of information and reduce the volume of the extension.

Here is what the event structure looks like: