gbif / doc-publishing-dna-derived-data

This guide shows how to publish DNA-derived spatiotemporal biodiversity data and make it discoverable through national and global biodiversity data discovery platforms. Based on experiences from Australia, Norway, Sweden, UNITE, and GBIF.
https://doi.org/10.35035/doc-vf1a-nr22

modeling of updated DNA extension; updating guidance to demo nested events #213

Open sformel-usgs opened 5 days ago

sformel-usgs commented 5 days ago

TL;DR I've created a toy dataset to help model publishing of DNA-derived data with hierarchical events. I wasn't sure of the best place to communicate this, so I chose this repo. Happy to share it through any other channels that are useful.

occurrenceID was recently added to the DNA extension (https://github.com/gbif/rs.gbif.org/issues/136). @pieterprovoost and I spoke about the need for an example dataset to model this in OBIS. I'm not sure if GBIF already has other examples they are working with, but perhaps this example dataset will also be useful for the MDT (@tobiasgf) and DwC2 (@timrobertson100). We also need to update the Publishing DNA guide to demonstrate this utility.

I've created a toy dataset with IPT 3.1.0 that uses Event Core + Occurrence Extension + DNA Extension:

dwca-test_addition_occurrenceid_dna_extension-v1.5.zip

A couple of notes:

  1. I thought it would be more useful to have a balanced, simple dataset than a real-world example, for now. It includes 160 occurrences nested within a six-level event hierarchy (Cruise > Site > Station > Sample > Replicate > Library).
  2. There are a few terms, like fieldNotes, samplingProtocol, and footprintWKT, that apply only to some parent events.
  3. One thing I realized while making this dataset is that the term DNA_sequence prevents any reduction in the number of rows of the DNA extension, because it is 1:1 with occurrenceID. However, I think most (if not all) other terms could be linked to parent events (e.g. water sample, library prep) rather than to the occurrence. Maybe there is some way to separate these chunks of information and reduce the volume of the extension (a small sketch of this idea follows below).
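To make note 3 concrete, here is a minimal pandas sketch (the term values and IDs are illustrative, not taken from the toy archive) of why DNA_sequence blocks any row reduction: once it is set aside, the remaining DNA-extension terms deduplicate down to one row per library-level parent event.

```python
# Illustrative only: a tiny DNA-extension-like table with hypothetical values.
import pandas as pd

dna = pd.DataFrame({
    "occurrenceID": ["occ1", "occ2", "occ3", "occ4"],
    "eventID":      ["1ASSA16S", "1ASSA16S", "1ASSACOI", "1ASSACOI"],  # library-level parent event
    "DNA_sequence": ["ACGT...", "AGGT...", "TTAC...", "TTGC..."],      # 1:1 with occurrenceID
    "pcr_primer_forward": ["16S_fwd"] * 2 + ["COI_fwd"] * 2,           # placeholder primer names
    "lib_layout":   ["paired"] * 4,
})

print(len(dna))  # 4 rows: one per occurrence, forced by DNA_sequence
print(len(dna.drop(columns=["occurrenceID", "DNA_sequence"]).drop_duplicates()))  # 2 rows: one per library event
```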

Here is what the event structure looks like:


```mermaid
graph LR

subgraph event["Event Core"]

  Cruise --> Site1 & Site2

  Site1 --> Station1A & Station1B
  Site2 --> Station2A & Station2B

  Station1A --> 1ASS["Surface Sample"] & 1ABS["Bottom Sample"]
  Station1B --> 1BSS["Surface Sample"] & 1BBS["Bottom Sample"]
  Station2A --> 2ASS["Surface Sample"] & 2ABS["Bottom Sample"]
  Station2B --> 2BSS["Surface Sample"] & 2BBS["Bottom Sample"]

  1ASS --> 1ASSA["Replicate A"] & 1ASSB["Replicate B"]
  1ABS --> 1ABSA["Replicate A"] & 1ABSB["Replicate B"]
  1BSS --> 1BSSA["Replicate A"] & 1BSSB["Replicate B"]
  1BBS --> 1BBSA["Replicate A"] & 1BBSB["Replicate B"]
  2ASS --> 2ASSA["Replicate A"] & 2ASSB["Replicate B"]
  2ABS --> 2ABSA["Replicate A"] & 2ABSB["Replicate B"]
  2BSS --> 2BSSA["Replicate A"] & 2BSSB["Replicate B"]
  2BBS --> 2BBSA["Replicate A"] & 2BBSB["Replicate B"]

  1ASSA --> 1ASSA16S["16S Library"] & 1ASSACOI["COI Library"]
  1ABSA --> 1ABSA16S["16S Library"] & 1ABSACOI["COI Library"]
  1BSSA --> 1BSSA16S["16S Library"] & 1BSSACOI["COI Library"]
  1BBSA --> 1BBSA16S["16S Library"] & 1BBSACOI["COI Library"]
  2ASSA --> 2ASSA16S["16S Library"] & 2ASSACOI["COI Library"]
  2ABSA --> 2ABSA16S["16S Library"] & 2ABSACOI["COI Library"]
  2BSSA --> 2BSSA16S["16S Library"] & 2BSSACOI["COI Library"]
  2BBSA --> 2BBSA16S["16S Library"] & 2BBSACOI["COI Library"]
  1ASSB --> 1ASSB16S["16S Library"] & 1ASSBCOI["COI Library"]
  1ABSB --> 1ABSB16S["16S Library"] & 1ABSBCOI["COI Library"]
  1BSSB --> 1BSSB16S["16S Library"] & 1BSSBCOI["COI Library"]
  1BBSB --> 1BBSB16S["16S Library"] & 1BBSBCOI["COI Library"]
  2ASSB --> 2ASSB16S["16S Library"] & 2ASSBCOI["COI Library"]
  2ABSB --> 2ABSB16S["16S Library"] & 2ABSBCOI["COI Library"]
  2BSSB --> 2BSSB16S["16S Library"] & 2BSSBCOI["COI Library"]
  2BBSB --> 2BBSB16S["16S Library"] & 2BBSBCOI["COI Library"]

end

subgraph occ["Occurrence Extension"]

  1ASSA16S --> occ1["5 occurrences (ASVs)"]
  1ASSACOI --> occ2["5 occurrences (ASVs)"]
  1ABSA16S --> occ3["5 occurrences (ASVs)"]
  1ABSACOI --> occ4["5 occurrences (ASVs)"]
  1BSSA16S --> occ5["5 occurrences (ASVs)"]
  1BSSACOI --> occ6["5 occurrences (ASVs)"]
  1BBSA16S --> occ7["5 occurrences (ASVs)"]
  1BBSACOI --> occ8["5 occurrences (ASVs)"]
  2ASSA16S --> occ9["5 occurrences (ASVs)"]
  2ASSACOI --> occ10["5 occurrences (ASVs)"]
  2ABSA16S --> occ11["5 occurrences (ASVs)"]
  2ABSACOI --> occ12["5 occurrences (ASVs)"]
  2BSSA16S --> occ13["5 occurrences (ASVs)"]
  2BSSACOI --> occ14["5 occurrences (ASVs)"]
  2BBSA16S --> occ15["5 occurrences (ASVs)"]
  2BBSACOI --> occ16["5 occurrences (ASVs)"]
  1ASSB16S --> occ21["5 occurrences (ASVs)"]
  1ASSBCOI --> occ22["5 occurrences (ASVs)"]
  1ABSB16S --> occ23["5 occurrences (ASVs)"]
  1ABSBCOI --> occ24["5 occurrences (ASVs)"]
  1BSSB16S --> occ25["5 occurrences (ASVs)"]
  1BSSBCOI --> occ26["5 occurrences (ASVs)"]
  1BBSB16S --> occ27["5 occurrences (ASVs)"]
  1BBSBCOI --> occ28["5 occurrences (ASVs)"]
  2ASSB16S --> occ29["5 occurrences (ASVs)"]
  2ASSBCOI --> occ30["5 occurrences (ASVs)"]
  2ABSB16S --> occ31["5 occurrences (ASVs)"]
  2ABSBCOI --> occ32["5 occurrences (ASVs)"]
  2BSSB16S --> occ33["5 occurrences (ASVs)"]
  2BSSBCOI --> occ34["5 occurrences (ASVs)"]
  2BBSB16S --> occ35["5 occurrences (ASVs)"]
  2BBSBCOI --> occ36["5 occurrences (ASVs)"]
end
```
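
As a rough sketch of how this hierarchy maps onto Event Core rows, the following enumerates eventID/parentEventID pairs for the same six levels (the identifiers are made up for illustration and are not the ones used in the archive):

```python
# Sketch: enumerate eventID/parentEventID pairs for the toy hierarchy.
# Identifiers are illustrative; the real DwC-A may name events differently.
events = [("Cruise1", None)]                          # top of the hierarchy
for site in ["Site1", "Site2"]:
    events.append((site, "Cruise1"))
    for letter in ["A", "B"]:
        station = f"Station{site[-1]}{letter}"
        events.append((station, site))
        for depth in ["Surface", "Bottom"]:
            sample = f"{station}_{depth}Sample"
            events.append((sample, station))
            for rep in ["A", "B"]:
                replicate = f"{sample}_Rep{rep}"
                events.append((replicate, sample))
                for marker in ["16S", "COI"]:
                    events.append((f"{replicate}_{marker}", replicate))

# 1 cruise + 2 sites + 4 stations + 8 samples + 16 replicates + 32 libraries = 63 events
print(len(events))
```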
tobiasgf commented 4 days ago

Very nice to have a simple, but realistically structured dataset to work with.

In relation to bullet 3: this touches on an aspect related to what @pragermh brings up in https://github.com/gbif/rs.gbif.org/issues/136#issuecomment-2136912091. If you look at eDNA data from the OTU-table perspective, there are two dimensions with metadata: metadata that relates to the OTUs and metadata that relates to the Samples. All Sample metadata is in reality connected to the Event (or a parentEvent), and DNA_sequence is in practice the only part of the OTU dimension that sits in the DNA extension. All the other terms/values relating to the OTU, e.g. taxonomy-related fields, are currently accommodated in the Occurrence Core. NB: the only value in a metabarcoding dataset that is truly occurrence-specific is the read_count (the read abundance of OTU X in Sample Y), currently accommodated in organismQuantity of the Occurrence Core.
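
To illustrate the two dimensions, here is a minimal sketch (pandas, with a made-up OTU table) of how the sample axis maps to eventID while the read count, the only truly occurrence-specific value, lands in organismQuantity:

```python
# Sketch: melt a hypothetical OTU table (OTUs x samples) into occurrence-style rows.
import pandas as pd

otu_table = pd.DataFrame(
    {"1ASSA16S": [120, 0, 33], "1ABSA16S": [5, 88, 0]},   # columns = library-level events
    index=["ASV_0001", "ASV_0002", "ASV_0003"],            # rows = OTUs/ASVs
)

occ = (otu_table
       .rename_axis("otuID")
       .reset_index()
       .melt(id_vars="otuID", var_name="eventID", value_name="organismQuantity"))
occ = occ[occ["organismQuantity"] > 0]                      # keep detected OTU-sample pairs only
occ["organismQuantityType"] = "DNA sequence reads"
print(occ)
```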

These facts could be used to reduce the redundancy of the DwC-A for such datasets in several ways. Some ideas building on the above:

Some observations in relation to the toy dataset structure, MDT, etc.

@sformel-usgs: Do you have the underlying fictional OTU tables, so we may use these in this exploration?

sformel-usgs commented 1 day ago

@tobiasgf I revised the DwC-A slightly based on your comments. I updated the link above with the new DwC-A (v1.5). To your last point, I understand the concern over generally recommending this structure, and I agree that it shouldn't be presented as the optimal structure for all datasets. But this is an important option for complicated projects that are otherwise struggling to flatten their data for DwC.

> Some observations in relation to the toy dataset structure, MDT, etc.
>
> My worries for generally recommending a data structure like this (as opposed to separate datasets): 1) It may be error prone to prepare "by hand", ...
>
> ...2) If preparing and publishing without using the MDT, the OTU tables (and marker-gene specific datasets) are difficult to reconstruct for people who want to re-use this as metabarcoding data.

ksilnoaa commented 1 day ago

Hi all, excited to see this discussion going on. We are hoping to publish a number of datasets to GBIF/OBIS in the next few months, and we would ideally want to be able to leverage the new data model and implement it in our edna2obis pipeline so our data can be mobilized in an automated way. As was mentioned in the meeting today, there are really two types of data re-users, and we at NOAA want to make sure our data and metadata are connected in a way that serves both.

1) People who want all the data from a specific cruise. Ideally, they can go to the Dataset landing page for a cruise, and see some nice summary stats about the occurrences and markers used (I think OBIS does this particularly well).

The user should ideally be able to download the data as separate OTU and taxonomy tables for each marker from the website and from the mapper or API, or at the very least download the DwC-A and use code to get those separate OTU tables (see the first sketch after this list). Any user (even one who isn't already familiar with the dataset) should be able to associate occurrences from different markers that derive from the same water sample.

2) People who access the occurrences as "dots on a map". When downloading the data, these users should get all of the metadata associated with the occurrence (the event sample metadata, the DNA-derived data, taxonomy, the project metadata like who collected it, the total read_count per marker per sample, etc.); see the second sketch below.
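
For the first need, a hedged sketch of getting marker-specific OTU tables back out of a downloaded archive, assuming long-format occurrence rows like the toy dataset's and an eventID naming convention that encodes the marker (both are assumptions, not guaranteed by the standard):

```python
# Sketch: rebuild one OTU table per marker from long-format occurrence rows.
# "occurrence.txt" and the column names are assumptions about the unzipped DwC-A.
import pandas as pd

occ = pd.read_csv("occurrence.txt", sep="\t")
# Assumption: the marker is recoverable from the library-level eventID suffix.
occ["marker"] = occ["eventID"].str.extract(r"(16S|COI)$", expand=False)

otu_tables = {
    marker: grp.pivot_table(index="scientificName",      # or an ASV/OTU identifier
                            columns="eventID",           # one column per library event
                            values="organismQuantity",
                            aggfunc="sum",
                            fill_value=0)
    for marker, grp in occ.groupby("marker")
}
print({m: t.shape for m, t in otu_tables.items()})
```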

I know some of this functionality, like including OTU tables, is in the future, but if there were guidance on how to format our data files now so that they can be appropriately linked in the future for both of these user needs, that would be great. Thanks for your work on this!!
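
And for the second need, a hedged sketch of flattening the archive so each occurrence row carries its DNA-derived metadata plus the metadata of every ancestor event (file and column names are assumptions about a standard Event Core + Occurrence + DNA-derived-data archive):

```python
# Sketch: attach DNA-derived metadata and the full event-hierarchy metadata to
# each occurrence. File/column names are assumptions, not the published schema.
import pandas as pd

event = pd.read_csv("event.txt", sep="\t").set_index("eventID")
occ   = pd.read_csv("occurrence.txt", sep="\t")
dna   = pd.read_csv("dna_derived_data.txt", sep="\t")     # assumed file name

def ancestor_metadata(event_id):
    """Collect metadata from an event and all of its parents, nearest first."""
    rows = []
    while pd.notna(event_id) and event_id in event.index:
        rows.append(event.loc[event_id])
        event_id = event.loc[event_id, "parentEventID"]
    if not rows:
        return pd.Series(dtype=object)
    return pd.concat(rows).groupby(level=0).first()       # child values win on clashes

flat = occ.merge(dna, on="occurrenceID", how="left", suffixes=("", "_dna"))
hierarchy = flat["eventID"].apply(ancestor_metadata)       # merged ancestor metadata per occurrence
flat = pd.concat([flat, hierarchy.add_suffix("_event")], axis=1)
print(flat.head())
```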

sformel-usgs commented 1 day ago

...and almost forgot, here are the OTU tables and other files. GitHub won't let me attach the fasta file, but I can email it. Happy to revise any of these if needed:

- dna_metadata.csv
- samp_metadata.csv
- tax_table.csv
- 16S_otu_table.csv
- COI_otu_table.csv