gbif / doc-publishing-dna-derived-data

This guide shows how to publish DNA-derived spatiotemporal biodiversity data and make it discoverable through national and global biodiversity data discovery platforms. Based on experiences from Australia, Norway, Sweden, UNITE, and GBIF.
https://doi.org/10.35035/doc-vf1a-nr22

modeling of updated DNA extension; updating guidance to demo nested events #213

Open sformel-usgs opened 5 days ago

sformel-usgs commented 5 days ago

TL;DR I've created a toy dataset to help model publishing of DNA-derived data with hierarchical events. I wasn't sure of the best place to communicate this, so I chose this repo. Happy to share it through any other channels that are useful.

occurrenceID was recently added to the DNA extension (https://github.com/gbif/rs.gbif.org/issues/136). @pieterprovoost and I spoke about the need for an example dataset to model this in OBIS. I'm not sure if GBIF already has other examples they are working with, but perhaps this example dataset will also be useful for the MDT (@tobiasgf) and DwC2 (@timrobertson100). We also need to update the Publishing DNA guide to demonstrate this utility.

I've created a toy dataset with IPT 3.1.0 that uses Event Core + Occurrence Extension + DNA Extension:

dwca-test_addition_occurrenceid_dna_extension-v1.5.zip

A couple of notes:

  1. I thought it would be more useful to have a balanced, simple dataset than a real-world example, for now. It includes 160 occurrences nested within a six-level event hierarchy (Cruise > Site > Station > Sample > Replicate > Library).
  2. There are a few terms, like fieldNotes, samplingProtocol, and footprintWKT, that apply only to some parent events.
  3. One thing I realized while making this dataset is that the term DNA_sequence prevents any reduction in the number of rows of the DNA extension, because it is 1:1 with occurrenceID. However, I think most (if not all) other terms could be linked to parent events (e.g. water sample, library prep) rather than to the occurrence. Maybe there is some way to separate these chunks of information and reduce the volume of the extension (a small sketch of this idea follows below).
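To make note 3 concrete, here is a minimal pandas sketch (the term values and IDs are illustrative, not taken from the toy archive) of why DNA_sequence blocks any row reduction: once it is set aside, the remaining DNA-extension terms deduplicate down to one row per library-level parent event.

```python
# Illustrative only: a tiny DNA-extension-like table with hypothetical values.
import pandas as pd

dna = pd.DataFrame({
    "occurrenceID": ["occ1", "occ2", "occ3", "occ4"],
    "eventID":      ["1ASSA16S", "1ASSA16S", "1ASSACOI", "1ASSACOI"],  # library-level parent event
    "DNA_sequence": ["ACGT...", "AGGT...", "TTAC...", "TTGC..."],      # 1:1 with occurrenceID
    "pcr_primer_forward": ["16S_fwd"] * 2 + ["COI_fwd"] * 2,           # placeholder primer names
    "lib_layout":   ["paired"] * 4,
})

print(len(dna))  # 4 rows: one per occurrence, forced by DNA_sequence
print(len(dna.drop(columns=["occurrenceID", "DNA_sequence"]).drop_duplicates()))  # 2 rows: one per library event
```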

Here is what the event structure looks like:


```mermaid
graph LR

subgraph event["Event Core"]

  Cruise --> Site1 & Site2

  Site1 --> Station1A & Station1B
  Site2 --> Station2A & Station2B

  Station1A --> 1ASS["Surface Sample"] & 1ABS["Bottom Sample"]
  Station1B --> 1BSS["Surface Sample"] & 1BBS["Bottom Sample"]
  Station2A --> 2ASS["Surface Sample"] & 2ABS["Bottom Sample"]
  Station2B --> 2BSS["Surface Sample"] & 2BBS["Bottom Sample"]

  1ASS --> 1ASSA["Replicate A"] & 1ASSB["Replicate B"]
  1ABS --> 1ABSA["Replicate A"] & 1ABSB["Replicate B"]
  1BSS --> 1BSSA["Replicate A"] & 1BSSB["Replicate B"]
  1BBS --> 1BBSA["Replicate A"] & 1BBSB["Replicate B"]
  2ASS --> 2ASSA["Replicate A"] & 2ASSB["Replicate B"]
  2ABS --> 2ABSA["Replicate A"] & 2ABSB["Replicate B"]
  2BSS --> 2BSSA["Replicate A"] & 2BSSB["Replicate B"]
  2BBS --> 2BBSA["Replicate A"] & 2BBSB["Replicate B"]

  1ASSA --> 1ASSA16S["16S Library"] & 1ASSACOI["COI Library"]
  1ABSA --> 1ABSA16S["16S Library"] & 1ABSACOI["COI Library"]
  1BSSA --> 1BSSA16S["16S Library"] & 1BSSACOI["COI Library"]
  1BBSA --> 1BBSA16S["16S Library"] & 1BBSACOI["COI Library"]
  2ASSA --> 2ASSA16S["16S Library"] & 2ASSACOI["COI Library"]
  2ABSA --> 2ABSA16S["16S Library"] & 2ABSACOI["COI Library"]
  2BSSA --> 2BSSA16S["16S Library"] & 2BSSACOI["COI Library"]
  2BBSA --> 2BBSA16S["16S Library"] & 2BBSACOI["COI Library"]
  1ASSB --> 1ASSB16S["16S Library"] & 1ASSBCOI["COI Library"]
  1ABSB --> 1ABSB16S["16S Library"] & 1ABSBCOI["COI Library"]
  1BSSB --> 1BSSB16S["16S Library"] & 1BSSBCOI["COI Library"]
  1BBSB --> 1BBSB16S["16S Library"] & 1BBSBCOI["COI Library"]
  2ASSB --> 2ASSB16S["16S Library"] & 2ASSBCOI["COI Library"]
  2ABSB --> 2ABSB16S["16S Library"] & 2ABSBCOI["COI Library"]
  2BSSB --> 2BSSB16S["16S Library"] & 2BSSBCOI["COI Library"]
  2BBSB --> 2BBSB16S["16S Library"] & 2BBSBCOI["COI Library"]

end

subgraph occ["Occurrence Extension"]

  1ASSA16S --> occ1["5 occurrences (ASVs)"]
  1ASSACOI --> occ2["5 occurrences (ASVs)"]
  1ABSA16S --> occ3["5 occurrences (ASVs)"]
  1ABSACOI --> occ4["5 occurrences (ASVs)"]
  1BSSA16S --> occ5["5 occurrences (ASVs)"]
  1BSSACOI --> occ6["5 occurrences (ASVs)"]
  1BBSA16S --> occ7["5 occurrences (ASVs)"]
  1BBSACOI --> occ8["5 occurrences (ASVs)"]
  2ASSA16S --> occ9["5 occurrences (ASVs)"]
  2ASSACOI --> occ10["5 occurrences (ASVs)"]
  2ABSA16S --> occ11["5 occurrences (ASVs)"]
  2ABSACOI --> occ12["5 occurrences (ASVs)"]
  2BSSA16S --> occ13["5 occurrences (ASVs)"]
  2BSSACOI --> occ14["5 occurrences (ASVs)"]
  2BBSA16S --> occ15["5 occurrences (ASVs)"]
  2BBSACOI --> occ16["5 occurrences (ASVs)"]
  1ASSB16S --> occ21["5 occurrences (ASVs)"]
  1ASSBCOI --> occ22["5 occurrences (ASVs)"]
  1ABSB16S --> occ23["5 occurrences (ASVs)"]
  1ABSBCOI --> occ24["5 occurrences (ASVs)"]
  1BSSB16S --> occ25["5 occurrences (ASVs)"]
  1BSSBCOI --> occ26["5 occurrences (ASVs)"]
  1BBSB16S --> occ27["5 occurrences (ASVs)"]
  1BBSBCOI --> occ28["5 occurrences (ASVs)"]
  2ASSB16S --> occ29["5 occurrences (ASVs)"]
  2ASSBCOI --> occ30["5 occurrences (ASVs)"]
  2ABSB16S --> occ31["5 occurrences (ASVs)"]
  2ABSBCOI --> occ32["5 occurrences (ASVs)"]
  2BSSB16S --> occ33["5 occurrences (ASVs)"]
  2BSSBCOI --> occ34["5 occurrences (ASVs)"]
  2BBSB16S --> occ35["5 occurrences (ASVs)"]
  2BBSBCOI --> occ36["5 occurrences (ASVs)"]
end
```
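
As a rough sketch of how this hierarchy maps onto Event Core rows, the following enumerates eventID/parentEventID pairs for the same six levels (the identifiers are made up for illustration and are not the ones used in the archive):

```python
# Sketch: enumerate eventID/parentEventID pairs for the toy hierarchy.
# Identifiers are illustrative; the real DwC-A may name events differently.
events = [("Cruise1", None)]                          # top of the hierarchy
for site in ["Site1", "Site2"]:
    events.append((site, "Cruise1"))
    for letter in ["A", "B"]:
        station = f"Station{site[-1]}{letter}"
        events.append((station, site))
        for depth in ["Surface", "Bottom"]:
            sample = f"{station}_{depth}Sample"
            events.append((sample, station))
            for rep in ["A", "B"]:
                replicate = f"{sample}_Rep{rep}"
                events.append((replicate, sample))
                for marker in ["16S", "COI"]:
                    events.append((f"{replicate}_{marker}", replicate))

# 1 cruise + 2 sites + 4 stations + 8 samples + 16 replicates + 32 libraries = 63 events
print(len(events))
```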
tobiasgf commented 4 days ago

Very nice to have a simple, but realistically structured dataset to work with.

In relation to bullet 3: this touches on an aspect related to what @pragermh brings up in https://github.com/gbif/rs.gbif.org/issues/136#issuecomment-2136912091. If you look at eDNA data from the OTU-table perspective, there are two dimensions with metadata: metadata that relates to the OTUs and metadata that relates to the Samples. All Sample metadata is in reality connected to the Event (or a parentEvent), and DNA_sequence is in practice the only part of the OTU dimension that sits in the DNA extension. All the other terms/values relating to the OTU, e.g. taxonomy-related fields, are currently accommodated in the Occurrence Core. NB: the only value in a metabarcoding dataset that is truly occurrence-specific is the read_count (the read abundance of OTU X in Sample Y), currently accommodated in organismQuantity of the Occurrence Core.
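
To illustrate the two dimensions, here is a minimal sketch (pandas, with a made-up OTU table) of how the sample axis maps to eventID while the read count, the only truly occurrence-specific value, lands in organismQuantity:

```python
# Sketch: melt a hypothetical OTU table (OTUs x samples) into occurrence-style rows.
import pandas as pd

otu_table = pd.DataFrame(
    {"1ASSA16S": [120, 0, 33], "1ABSA16S": [5, 88, 0]},   # columns = library-level events
    index=["ASV_0001", "ASV_0002", "ASV_0003"],            # rows = OTUs/ASVs
)

occ = (otu_table
       .rename_axis("otuID")
       .reset_index()
       .melt(id_vars="otuID", var_name="eventID", value_name="organismQuantity"))
occ = occ[occ["organismQuantity"] > 0]                      # keep detected OTU-sample pairs only
occ["organismQuantityType"] = "DNA sequence reads"
print(occ)
```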

These facts could be used to reduce the redundancy of the DwC-A for such datasets in several ways. Some ideas building on the above:

Some observations in relation to the toy dataset structure, MDT, etc.

@sformel-usgs: Do you have the underlying fictional OTU tables, so we may use these in this exploration?

sformel-usgs commented 1 day ago

@tobiasgf I revised the DwC-A slightly based on your comments. I updated the link above with the new DwC-A (v1.5). To your last point, I understand the concern over generally recommending this structure, and I agree that it shouldn't be presented as the optimal structure for all datasets. But this is an important option for complicated projects that are otherwise struggling to flatten their data for DwC.

> Some observations in relation to the toy dataset structure, MDT, etc.
>
> My worries for generally recommending a data structure like this (as opposed to separate datasets): 1) It may be error prone to prepare "by hand", ...
>
> ...2) If preparing and publishing without using the MDT, the OTU tables (and marker-gene specific datasets) are difficult to reconstruct for people who want to re-use this as metabarcoding data.

ksilnoaa commented 1 day ago

Hi all, excited to see this discussion going on. We are hoping to publish a number of datasets to GBIF/OBIS in the next few months, and we would ideally want to be able to leverage the new data model and implement it in our edna2obis pipeline so our data can be mobilized in an automated way. As was mentioned in the meeting today, there are really two types of data re-users, and we at NOAA want to make sure our data and metadata are connected in a way that serves both.

1) People who want all the data from a specific cruise. Ideally, they can go to the Dataset landing page for a cruise, and see some nice summary stats about the occurrences and markers used (I think OBIS does this particularly well).

The user should ideally be able to download the data as separate OTU and taxonomy tables for each marker from the website and from the mapper or API, or at the very least download the DwC-A and use code to get those separate OTU tables (see the first sketch after this list). Any user (even one who isn't already familiar with the dataset) should be able to associate occurrences from different markers that derive from the same water sample.

2) People who access the occurrences as "dots on a map". When downloading the data, these users should get all of the metadata associated with the occurrence (the event sample metadata, the DNA-derived data, taxonomy, the project metadata like who collected it, the total read_count per marker per sample, etc.); see the second sketch below.
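
For the first need, a hedged sketch of getting marker-specific OTU tables back out of a downloaded archive, assuming long-format occurrence rows like the toy dataset's and an eventID naming convention that encodes the marker (both are assumptions, not guaranteed by the standard):

```python
# Sketch: rebuild one OTU table per marker from long-format occurrence rows.
# "occurrence.txt" and the column names are assumptions about the unzipped DwC-A.
import pandas as pd

occ = pd.read_csv("occurrence.txt", sep="\t")
# Assumption: the marker is recoverable from the library-level eventID suffix.
occ["marker"] = occ["eventID"].str.extract(r"(16S|COI)$", expand=False)

otu_tables = {
    marker: grp.pivot_table(index="scientificName",      # or an ASV/OTU identifier
                            columns="eventID",           # one column per library event
                            values="organismQuantity",
                            aggfunc="sum",
                            fill_value=0)
    for marker, grp in occ.groupby("marker")
}
print({m: t.shape for m, t in otu_tables.items()})
```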

I know some of this functionality, like including OTU tables, is in the future, but if there were guidance on how to format our data files now so that they can be appropriately linked in the future for both of these user needs, that would be great. Thanks for your work on this!!
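
And for the second need, a hedged sketch of flattening the archive so each occurrence row carries its DNA-derived metadata plus the metadata of every ancestor event (file and column names are assumptions about a standard Event Core + Occurrence + DNA-derived-data archive):

```python
# Sketch: attach DNA-derived metadata and the full event-hierarchy metadata to
# each occurrence. File/column names are assumptions, not the published schema.
import pandas as pd

event = pd.read_csv("event.txt", sep="\t").set_index("eventID")
occ   = pd.read_csv("occurrence.txt", sep="\t")
dna   = pd.read_csv("dna_derived_data.txt", sep="\t")     # assumed file name

def ancestor_metadata(event_id):
    """Collect metadata from an event and all of its parents, nearest first."""
    rows = []
    while pd.notna(event_id) and event_id in event.index:
        rows.append(event.loc[event_id])
        event_id = event.loc[event_id, "parentEventID"]
    if not rows:
        return pd.Series(dtype=object)
    return pd.concat(rows).groupby(level=0).first()       # child values win on clashes

flat = occ.merge(dna, on="occurrenceID", how="left", suffixes=("", "_dna"))
hierarchy = flat["eventID"].apply(ancestor_metadata)       # merged ancestor metadata per occurrence
flat = pd.concat([flat, hierarchy.add_suffix("_event")], axis=1)
print(flat.head())
```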

sformel-usgs commented 1 day ago

...and almost forgot, here are the OTU tables and other files. GitHub won't let me attach the fasta file, but I can email it. Happy to revise any of these if needed:

- dna_metadata.csv
- samp_metadata.csv
- tax_table.csv
- 16S_otu_table.csv
- COI_otu_table.csv