gbif / rs.gbif.org

GBIF machine-readable resources
https://rs.gbif.org

Add occurrenceID to DNA derived data extension #136

Closed thomasstjerne closed 4 months ago

thomasstjerne commented 6 months ago

In order to use the DNA derived data extension with eventCore, it has been requested by @pragermh from the Swedish Biodiversity Data Infrastructure (SBDI) and @LynnDelgat from OBIS to add occurrenceID, following the same approach as in the extended measurements and facts extension.

dagendresen commented 6 months ago

Maybe more useful to add the eventID to the DNA extensions...? With the inferred "occurrences" also connected to the same eventID.

thomasstjerne commented 6 months ago

The rationale here is that most attributes in the DNA extension are actually about the event / sample. This creates a lot of redundancy, and some users have ended up with gigabyte-sized archives because of it. Using the eMOF approach, all data about an event could be given in one row pointing only to the event, whereas the sequences could be given with foreign keys to both the Occurrence and the Event.

If the extension is used with EventCore, I think the eventID would implicitly be given as coreID. On the other hand, when used with OccurrenceCore, the eventID should be on the Core, i.e. OccurrenceCore:eventID.
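A minimal sketch of the two layouts described above, to make the redundancy concrete. The identifiers and values are made up for illustration, only a handful of extension terms are shown, and the plain dictionaries stand in for extension rows; this is not a definition of the extension itself.

```python
# Hypothetical data: one sampling event with three sequences detected in it.
EVENT_ID = "event-1"
event_attributes = {
    "env_medium": "sea water",
    "target_gene": "16S rRNA",
    "pcr_primer_forward": "GTGYCAGCMGCCGCGGTAA",
}
sequences = {"occ-1": "ACGT", "occ-2": "TTGA", "occ-3": "GGCC"}


def occurrence_core_layout(event_attrs, seqs):
    """One extension row per occurrence; event-level attributes repeat on every row."""
    return [
        {"occurrenceID": occ, **event_attrs, "DNA_sequence": seq}
        for occ, seq in seqs.items()
    ]


def event_core_layout(event_id, event_attrs, seqs):
    """eMOF-style: one event-level row (blank occurrenceID), then one slim row per sequence."""
    rows = [{"eventID": event_id, "occurrenceID": "", **event_attrs}]
    rows += [
        {"eventID": event_id, "occurrenceID": occ, "DNA_sequence": seq}
        for occ, seq in seqs.items()
    ]
    return rows


def filled_cells(rows):
    """Count the non-empty values across all rows."""
    return sum(1 for row in rows for value in row.values() if value != "")


wide = occurrence_core_layout(event_attributes, sequences)
slim = event_core_layout(EVENT_ID, event_attributes, sequences)
print(filled_cells(wide), filled_cells(slim))
```

With three sequences and three event-level attributes the difference is small, but in the first layout the cell count grows with (occurrences × attributes), while in the second the event-level attributes are written only once per event.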

timrobertson100 commented 6 months ago

This makes a lot of sense to me, and it is something that is increasingly popping up in other discussions (e.g. media captured at an event and occurrence level).

pragermh commented 6 months ago

Thanks for bringing this up for discussion! If I understand this correctly, the current DNA derived data extension requires me to use Occurrence core (because we report sequences at that level), which leads to unnecessarily large files for datasets with hundreds or thousands of occurrences per sample. The worst cases occur when this is combined with many tens of contextual parameters, as the Occurrence core then requires me to add rows for each occurrence in eMOF. Or is there a way around this?

But when I think of it, the DNA sequence is not really an occurrence-level attribute either. If we find the same ASV in hundreds of samples, we would still have to repeat the same, say, 400-character string hundreds of times. Would it be possible to add a taxonID column to the DNA extension as well?

pragermh commented 6 months ago

Sorry, a taxonID probably will not help if I still have to include a sampleID though...

thomasstjerne commented 6 months ago

@pragermh I get your point. What you basically suggest is the structure of an ASV/OTU table, i.e. sampling events connected to taxa that each have a sequence. I think this is good input for coming discussions around the data model.
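A rough sketch of the ASV/OTU-table idea, assuming a hypothetical lookup keyed by an ASV identifier and a separate list of (sample, ASV) detections. The names and numbers are illustrative only and do not correspond to any existing extension or agreed data model.

```python
# Hypothetical data: a single 400-character ASV detected in 300 samples.
asv_sequences = {"asv-1": "ACGT" * 100}
detections = [(f"sample-{i}", "asv-1") for i in range(300)]


def repeated_sequence_chars(detections, asv_sequences):
    """Sequence characters written when every detection row carries the full sequence."""
    return sum(len(asv_sequences[asv_id]) for _, asv_id in detections)


def normalized_sequence_chars(detections, asv_sequences):
    """Sequence characters written when each distinct ASV is stored once and only referenced."""
    distinct_asvs = {asv_id for _, asv_id in detections}
    return sum(len(asv_sequences[asv_id]) for asv_id in distinct_asvs)


print(repeated_sequence_chars(detections, asv_sequences))
print(normalized_sequence_chars(detections, asv_sequences))
```

In this toy case the per-row layout writes the sequence 300 times, while the table-style layout writes it once and keeps only the short identifier on each detection row.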