gbif / portal-feedback

User feedback for the GBIF API, website and published data. You can ask questions here. 🗨❓

Possible examples and need for sharing mixed observation and vouchered specimen record datasets #4432

Open debpaul opened 1 year ago

debpaul commented 1 year ago

Greetings GBIF,

RE: datasets containing both observation records (e.g. remote monitoring, human observation) and vouchered specimen data records

Scenario: In monitoring native bees for a given region on the planet (i.e. humans looking at bees and identifying them, maybe imaging them), data are to be collected in a spreadsheet. To start with, for each morphospecies encountered at a given monitoring site, a single specimen will be collected and then vouchered in a natural science collection. Data about this specimen will be entered as a record row in the aforementioned spreadsheet.

Questions:

Other thoughts

@MattBlissett @timrobertson100 and GBIF folks thanks for your help. Also please note I wasn't sure which GBIF repository this ticket would go in. Perhaps it will need to be moved to a different repo. Tagging @seltmann

albenson-usgs commented 1 year ago

Realize I'm not the people you were reaching out to on this but I am currently working with a data provider from USDA with exactly this type of information collected. For one project they have vouchered specimens (basisOfRecord = PreservedSpecimen), video observations (basisOfRecord = MachineObservation), and visual observations (basisOfRecord = HumanObservation). In this instance we have elected to create three separate datasets using occurrence core, but I suggested that only for the convenience of the data provider. I think it would work equally well to combine them into one dataset, especially since they were all collected at the same location and time and as part of the same project. You can also use the occurrence core and still provide sampling event information, and GBIF will still identify the events (example dataset here, which was published using occurrence core, but you can see the events are still identified on the GBIF dataset landing page).
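To make the "one combined dataset" option concrete, here is a minimal sketch of what a few occurrence-core rows with mixed basisOfRecord values could look like. The column names are Darwin Core terms; every value (IDs, event ID, species name) is invented for illustration and does not come from the USDA project or any real dataset.

```python
import csv
import io

# Hypothetical rows: one vouchered specimen, one video record, one visual
# observation, all sharing the same (made-up) eventID.
ROWS = [
    {"occurrenceID": "bee-0001", "eventID": "site-A-2023-06-01",
     "basisOfRecord": "PreservedSpecimen", "scientificName": "Bombus impatiens"},
    {"occurrenceID": "bee-0002", "eventID": "site-A-2023-06-01",
     "basisOfRecord": "MachineObservation", "scientificName": "Bombus impatiens"},
    {"occurrenceID": "bee-0003", "eventID": "site-A-2023-06-01",
     "basisOfRecord": "HumanObservation", "scientificName": "Bombus impatiens"},
]

FIELDNAMES = ["occurrenceID", "eventID", "basisOfRecord", "scientificName"]


def write_occurrence_core(rows):
    """Return the rows as tab-delimited text, one occurrence per line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(write_occurrence_core(ROWS))
```

The shared eventID is what would let the three record types be tied back to the same monitoring visit.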

On your other thoughts: users get back data of mixed basisOfRecord all the time, because they are usually searching by taxon or location and a mix of datasets will provide the records for that location or taxon. So unless they explicitly exclude a certain kind of basisOfRecord, they will have mixed types in their download.
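As a rough sketch of what "explicitly removing a certain kind of basisOfRecord" might look like on the user's side, the snippet below tallies and filters records that have already been loaded into memory. The records and their gbifID values are made up, and the upper-case basisOfRecord spellings are assumed to match what the download contains.

```python
from collections import Counter

# Hypothetical records standing in for rows of a mixed download.
RECORDS = [
    {"gbifID": "1", "basisOfRecord": "PRESERVED_SPECIMEN"},
    {"gbifID": "2", "basisOfRecord": "HUMAN_OBSERVATION"},
    {"gbifID": "3", "basisOfRecord": "HUMAN_OBSERVATION"},
    {"gbifID": "4", "basisOfRecord": "MACHINE_OBSERVATION"},
]


def count_by_basis(records):
    """Count how many records fall under each basisOfRecord value."""
    return Counter(record["basisOfRecord"] for record in records)


def keep_only(records, allowed):
    """Keep only records whose basisOfRecord is in the allowed set."""
    return [record for record in records if record["basisOfRecord"] in allowed]


if __name__ == "__main__":
    print(count_by_basis(RECORDS))
    print(keep_only(RECORDS, {"PRESERVED_SPECIMEN"}))
```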

Hopefully GBIF-S will correct me where I've misspoken.

debpaul commented 1 year ago

@albenson-usgs

Realize I'm not the people you were reaching out to on this but I am currently working with a data provider from USDA with exactly this type of information collected.

Marvelous, your response is wonderful and timely. Thank you for taking the time to offer your experience, and examples! Much appreciated.

We can both look forward to further insights from GBIF-S.

timrobertson100 commented 1 year ago

Thanks @debpaul @albenson-usgs

I'm sure others will jump in, but I will make a start and try and provide some background information.

Firstly, Abby is correct in her reply (thank you). BasisOfRecord can be mixed within a dataset and is commonly varied in a download.

Secondly, we're aware that the current dataset classes are insufficient and confusing, and we have a discussion underway to provide a better categorization of datasets. An example of where confusion appears is that checklists can have occurrence data, and an occurrence dataset can hold IDs for sampling events in the occurrence core.

The dataset classes originate from the core record type used in the DwC-A format: checklists use Taxon, occurrence datasets use Occurrence, and sampling event datasets use Event. However, that is not enforced by the system; the class just represents the option chosen when registering a dataset (i.e. "I am registering a new dataset of this type which can be indexed from this archive", where the type and what is actually in the archive aren't verified against each other).
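For anyone wanting to check which core an archive actually uses, here is a small sketch that reads the core rowType from a DwC-A `meta.xml` descriptor. The sample descriptor is a trimmed, made-up example, and the mapping to dataset classes only restates the correspondence described above; it is not something GBIF enforces.

```python
import xml.etree.ElementTree as ET

# Namespace used by Darwin Core Archive descriptors (meta.xml).
DWCA_NS = "{http://rs.tdwg.org/dwc/text/}"

# Core row type -> dataset class, as described in this thread.
CORE_TO_CLASS = {
    "http://rs.tdwg.org/dwc/terms/Taxon": "checklist",
    "http://rs.tdwg.org/dwc/terms/Occurrence": "occurrence",
    "http://rs.tdwg.org/dwc/terms/Event": "sampling event",
}

# Trimmed, hypothetical descriptor for illustration only.
SAMPLE_META = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
  </core>
</archive>"""


def core_row_type(meta_xml):
    """Return the rowType attribute of the <core> element in a meta.xml string."""
    root = ET.fromstring(meta_xml)
    core = root.find(DWCA_NS + "core")
    if core is None:
        raise ValueError("meta.xml has no <core> element")
    return core.get("rowType")


if __name__ == "__main__":
    row_type = core_row_type(SAMPLE_META)
    print(row_type, "->", CORE_TO_CLASS.get(row_type, "unknown"))
```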

Please also note that the dataset type does not appear in the occurrence search interface of GBIF, nor the occurrence API. They are all occurrence records at this point, with a basisOfRecord. It only appears in the dataset listing and search along with summary statistics (e.g. by country).
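To illustrate, a query against the occurrence search API filters on basisOfRecord and dataset rather than on dataset type. The sketch below only builds the request URL and makes no network call; the dataset key is a placeholder, not a real dataset, and the helper name is my own.

```python
from urllib.parse import urlencode

# Public GBIF occurrence search endpoint.
BASE_URL = "https://api.gbif.org/v1/occurrence/search"


def occurrence_search_url(dataset_key, basis_of_record, limit=20):
    """Build a search URL restricted to one dataset and one basisOfRecord."""
    params = {
        "datasetKey": dataset_key,
        "basisOfRecord": basis_of_record,
        "limit": limit,
    }
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder key for illustration only.
    print(occurrence_search_url("00000000-0000-0000-0000-000000000000",
                                "PRESERVED_SPECIMEN"))
```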

So where does that leave you?

My advice would be to:

  1. Focus on which core record type is best for your data (Occurrence or Event). That will be influenced by what extensions (images, measurements etc.) you wish to attach. It's fine to mix basis of record within the dataset.
  2. Use the measurementOrFact extension for things that don't fit into the core data.
  3. Provide the best metadata you can (EML), e.g. document the sampling methodology.
  4. When registering in GBIF, if the dataset holds sufficient information to support statistical analysis based on a sampling methodology (e.g. abundance or community composition), then select sampling event; otherwise choose occurrence.

With that said, I may be out of touch and you may get a revised suggestion on 4).
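Point 4 can be written down as a tiny decision rule. This is only a sketch of the suggestion above; the function and parameter names are hypothetical, and how you judge "sufficient information" is left to the publisher.

```python
def suggest_dataset_class(documents_sampling_method, supports_statistical_analysis):
    """Suggest a registration class following point 4 of the advice above.

    Both arguments are the publisher's own yes/no judgments about the dataset.
    """
    if documents_sampling_method and supports_statistical_analysis:
        return "sampling event"
    return "occurrence"


if __name__ == "__main__":
    # A monitoring dataset with a documented protocol and abundance data.
    print(suggest_dataset_class(True, True))
    # Opportunistic records without a usable sampling design.
    print(suggest_dataset_class(False, False))
```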

ahahn-gbif commented 1 year ago

Hi Deb, All, I am moving this over to portal-feedback, as the mobilization repo has a slightly different context.

Thanks, @albenson-usgs and @timrobertson100, for your responses - spot on! Just to add that yes, it would indeed be preferable to use the "sampling event" class for datasets that have well-documented sampling methodology and other relevant metadata. That said, with the current limitations of the DwC-A "star schema" (linked extensions only one level deep), the decision is often based instead on the extension data that should or need to be included. As Tim says, we are aware of that particular limitation and are working on solutions in model discussions.