Possible examples and need for sharing mixed observation and vouchered specimen record datasets

Greetings GBIF,

RE: datasets containing both observation records (e.g. remote monitoring, human observation) and vouchered specimen data records

Scenario: In monitoring native bees for a given region on the planet (i.e. humans looking at bees and identifying them, maybe imaging them), data are to be collected in a spreadsheet. To start with, for each morphospecies encountered at a given monitoring site, a single specimen will be collected and then vouchered in a natural science collection. Data about this specimen will be entered as a record row in the aforementioned spreadsheet.

Questions:

- [ ] What dataset class (if any) would fit? I can't quite figure this out because the "basis of record" for each row in the occurrence.txt file might vary (from "PreservedSpecimen" to "HumanObservation" to "MachineObservation" to "Observation" to "Occurrence").
  - [ ] Is it okay for the basisOfRecord to vary inside a dataset? If yes, would it be okay to use the Occurrence dataset class?
  - [ ] In the above scenario, it's a certainty there will be an explicit published (on Zenodo?) monitoring protocol that covers when to collect and how to collect for vouchering. Would it be preferable then to use the Sampling event dataset class instead?
- [ ] Is it possible to share / publish on GBIF such a "mixed" dataset? (Yes / No / Maybe)
- [ ] If yes, does such a dataset exist that a new project could use as a starting point to expand on for this particular use case? Please share links to any such datasets.
- [ ] In any such dataset, say, the Sampling event dataset class or the Occurrence dataset class, can we add more dwc fields than you list? Neither one of them (in your examples) includes all the fields likely to be collected in the above scenario.

Other thoughts

For anyone searching GBIF using dwc:basisOfRecord, would they, as a user, expect their downloaded datasets to be mixed for basisOfRecord values? (In other words, if I select dwc:basisOfRecord = HumanObservation, would I expect, or be confused by, a result set that included all the records from a published mixed dataset that contains vouchered specimen records intermixed with observation records?)

@MattBlissett @timrobertson100 and GBIF folks thanks for your help. Also please note I wasn't sure which GBIF repository this ticket would go in. Perhaps it will need to be moved to a different repo. Tagging @seltmann

Realize I'm not the people you were reaching out to on this but I am currently working with a data provider from USDA with exactly this type of information collected. For one project they have vouchered specimens (basisOfRecord = PreservedSpecimen), video observations (basisOfRecord = MachineObservation), and visual observations (basisOfRecord = HumanObservation). In this instance we have elected to create three separate datasets using occurrence core, but I suggested that only for the convenience of the data provider. I think it would work equally well to combine them into one dataset, especially since they were all collected at the same location and time and as part of the same project. Also, you can use the occurrence core and still provide sampling-event type information, and GBIF will still identify the events (example dataset here, which was published using occurrence core, but you can see the events are still identified on the GBIF dataset landing page).
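For illustration, a combined occurrence-core table of the kind described above might be sketched as follows. This is only a minimal sketch: the column headers are Darwin Core terms, but every value (identifiers, dates, the catalog number) is invented, and `write_occurrence_table` is a hypothetical helper rather than part of any GBIF tool.

```python
import csv
import io

# Darwin Core terms used as column headers.
FIELDS = ["occurrenceID", "basisOfRecord", "eventID", "eventDate",
          "scientificName", "catalogNumber"]

# Invented example rows: one voucher and two observations that share a
# single monitoring event (same site, same day).
rows = [
    {"occurrenceID": "bee-0001", "basisOfRecord": "PreservedSpecimen",
     "eventID": "site1-2021-06-01", "eventDate": "2021-06-01",
     "scientificName": "Bombus impatiens", "catalogNumber": "HYP-000123"},
    {"occurrenceID": "bee-0002", "basisOfRecord": "HumanObservation",
     "eventID": "site1-2021-06-01", "eventDate": "2021-06-01",
     "scientificName": "Bombus impatiens", "catalogNumber": ""},
    {"occurrenceID": "bee-0003", "basisOfRecord": "MachineObservation",
     "eventID": "site1-2021-06-01", "eventDate": "2021-06-01",
     "scientificName": "Bombus impatiens", "catalogNumber": ""},
]


def write_occurrence_table(records):
    """Return the records as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


occurrence_txt = write_occurrence_table(rows)
print(occurrence_txt)
```

The shared `eventID` is what ties the voucher and the observations back to the same sampling event, while `catalogNumber` is only filled in for the vouchered specimen.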

For your other thoughts: users get back data of mixed basisOfRecord all the time, because they are usually searching by taxon or location and there will be a mix of datasets providing the records for that location or taxon. So unless they explicitly exclude a certain kind of basisOfRecord, they will have a mix of types in their download.
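As a rough sketch of what that looks like on the user's side, the snippet below tallies a small, invented set of downloaded records by basisOfRecord and then keeps only the ones a user asks for. The records and the helper names `count_by_basis` and `keep_basis` are hypothetical and only illustrate the filtering step.

```python
from collections import Counter

# Invented records standing in for rows of a mixed download.
records = [
    {"occurrenceID": "a1", "basisOfRecord": "PreservedSpecimen"},
    {"occurrenceID": "a2", "basisOfRecord": "HumanObservation"},
    {"occurrenceID": "a3", "basisOfRecord": "HumanObservation"},
    {"occurrenceID": "a4", "basisOfRecord": "MachineObservation"},
]


def count_by_basis(rows):
    """Count how many records carry each basisOfRecord value."""
    return Counter(row["basisOfRecord"] for row in rows)


def keep_basis(rows, wanted):
    """Keep only records whose basisOfRecord is in the wanted set."""
    return [row for row in rows if row["basisOfRecord"] in wanted]


print(count_by_basis(records))
observations_only = keep_basis(records, {"HumanObservation"})
print([row["occurrenceID"] for row in observations_only])
```

The filter works per record, so a mixed dataset contributes only its matching rows to a filtered result.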

Hopefully GBIF-S will correct me where I've misspoken.

@albenson-usgs

> Realize I'm not the people you were reaching out to on this but I am currently working with a data provider from USDA with exactly this type of information collected.

Marvelous, your response is wonderful and timely. Thank you for taking the time to offer your experience, and examples! Much appreciated.

We can both look forward to further insights from GBIF-S.

Thanks @debpaul @albenson-usgs

I'm sure others will jump in, but I will make a start and try and provide some background information.

Firstly, Abby is correct in her reply (thank you). BasisOfRecord can be mixed within a dataset and is commonly varied in a download.

Secondly, we're aware that the current dataset classes are insufficient and confusing, and we have a discussion underway to provide a better categorization of datasets. An example of where confusion appears is that checklists can have occurrence data, and an occurrence dataset can hold IDs for sampling events in the occurrence core.

The dataset classes originate from the core record type used in the DwC-A format: Checklists use Taxon, Occurrence datasets use Occurrence, and Sampling Event datasets use Event. However, that is not enforced by the system; it just represents the option chosen when registering a dataset (i.e. "I am registering a new dataset of this type which can be indexed from this archive", where the type and what is in the archive aren't verified).

Please also note that the dataset type does not appear in the occurrence search interface of GBIF, nor the occurrence API. They are all occurrence records at this point, with a basisOfRecord. It only appears in the dataset listing and search along with summary statistics (e.g. by country).
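To make that concrete, here is a small sketch that builds an occurrence search URL filtered by basisOfRecord rather than by dataset type. It assumes the public GBIF API v1 occurrence search endpoint and its `basisOfRecord`, `datasetKey` and `limit` query parameters as I understand them, so please check the current API documentation before relying on it. The function name is hypothetical, and the code only constructs the URL without sending a request.

```python
from urllib.parse import urlencode

# Assumed endpoint of the public GBIF occurrence search API (v1).
GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def occurrence_search_url(basis_of_record, dataset_key=None, limit=20):
    """Build a search URL filtered by basisOfRecord.

    basis_of_record is expected in the API's enumeration style,
    e.g. "HUMAN_OBSERVATION" or "PRESERVED_SPECIMEN".
    dataset_key optionally restricts the search to one dataset.
    """
    params = {"basisOfRecord": basis_of_record, "limit": limit}
    if dataset_key is not None:
        params["datasetKey"] = dataset_key
    return GBIF_OCCURRENCE_SEARCH + "?" + urlencode(params)


url = occurrence_search_url("HUMAN_OBSERVATION")
print(url)
```

Note that there is no dataset-type parameter in this sketch: the filter acts on each record's basisOfRecord.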

So where does that leave you?

My advice would be to:

1. Focus on which core record type is best for your data (Occurrence or Event). That will be influenced by what extensions (images, measurements etc.) you wish to attach. It's fine to mix basis of record within the dataset.
2. Use the measurementOrFact extension for things that don't fit into the core data.
3. Focus on the best metadata you can (EML) and e.g. document sampling methodology.
4. When registering in GBIF, if the dataset holds sufficient information that it supports statistical analysis based on a sampling methodology (e.g. abundance, or community composition etc.) then select sampling event, otherwise choose occurrence.
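As a minimal sketch of the measurementOrFact suggestion, the snippet below turns extra spreadsheet columns that have no place in the core into one extension row per value. The `measurementType`, `measurementValue` and `measurementUnit` keys are Darwin Core terms; the `coreId` key is just a placeholder for whatever column links an extension row back to its core record, and the function name, field names and values are invented for illustration.

```python
def to_measurement_or_fact(core_id, extras, units=None):
    """Convert extra key/value pairs into measurementOrFact-style rows.

    core_id : identifier of the core record the values belong to
    extras  : dict of measurement name -> value
    units   : optional dict of measurement name -> unit
    """
    units = units or {}
    rows = []
    for measurement_type, value in extras.items():
        rows.append({
            "coreId": core_id,  # placeholder name for the linking column
            "measurementType": measurement_type,
            "measurementValue": str(value),
            "measurementUnit": units.get(measurement_type, ""),
        })
    return rows


# Invented extra fields recorded alongside one bee observation.
mof_rows = to_measurement_or_fact(
    "bee-0002",
    {"air temperature": 24.5, "flower visited": "Monarda fistulosa"},
    units={"air temperature": "degrees Celsius"},
)
for row in mof_rows:
    print(row)
```

Each extra column becomes its own row, so new kinds of measurements can be added later without changing the core table.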

With that said, I may be out of touch and you may get a revised suggestion on 4).

Hi Deb, All, I am moving this over to portal-feedback, as the mobilization repo has a slightly different context.

Thanks, @albenson-usgs and @timrobertson100, for your responses - spot on! Just to add that yes, it would indeed be preferable to use the "sampling event" class for datasets that have well-documented sampling methodology and other relevant metadata. That said, with the current limitations of the DwC-A "star schema" (linked extensions only one level deep), the decision is often based instead on the extension data that should or needs to be included. As Tim says, we are aware of that particular limitation and are working on solutions in model discussions.

gbif / portal-feedback #4432