Add field for source of occurrence data?

shawndove commented 11 months ago

Description: I would like to suggest the addition of a field to occurrence cubes that contains information about the source of the observations.

Background: I am implementing general biodiversity indicators, such as species richness and evenness, using the occurrence cubes. One of the main challenges in implementation is that there are temporal, spatial, and taxonomic biases that I am attempting to mitigate or account for in some way. However, there are different biases associated with different data source types, and it would be very useful to be able to deal with these biases separately, or at least to compare them.

Suggested Implementation: One possible way to add this field would be to include it as a new column, which could be populated with either numerical codes linked to source types (e.g., citizen science data, scientific survey, museum collection) or with source names. However, I do not know what is feasible.

peterdesmet commented 11 months ago

Hi @shawndove

Dataset source

GBIF currently doesn't have source information (citizen-science, survey, etc.) per dataset, at least not in a structured way. I believe there were discussions about adding machine-generated tags to datasets (e.g. source: citizen science, spatial-resolution: gridded), but I don't know if that is still on the radar. To work as a dimension in the cube, each dataset would also need exactly one source assigned.

What is currently possible, however, is to use one of the following:

basisOfRecord

Each occurrence has a single basisOfRecord (provided by the publisher), which broadly characterizes it as PRESERVED_SPECIMEN, MACHINE_OBSERVATION, HUMAN_OBSERVATION, etc. This is not at the level you are looking for, but can potentially help as a dimension. It is also possible to filter on these categories before creating a cube. E.g. this is a query for machine observations only: https://www.gbif.org/occurrence/search?basis_of_record=MACHINE_OBSERVATION&advanced=1
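
As a small illustration of the filter above, the search URL can be assembled programmatically. This is only a sketch: it reuses the `basis_of_record` and `advanced` query parameters exactly as they appear in the link above, and the helper name `occurrence_search_url` is made up for this example.

```python
from urllib.parse import urlencode

# Base URL taken from the example link in this comment.
BASE_URL = "https://www.gbif.org/occurrence/search"


def occurrence_search_url(basis_of_record):
    """Build a search URL restricted to one basisOfRecord value.

    Hypothetical helper; the parameter names mirror the example link.
    """
    params = {"basis_of_record": basis_of_record, "advanced": 1}
    return f"{BASE_URL}?{urlencode(params)}"


url = occurrence_search_url("MACHINE_OBSERVATION")
print(url)
```

Swapping in another category, such as `HUMAN_OBSERVATION` or `PRESERVED_SPECIMEN`, should give the corresponding filtered search.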

datasetKey

Each occurrence has a single datasetKey (the identifier of the dataset to which the occurrence belongs). This too can be used as a filter or dimension. As a dimension, it would give you a much higher cardinality than you are looking for, but you could look up and assign your own "source" to each datasetKey to reduce the number of categories to something that works for you.
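
The lookup idea could be sketched roughly as follows. Everything specific here is an assumption for illustration: the dataset keys are placeholders rather than real GBIF identifiers, the column names (`datasetKey`, `speciesKey`, `occurrences`) are a guess at what a cube with datasetKey as a dimension might contain, and the source categories are the ones mentioned earlier in this thread.

```python
from collections import defaultdict

# Hypothetical hand-curated lookup: datasetKey -> source category.
# These keys are placeholders, not real GBIF dataset identifiers.
SOURCE_BY_DATASET = {
    "dataset-key-a": "citizen science",
    "dataset-key-b": "scientific survey",
    "dataset-key-c": "museum collection",
}


def aggregate_by_source(rows, lookup, default="unclassified"):
    """Sum occurrence counts per (source, speciesKey).

    This collapses the high-cardinality datasetKey dimension into a
    handful of source categories. Unknown keys fall into `default`.
    """
    totals = defaultdict(int)
    for row in rows:
        source = lookup.get(row["datasetKey"], default)
        totals[(source, row["speciesKey"])] += row["occurrences"]
    return dict(totals)


# Illustrative cube rows; the column names are assumptions.
cube_rows = [
    {"datasetKey": "dataset-key-a", "speciesKey": 1, "occurrences": 5},
    {"datasetKey": "dataset-key-a", "speciesKey": 1, "occurrences": 2},
    {"datasetKey": "dataset-key-b", "speciesKey": 1, "occurrences": 3},
    {"datasetKey": "dataset-key-x", "speciesKey": 2, "occurrences": 1},
]

by_source = aggregate_by_source(cube_rows, SOURCE_BY_DATASET)
for (source, species_key), count in sorted(by_source.items()):
    print(source, species_key, count)
```

Keeping an explicit "unclassified" bucket makes it easy to spot datasets that still need a source assigned, rather than silently dropping their occurrences.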

Both basisOfRecord and datasetKey have been suggested as dimensions in the spec.

shawndove commented 11 months ago

Thank you @peterdesmet for a clear explanation of the options. I think the datasetKey would be appropriate for my use case.
