gbif / registry

GBIF Registry
Apache License 2.0

Add category to dataset #247

Open timrobertson100 opened 4 years ago

timrobertson100 commented 4 years ago

The current Dataset has type and subtype which is slightly problematic. Type is really indicating the row format used in the DwC-A and causes problems since a checklist can have occurrences, and an occurrence dataset can in fact be the output of sampling event data.

Better use of SubType may help, but I feel it could add more confusion due to the overlap (e.g. an occurrence dataset with subtype sampling event).

Since the API is now so well used and changing this is disruptive, I propose to introduce a new multi-value field named category to categorize datasets. In time we can deprecate type and subtype.

The categories would include the likes of (edited to include suggestions that came in from chat below):

  1. Citizen science data
  2. Observation data
  3. Natural history collection
     a. Consider separating out fossils as a separate category, to avoid accidental misuse
  4. Single organism sequenced (i.e. tissue from an NHM specimen)
     a. Consider adding tissue sample as well (which may or may not be sequenced) to aid discovery of preserved tissue without drawing on ambiguous other terms
  5. Environmental DNA and/or metagenomics (e.g. soil sample, water, insect soup etc)
  6. Targeted species detection (PCR-based assays)
  7. Long term monitoring data
  8. Sampling event (where some protocol has been used)
  9. Checklist data
  10. Material citations (e.g. taxonomic treatments in literature)
  11. Private sector data
      a. Consider splitting this into finer categories (e.g. proponent data for environmental impact assessment prior to development) versus other categories (to be defined)
  12. Tracking data (i.e. recaptures or GPS tracking of individual organisms)
  13. Machine observation (e.g. camera trap)

The multiple categories would be added to each occurrence record at indexing, allowing an intuitive filter to be added in GBIF.org so people can select on/off the dataset categories that interest them.
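
As an illustration only, here is a minimal sketch of that indexing step and the on/off filter. It assumes a hypothetical in-memory mapping from dataset key to category labels; `DATASET_CATEGORIES`, `index_occurrence`, `filter_occurrences`, the field name `datasetCategory` and the label spellings are all made up for the example and are not part of the GBIF API.

```python
# Hypothetical registry content: dataset key -> set of categories (multi-value).
DATASET_CATEGORIES = {
    "ds-1": {"CitizenScience", "Observation"},
    "ds-2": {"EnvironmentalDNA", "SamplingEvent"},
    "ds-3": {"NaturalHistoryCollection"},
}


def index_occurrence(record):
    """Return a copy of the record with its dataset's categories attached."""
    categories = DATASET_CATEGORIES.get(record["datasetKey"], set())
    return {**record, "datasetCategory": sorted(categories)}


def filter_occurrences(records, include=None, exclude=None):
    """Keep records matching any `include` category and no `exclude` category."""
    include = set(include or ())
    exclude = set(exclude or ())
    kept = []
    for record in records:
        categories = set(record["datasetCategory"])
        if include and not (categories & include):
            continue  # an include filter is set and nothing matches
        if categories & exclude:
            continue  # at least one excluded category is present
        kept.append(record)
    return kept


occurrences = [
    {"id": 1, "datasetKey": "ds-1"},
    {"id": 2, "datasetKey": "ds-2"},
    {"id": 3, "datasetKey": "ds-3"},
]
indexed = [index_occurrence(o) for o in occurrences]

# "Switch off" everything that comes from datasets tagged as eDNA.
without_edna = filter_occurrences(indexed, exclude={"EnvironmentalDNA"})
print([o["id"] for o in without_edna])
```

Because the field is multi-value, a record is dropped as soon as any one of its dataset's categories is excluded, which is one possible reading of the filter and not a decided behaviour.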

CC @ahahn-gbif @MortenHofft for comments in particular

ahahn-gbif commented 4 years ago

Thanks!

~Assuming this will also support metrics (and understanding that multivalue means that a dataset can belong to more than one category), I would like to add~ ~9. private sector data~ ~10. tracking data (i.e. recaptures or GPS tracking of individual organisms)~

[Tim: Thanks - Added above!]

ahahn-gbif commented 4 years ago

Question: should 4. metagenomic (eDNA) be two separate categories? There is quite a difference in interpretation of these data, even though they are both "sequence based". @ManonGros, would you comment?

[Tim Edited to add: I've split them above now, but will change again based on more comments]

jlegind commented 4 years ago

Machine observation seems like a sub category of Sampling Event.

timrobertson100 commented 4 years ago

> Machine observation seems like a sub category of Sampling Event.

That's ok isn't it? Because it's multivalue a dataset can be marked as both or just sampling event, or perhaps there are cases where a machine observation would be appropriate where no real sampling protocol is used.

jhnwllr commented 4 years ago

This new category would be free text using the vocab server? Or are we trying to have all the categories defined?

timrobertson100 commented 4 years ago

> This new category would be free text using the vocab server? Or are we trying to have all the categories defined?

~Undecided, but at this point we're proposing the categories~

Revised: I'd now suggest the vocabulary server, as detailed later in this thread.

ManonGros commented 4 years ago

Great! I love the idea!

~Just one comment:~ ~> 4. Single organism metagenomic (i.e. tissue from an NHM specimen)~ ~> 5. Environmental eDNA (e.g. soil sample, water, insect soup etc)~

~Number 4 doesn't seem right. What I understand when reading "Single organism metagenomic" is that someone took a gut sample of a cow (for example) and sequenced it, resulting a bunch of occurrences for the gut microbiome. I guess this isn't the idea, is it?~ ~If you mean that tissues from a specimen were sequenced, then I would write something more along the lines of "Single organism sequenced". And actually, we could group metagenomics with eDNA (often eDNA is metagenomics). So in the end, I think we could do something like:~

~4. Single organism sequenced (i.e. tissue from an NHM specimen)~ ~5. Environmental eDNA and/or metagenomics (e.g. soil sample, water, insect soup etc)~

[Tim: Edited with suggestions expressed here - thanks, you indeed understood what I intended!]

Perhaps @thomasstjerne has some thoughts on this?

thomasstjerne commented 4 years ago

Added Targeted species detection (PCR-based assays)

dschigel commented 4 years ago

Thanks @timrobertson100 for making me aware of the thread, very exciting. So far, I found eight likely independent variables that may determine the evidence / dataset type in GBIF. I need to meditate a bit more before presenting my views here, and happy to brainstorm / whiteboard a bit if people are available?

emeyke commented 4 years ago

Keeping track of this as well

dschigel commented 4 years ago

Hello all, I like the idea of sorting datasets and types of evidence, but I am not sure it is most attractive for users to do so using a single filter / vocabulary (but I got the feasibility as put by Tim). I drew some mind maps but don't have time to add pictures here, so I will just type them up for your consideration. I started from thinking about why users would need to sort datasets / types of evidence. It is a quick way to in/exclude types of data that matter for your cases based on how the evidence was generated and its properties. I came up with 8 independent variables that cross over the suggested categorization of the dataset and the basisOfRecord vocabulary as we have it today. Note that I think the word independent is important here, though some of the combinations of 1-8 below are impossible in real life.

I am using loose words to describe my thinking, this is not a vocabulary I am suggesting, and there are some unresolved overlaps:

  1. Preservation status of evidence: virtual only or physical: fossil, dead, living (zoos, cultures, gardens, aquaria). (Note some things like amber are not easy to place, as one can get DNA from amber, there are subfossils etc.) Question: Can I re-examine the physical material? What and where is it?
  2. Integrity / N species: Single & whole (e.g. insect, i.e. contains all its genet within one individual), partial (tissue sample, leaf, fruit body) or mixed specimen (common in moss and lichen collection, when collecting individual species is not possible: but is not intentional sampling e.g. like plankton see 6). Question: Can I study full morphology, or only some traits, or only link museum specimen to DNA sequence?
  3. DNA: not explored, sequence, PCR. Note: this is in between virtual and physical, as DNA or PCR products can be stored for a long time (physical), but DNA evidence for species presence, often a sequence, is machine-generated virtual evidence not much different from a digital image or a sound. Question: Can I re-examine the identification, do phylogeny, or all I have is a label name?
  4. Dynamic / Static data. Dynamic: tracking, time series, mark-recapture. Question: Can I study processes, or only patterns?
  5. The way the evidence is generated: literature processing, collection digitization, personal observations, systematic sampling. Question: Can I sort the data by reliability of its generation?
  6. For sampling event data, but maybe occurrences, too: presence-only (sampling effort unknown / undocumented), presence-absence, abundance (quantitative). Question: What kinds of statistical analyses are possible?
  7. The way data is packed in GBIF: metadata only, checklist, occurrences only, sampling event. Might include filter by extension used, esp. if we are getting more of those in TDWG. Question: What do I get in my GBIF download, verbatim and GBIF interpreted?
  8. Community generating the data (perhaps this is more relevant to tagging publishers, but one may need to filter occurrences and datasets by it): (groups of) individuals, natural history collections, private sector, marine, citizen science, machine. Some of these are not mutually exclusive: can be "natural history collection" + "citizen science", or "machine". Question: Can I study data trends in a particular demographic sector?

Once again, this is just a capture of unfinished thoughts; it would be nice to brainstorm / whiteboard what good categorization would look like. I was thinking to slice it out as e.g. 1, 7, and 13 in the original post can be simultaneously true. If these are tags and overlap is no problem, then fine. But if this is a strict filter, we may need more than one field to capture types of preservation vs. generating community vs. ways of generating vs. quantitativeness etc. Feel free to discard if out of scope. I also did not find the collection of BoR discussions, which is partly applicable here.

ManonGros commented 4 years ago

I assume the categorisations would come from us (at least that's how it is at the moment for citizen science datasets) but it would be great if other people could help with the curation as well. Just something to keep in mind.

For example, let's say that we ask Node managers to check the datasets tagged "citizen science". We want:

  1. An easy way for them to see all the citizen science datasets for their node.
  2. If a Node manager notices a dataset tagged erroneously, we want to keep track of that so that we don't re-tag it next time.

ManonGros commented 3 years ago

Looking at this issue: https://github.com/gbif/portal-feedback/issues/3381, we would be missing the data extracted from taxonomic literature (i.e., Plazi) category. You are right, I missed it!

timrobertson100 commented 3 years ago

Thanks @ManonGros

> Looking at this issue: gbif/portal-feedback#3381, we would be missing the data extracted from taxonomic literature (i.e., Plazi) category.

That is what this was intended to be:

> Material citations (e.g. taxonomic treatments in literature)

(Related: Plazi just proposed Material citation as an addition to the basisOfRecord vocabulary in the Darwin Core issues for public commentary)

dagendresen commented 3 years ago

+1 @Dmitry for one to many and using keyword tags (instead of a 1:1 core record to category)

+1 @Marie for thinking of enabling Node staff to curate categories --> and can also add a feature request for enabling anybody to annotate a datapoint/set with category information (with provenance intact)

Remember also that a "dataset" (as in Darwin-Core-archive-dataset) can be a mixed bag of "evidence records" (aka core record, eg. aka occurrences) of different categories -- if a category "tag" is designed to apply to all core records in a DwC-A

And that the de-normalization of the "evidence records" (core records) means that one cannot be certain of which class that a given property linked to a core record is intended to be linked to

elywallis commented 3 years ago

I really like this idea. Certainly the ALA has users who want a very simple way to select groupings of records across data providers. The group I hear this request from most are curators/researchers who ‘just’ want museum or herbarium specimens.

A couple of suggestions:

  1. Natural history collection - might still be useful to also have a category for Fossil specimens so these can easily be separated out. The reason for separating Fossils out is that subfossils (or any fossil species still extant) often show up outside the extant distribution and can easily be mistaken for errors and flagged as such, when they’re perfectly legitimate.

  2. Single organism sequenced (i.e. tissue from an NHM specimen) - having an additional category for Tissue sample would be very useful, whether sequences have been derived or not. Users of this category might be researchers seeking tissues for loan/destructive sampling who currently have to search BasisOfRecord = material sample plus Preparations pot luck.

  3. Private sector data - do you mean data gathered by companies undertaking environmental impact assessments prior to approval of development/mining projects? If so, in Australia this would commonly be called “Proponent data” (being data from proponents of a development). If Private sector data means something else, perhaps could have both?

timrobertson100 commented 3 years ago

> Remember also that a "dataset" (as in Darwin-Core-archive-dataset) can be a mixed bag of "evidence records" (aka core record, eg. aka occurrences) of different categories -- if a category "tag" is designed to apply to all core records in a DwC-A

Thanks, @dagendresen. My thinking here was to try and decouple this from the class/basisOfRecord issues in Darwin Core to be able to react to reporting/user needs quickly (e.g. introduce a new tag for datasets). Acknowledging that there can be "mixed bag" datasets, my intuition is that most users would appreciate broad filtering to e.g. "omit records that originate from datasets tagged as eDNA" even if there were a few entries in there that might be of some interest, or to produce reports (e.g. growth charts) based on e.g. data originating from datasets tagged as private-sector related. Does this seem reasonable, please?

> really like this idea

Thanks, @elywallis - I'll add your input to the list at the top now.

> Private sector data - do you mean data gathered by companies undertaking environmental impact assessments prior to approval of development/mining projects?

I believe that was the intention, yes. I don't know the details, but I'm aware the data management team is increasingly running reports on trends using categories like this. I'll add your comments in the top list, without proposing a final decision.
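
For what it's worth, a report like that could be as small as the sketch below. It is only an illustration under stated assumptions: the records, the `datasetCategory` field and the `PrivateSector` label are invented for the example, and nothing here reflects how the data management team actually produces its reports.

```python
from collections import Counter

# Hypothetical indexed records: the year a record was added plus the
# categories inherited from its dataset.
records = [
    {"year": 2020, "datasetCategory": ["PrivateSector"]},
    {"year": 2021, "datasetCategory": ["PrivateSector", "SamplingEvent"]},
    {"year": 2021, "datasetCategory": ["CitizenScience"]},
    {"year": 2022, "datasetCategory": ["PrivateSector"]},
]


def counts_per_year(records, category):
    """Count records per year whose dataset carries the given category."""
    return Counter(
        r["year"] for r in records if category in r["datasetCategory"]
    )


def cumulative(counts):
    """Turn per-year counts into a running total, ordered by year."""
    total, series = 0, []
    for year in sorted(counts):
        total += counts[year]
        series.append((year, total))
    return series


growth = cumulative(counts_per_year(records, "PrivateSector"))
print(growth)
```

The running total is the kind of series a growth chart would plot; records from "mixed bag" datasets are simply counted under every category their dataset carries.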

timrobertson100 commented 3 years ago

Slightly off-topic, but perhaps useful:

It may not be known to many, but GBIF is progressively moving vocabularies like this into our integrated vocabulary server. This will allow data managers (e.g. including node managers @dagendresen ) to be involved in defining the concepts. Concepts can be hierarchical (e.g. finer categorizations of private data) and once a vocabulary version is released, it is picked up in the data processing pipelines. This is still evolving, but LifeStage is in production now.

What this means relating to this issue, is that as we find new requirements to categorise datasets for a new report or community we see emerging, we'll have the tools in place to accommodate that without needing software developer involvement (only requires a vocabulary to be changed, and then proceed with tagging datasets).

dagendresen commented 3 years ago

> "mixed bag" datasets

@timrobertson100 I would (if asked) completely agree that best practice is to avoid "mixed bag" datasets and that a "tag" to enable filter for a "purpose-of-reuse" would be very useful and welcome! And believe we could live well with such functionality not applying 100% to "mixed bag" datasets :-)

(apropos -- GBIF Norway is "negotiating" with Norwegian data publishers to "break" up "mixed bag" datasets into smaller datasets that would be more homogenous)

debpaul commented 3 years ago

@timrobertson100 wrote:

> Slightly off-topic, but perhaps useful:
>
> It may not be known to many, but GBIF is progressively moving vocabularies like this into our integrated vocabulary server. This will allow data managers (e.g. including node managers @dagendresen ) to be involved in defining the concepts. Concepts can be hierarchical (e.g. finer categorizations of private data) and once a vocabulary version is released, it is picked up in the data processing pipelines. This is still evolving, but LifeStage is in production now.
>
> What this means relating to this issue, is that as we find new requirements to categorise datasets for a new report or community we see emerging, we'll have the tools in place to accommodate that without needing software developer involvement (only requires a vocabulary to be changed, and then proceed with tagging datasets).

Tim, can you see my <happy dance!>? At some point, we need something, a talk from GBIF, a TDWG Webinar, about this effort. I think the broader community will find it very enlightening about how we can use the data we have to improve and understand the data.

CecSve commented 2 years ago

> 13. Machine observation (e.g. camera trap)

Maybe this relates to this category and could potentially be a subcategory, but it would be great to be able to categorize datasets from e.g. drones. Other remote sensing data, e.g. radar, sonar etc. could be subcategories as well. However, drones for example can have subcategories themselves, e.g. UAV, UAS and ROV etc.

To keep it simple, should tracking data perhaps be a subcategory of machine observations?

timrobertson100 commented 2 years ago

> should tracking data perhaps be a subcategory of machine observations?

Are catch and release style data (e.g. bird ringing) considered to be "tracking", or identifying an individual by sight (e.g. whale fin)? I genuinely don't know if that is tracking or not, but they wouldn't be machine observation.

ahahn-gbif commented 2 years ago

Alternatively: should we consider a breakdown like this (sub-categories of machine observations, or others) rather as a separate controlled/proposed vocabulary to be used under "methodology"? I do not have a full understanding of user needs here, but there seems to be a difference in purpose between setting simple, intuitive filters ("not eDNA" or "just tracking data"), and the more specialized breakdowns that serve a user being particularly interested in, say, data collected via drones.

In the first case, categorizing at ingestion to serve search filters would be supporting most cases adequately, where more specific queries may be better served by supporting structured keywording of methods used in data collection (including publisher / user guidance on tagging datasets for more detailed methodological approaches).

ahahn-gbif commented 2 years ago

> To keep it simple, should tracking data perhaps be a subcategory of machine observations?

The purpose here, if I understand correctly, is to support users to include/exclude particular content, based on how it was derived. In that sense: I would value the fact that some users may want to exclude known, repeated observations / loggings of one and the same individual over time higher than how these data were collected "technically".

CecSve commented 2 years ago

> Are catch and release style data (e.g. bird ringing) considered to be "tracking", or identifying an individual by sight (e.g. whale fin)? I genuinely don't know if that is tracking or not, but they wouldn't be machine observation.

True, they would not be machine observations so there would need to be a separation of the two.

Jegelewicz commented 2 years ago

At what point is GBIF diverging from TDWG standards? How can we do things as a community if we are developing vocabularies in silos? How will this fit with LatimerCore and eventually whatever MaterialSample standards come out of TDWG? Sigh.

timrobertson100 commented 2 years ago

I've left a comment on https://github.com/tdwg/material-sample/issues/29 but will also note here.

I'm not sure there is a TDWG standard that would cover this, but terms from various vocabularies could be used (relating to LatimerCore, Darwin Core etc). It's really intended to provide the means to codify datasets to allow easy filtering of data and driving reports on data seen in GBIF. We're asked to report on counts by e.g. private sector data etc, which is probably more specific to the GBIF network than the kind of problems TDWG's current task groups cover.

There is of course a large overlap between the GBIF and TDWG communities, and GBIF (staff and network) promotes, implements, and contributes to standards so it could be that one might emerge from this, but it's not immediately obvious.

MattBlissett commented 1 year ago

Also relevant for publishers, e.g. private sector publishers: https://docs.gbif.org/private-sector-data-publishing/2.0/en/#table-01

CecSve commented 7 months ago

I have added the vocabulary now as DatasetCategory on UAT with the following changes:

  1. Citizen science data
  2. Observation data
  3. Natural history collection
     a. Consider separating out fossils as a separate category, to avoid accidental misuse - added Fossil as a child of NaturalHistoryCollection
  4. Single organism sequenced (i.e. tissue from an NHM specimen) - added Tissue as child of SingleOrganismSequenced
     a. Consider adding tissue sample as well (which may or may not be sequenced) to aid discovery of preserved tissue without drawing on ambiguous other terms
  5. Environmental DNA and/or metagenomics (e.g. soil sample, water, insect soup etc)
  6. Targeted species detection (PCR-based assays)
  7. Long term monitoring data
  8. Sampling event (where some protocol has been used)
  9. Checklist data
  10. Material citations (e.g. taxonomic treatments in literature)
  11. Private sector data - added as BusinessSector instead
      a. Consider splitting this into finer categories (e.g. proponent data for environmental impact assessment prior to development) versus other categories (to be defined)
  12. Tracking data (i.e. recaptures or GPS tracking of individual organisms)
  13. Machine observation (e.g. camera trap)

I have added comments in brackets as Description, when possible, but several concepts could benefit from a Description and ideally also an External description
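
To show what the parent/child relations above could mean for filtering, here is a rough sketch. The two child-to-parent pairs come from the list above; everything else (the `PARENT` mapping, `with_ancestors`, `matches`, and the idea that a filter on a parent also matches its children) is an assumption made for illustration, not a description of how the vocabulary server resolves hierarchies.

```python
# Hypothetical child -> parent links; only these two pairs are from the list above.
PARENT = {
    "Fossil": "NaturalHistoryCollection",
    "Tissue": "SingleOrganismSequenced",
}


def with_ancestors(concept):
    """Return the concept followed by all of its ancestors, nearest first."""
    lineage = [concept]
    while lineage[-1] in PARENT:
        lineage.append(PARENT[lineage[-1]])
    return lineage


def matches(record_categories, wanted):
    """True if any category on the record is `wanted` or a descendant of it."""
    return any(wanted in with_ancestors(c) for c in record_categories)


print(with_ancestors("Fossil"))
print(matches(["Fossil"], "NaturalHistoryCollection"))  # child matches parent filter
print(matches(["NaturalHistoryCollection"], "Fossil"))  # parent does not match child filter
```

With this reading, tagging a dataset as Fossil would still let it show up under a Natural history collection filter, while a Fossil filter stays narrow.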

tobiasgf commented 7 months ago

The issue's name is "Add category to dataset" and the vocabulary is called "DatasetCategory", but as I read it, it is a multi-value field at occurrence level. Maybe we should consider renaming the field and issue to reflect that?

tobiasgf commented 7 months ago

I read it as the main aim is to be able to provide intuitive filters for the users of the data. That is important to keep in mind, so we do not make it over-complicated. I believe Data Products / Helpdesk must have an intuitive feeling (at least) on which types of data data users most often wish to focus on / exclude, and that those categories are the ones now finding their way into the vocabulary. I have some suggestions/comments on those suggested (later...).

tobiasgf commented 7 months ago

private sector serves a user need much like the wish to be able to filter on thematic types of data like fresh water, health, marine (see this issue), where the wish is to either produce reports/growth charts OR delimit classical data types of e.g. habitat relevance. I believe it is wise to think about these needs in the same work here (not sure if they should be included in the same overall field).

tobiasgf commented 7 months ago

If I understand it correctly, the consensus is that this field (at occurrence level) eventually contains values that are being assigned based on some rules upon ingestion, minimizing the need for manual interaction/curation.

Some thoughts on this:

Should we have a first brainstorm/meeting on how such rules could be - both at a general level, but also checking that we can actually establish some rules for the categories that have been proposed already. And then start designing those rules for real.

Some early thoughts/examples on what might be used for rules:

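simple info about known sources, e.g.:

* publisher id: iNaturalist is always citizenScience, NatureMetrics is "Private Sector"

* dataset id: INSDC/ENA is all "DNA"

content of selected fields, e.g.:

* has something in dna-derived extension

* uses eventCore

taxon belongs to a selected checklist

* "all parasites"

* "freshwater species"

spatial rules

* shape file with marine areas

* freshwater

auto-labelling from data formatting tools and similar

* Data coming from the "eDNA tool" is always DNA metabarcoding

* CamtrapDP is Machine observation (or is it?)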

Positive/negative lists based on manual curation/refinement (e.g. "no this is not citizen science although the rule suggests so" or "this IS citizen science although the rule suggests it is not")

...?

And combinations of the above, including procedures like the Clustering Algorithm. Simpler rules are of course preferable, and could help refine the categories of the vocabulary?

CecSve commented 7 months ago

> The issue's name is "Add category to dataset" and the vocabulary is called "DatasetCategory", but as I read it, it is a multi-value field at occurrence level. Maybe we should consider renaming the field and issue to reflect that?

The field will contain information at a record level about how the dataset was compiled so it is pointing to the dataset source in a way. However, most users will not access data on GBIF by downloading specific datasets, but rather query across datasets and this is why the information has to be at record level. The original proposal was to call the field category, but adding dataset qualifies the content of the field more precisely for users.

CecSve commented 7 months ago

> private sector serves a user need much like the wish to be able to filter on thematic types of data like fresh water, health, marine (see this issue), where the wish is to either produce reports/growth charts OR delimit classical data types of e.g. habitat relevance. I believe it is wise to think about these needs in the same work here (not sure if they should be included in the same overall field).

We could maybe have concepts like ThematicAreaFreshwater, ThematicAreaHealth etc., however, this would depend on both the scope and the expansion of thematic areas. Would all freshwater data be part of the freshwater thematic area by default or is it only mobilized data as part of the thematic area that should automatically be mapped to such a concept? If the latter, then I do not think that a controlled vocabulary inclusion would be the most optimal solution.

Also, are the thematic area filter options more internally relevant or of public relevance? The scope of these categories should be for external end-users, not for internal GBIFS relevance.

CecSve commented 7 months ago

> If I understand it correctly, the consensus is that this field (at occurrence level) eventually contains values that are being assigned based on some rules upon ingestion, minimizing the need for manual interaction/curation.
>
> Some thoughts on this:
>
> Should we have a first brainstorm/meeting on how such rules could be - both at a general level, but also checking that we can actually establish some rules for the categories that have been proposed already. And then start designing those rules for real.
>
> Some early thoughts/examples on what might be used for rules:
>
> simple info about known sources, e.g.:
>
> * publisher id: iNaturalist is always citizenScience, NatureMetrics is "Private Sector"
>
> * dataset id: INSDC/ENA is all "DNA"
>
> content of selected fields, e.g.:
>
> * has something in dna-derived extension
>
> * uses eventCore
>
> taxon belongs to a selected checklist
>
> * "all parasites"
>
> * "freshwater species"
>
> spatial rules
>
> * shape file with marine areas
>
> * freshwater
>
> auto-labelling from data formatting tools and similar
>
> * Data coming from the "eDNA tool" is always DNA metabarcoding
>
> * CamtrapDP is Machine observation (or is it?)
>
> Positive/negative lists based on manual curation/refinement (e.g. "no this is not citizen science although the rule suggests so" or "this IS citizen science although the rule suggests it is not")
>
> ...?
>
> And combinations of the above, including procedures like the Clustering Algorithm. Simpler rules are of course preferable, and could help refine the categories of the vocabulary?

Should we create a new issue for implementation and automated categorization perhaps @timrobertson100?
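
To make the discussion a bit more concrete, a very rough sketch of how such rules and the positive/negative lists could combine is given below. Everything in it is hypothetical: the record fields (`publisher`, `hasDnaDerivedExtension`, `core`, `datasetKey`), the category labels, the dataset keys and the function names are placeholders for illustration, and the order in which overrides are applied is just one possible choice.

```python
# Each hypothetical rule takes a record (dict) and returns a set of categories.
def by_publisher(record):
    """Known sources: publisher name -> categories (examples from the list above)."""
    known = {
        "iNaturalist": {"CitizenScience"},
        "NatureMetrics": {"PrivateSector"},
    }
    return known.get(record.get("publisher"), set())


def by_content(record):
    """Content of selected fields: DNA-derived extension and event core."""
    categories = set()
    if record.get("hasDnaDerivedExtension"):
        categories.add("DNA")
    if record.get("core") == "Event":
        categories.add("SamplingEvent")
    return categories


RULES = [by_publisher, by_content]

# Manual curation keyed by (made-up) dataset key: categories forced on or off.
FORCE_ON = {"ds-42": {"CitizenScience"}}
FORCE_OFF = {"ds-7": {"CitizenScience"}}


def categorize(record):
    """Union of all rule results, then apply the curated positive/negative lists."""
    categories = set()
    for rule in RULES:
        categories |= rule(record)
    key = record.get("datasetKey")
    categories |= FORCE_ON.get(key, set())
    categories -= FORCE_OFF.get(key, set())
    return sorted(categories)


print(categorize({"datasetKey": "ds-1", "publisher": "iNaturalist"}))
print(categorize({"datasetKey": "ds-7", "publisher": "iNaturalist", "core": "Event"}))
print(categorize({"datasetKey": "ds-42", "hasDnaDerivedExtension": True}))
```

Keeping the curated lists separate from the rules means a correction made by, say, a Node manager survives the next automated run instead of being re-tagged.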

tobiasgf commented 7 months ago

> The field will contain information at a record level about how the dataset was compiled so it is pointing to the dataset source in a way.

OK, then I did misunderstand. If the values/categories have to be the same across all records in a dataset, then we can of course not use the same approach for "thematic data" which varies within datasets (e.g. rats are health relevant, but not all iNaturalist is health relevant. Brown Trout is fresh water but not all iNaturalist is fresh water, ....). Also ENA/INSDC datasets have a mixture of the categories of DNA-associated data, that would make it difficult to categorize at dataset level.

I understand that most datasets are of a single category, but I am not sure if I understand why the category classification needs to refer to dataset level (again with the user in perspective). Some categories will only be possible to infer (from rules) by looking at the single occurrences anyway.

tobiasgf commented 7 months ago

> Would all freshwater data be part of the freshwater thematic area by default or is it only mobilized data as part of the thematic area that should automatically be mapped to such a concept?
>
> Also, are the thematic area filter options more internally relevant or of public relevance? The scope of these categories should be for external end-users, not for internal GBIFS relevance.

All data yes, and the themes are of user relevance (also/primarily)

tobiasgf commented 7 months ago

Sorry for expanding the issue into the topic on making it operational. As I indicate, the attempt to design the rules may affect the actual delimitation of categories. But no need to mix in same issue, I guess. Sorry.