gbif-norway / helpdesk

Please submit your helpdesk request here (or send an email to helpdesk@gbif.no). We will also use this repo for documentation of node helpdesk cases.
GNU General Public License v3.0

Publish a literature-based checklist+occurrence dataset #119

Open dagendresen opened 1 year ago

dagendresen commented 1 year ago

Advanced question from Iryna. Maybe look at how Plazi publishes? Maybe look at the new ChecklistBank?

We are going to publish a literature-based checklist+occurrence dataset. In DwC terms it will be Taxon core + Occurrence extension + References extension, is this correct? My questions are these:

1) Is the Literature reference extension still valid? I see the field "identifier" here: https://rs.gbif.org/extension/gbif/1.0/references.xml. Is this the field for the reference ID? It looks a bit weird. Or is it better to cite the bibliography in another way?

2) Regarding taxonomy: let's say the literature has Metatrichia horrida, and the currently accepted name is Metatrichia vesparia. We want to keep both names in the dataset. Should we put the original name, as provided in the source, in the Occurrence dataset or in the Taxon core? Which DwC terms should we use for the original taxon name (as in the cited literature, not the basionym etc.) and for the current taxon name?

It looks like obvious stuff, but I admit I am confused.

Best regards, Iryna

rukayaj commented 1 year ago

This kind of reminds me a bit of the Nordic plant uses dataset (https://ipt.gbif.no/resource?r=nhm-plant-uses). That uses the Literature reference extension like this:

image

For the taxonomy, I don't remember the details exactly but it looks like we used https://dwc.tdwg.org/terms/#dwc:acceptedNameUsage for the current accepted name, and scientificName for the name in the literature. The mapping looks like this:

image
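As a rough illustration of that mapping, here is a minimal sketch that writes one Taxon core row using the example names from the question. The `taxonID` value and the helper name `write_taxon_core` are placeholders invented for this example, not taken from the Nordic plant uses dataset.

```python
import csv
import io

# Hypothetical Taxon core row; the taxonID is a placeholder, not a real identifier.
taxon_rows = [
    {
        "taxonID": "taxon-0001",
        "scientificName": "Metatrichia horrida",      # name as cited in the literature
        "acceptedNameUsage": "Metatrichia vesparia",  # currently accepted name
    },
]


def write_taxon_core(rows):
    """Serialise rows as tab-separated text with a Darwin Core header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(rows[0].keys()),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


taxon_core_text = write_taxon_core(taxon_rows)
print(taxon_core_text)
```

The column names match the Darwin Core terms, so a file like this should be straightforward to map automatically when it is uploaded to the IPT.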

rukayaj commented 1 year ago

> We are going to publish a literature-based checklist+occurrence dataset. In DwC terms it will be Taxon core + Occurrence extension + References extension

Doesn't it perhaps make more sense to publish this as 2 datasets? 1 checklist, and 1 occurrence?

IrynaYa commented 1 year ago

Hi Dag, hi Rukaya,

Thanks for your help! The taxonomy mapping is now clear: scientificName for cited names, acceptedNameUsage for current names. The scheme I was thinking about looks like the one in the picture below. It is a bit different from the datasets published by Plazi, which are mostly records associated with one publication.

Regarding references, I see no point in splitting references page by page, as is done in the Nordic plant uses example; we don't have the resources for that work. I just want to give a reference for each occurrence.

If I publish two separate datasets, Checklist and Occurrence, as Rukaya suggested, can I somehow keep the connection between them by taxonID? Should this then be explained in the metadata?

[image: Schema]

dagendresen commented 1 year ago

> can I somehow keep the connection between them by taxonID?

Could you make globally unique taxonIDs? Such as generating a urn:uuid:UUID and reusing the same taxonID in both datasets.

dagendresen commented 1 year ago

Same for occurrenceID, try to generate a globally unique identifier -- and avoid composite identifiers (where you use the taxonID as part of the occurrenceID string) ;-)

rukayaj commented 1 year ago

Yes, I would use UUIDs (see https://www.uuidgenerator.net/ to bulk generate them) and then explain in the metadata of both datasets that they are related datasets and they complement each other.
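If you would rather script this than use the website, a minimal sketch with Python's standard `uuid` module could look like the following. The record structure and the helper name `new_urn_uuid` are made up for illustration; the point is that the taxonID is generated once per taxon and reused in both datasets, while each occurrenceID is an independent UUID rather than a composite of the taxonID.

```python
import uuid


def new_urn_uuid():
    """Return a random (version 4) UUID in urn:uuid: form."""
    return f"urn:uuid:{uuid.uuid4()}"


# Hypothetical checklist: one taxonID per name, generated once and then stored.
checklist = [
    {"taxonID": new_urn_uuid(), "scientificName": "Metatrichia horrida"},
]
taxon_id_by_name = {row["scientificName"]: row["taxonID"] for row in checklist}

# Hypothetical occurrence records: each gets its own occurrenceID and reuses
# the taxonID from the checklist instead of embedding it in the occurrenceID.
cited_names = ["Metatrichia horrida", "Metatrichia horrida"]
occurrences = [
    {
        "occurrenceID": new_urn_uuid(),
        "taxonID": taxon_id_by_name[name],
        "scientificName": name,
    }
    for name in cited_names
]

for record in occurrences:
    print(record["occurrenceID"], record["taxonID"])
```

Identifiers like these only stay stable if they are saved with the data after the first run. Regenerating them every time the dataset is rebuilt would break the link between the two datasets.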

IrynaYa commented 1 year ago

> Could you make globally unique taxonIDs? Such as generating a urn:uuid:UUID and reusing the same taxonID in both datasets.

Yes, sure. I was going to generate them via UUID.

IrynaYa commented 1 year ago

Thank you for the explanations!

JuliianaLeshchenko commented 1 year ago

Dear @dagendresen and @rukayaj, I have several questions about the dataset in the picture from Iryna above. I'm trying to find a balance between a machine-readable and a human-readable dataset. Concerning the Occurrence extension:

Regarding taxonID: could this be the GBIF ID? Species will be repeated in my dataset, so can I put, for example, https://www.gbif.org/species/212 in this column, with each species getting its own ID?

Regarding the book ID: could it be the surname and year of publication, for example Lavitska-1949, or a DOI for newer publications? Books will, of course, also be repeated in the dataset.

Thank you for your feedback!

rukayaj commented 1 year ago

Hi @JuliianaLeshchenko, for the taxonID you could indeed use any of the taxon identifiers that you can see at the bottom of the page on e.g. https://www.gbif.org/species/4352338. You could also use https://www.uuidgenerator.net/ to generate v4 UUIDs. It doesn't really matter as long as it's unique. You can see in the standards documentation https://dwc.tdwg.org/terms/#dwc:taxonID that it just has to be something which, at minimum, is unique within the dataset.

For 'identifier' in the Literature reference extension, the definition says this should be the ISBN, DOI, or similar. You can read more about the extension here: https://rs.gbif.org/extension/gbif/1.0/references.xml. In other words, in the last table in Iryna's image, 'id' and 'identifier' can be the same column. You would also need a link back to the core table (so either taxonID or occurrenceID).
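To make the shape of such a references table concrete, here is a minimal sketch. Everything in it is illustrative: the core IDs, the `source_key` field, the citation text and the helper `build_reference_rows` are placeholders, and the "surname-year" key is simply the example from the question above. Only `identifier` and `bibliographicCitation` are meant as field names from the extension.

```python
# Hypothetical lookup of sources, keyed by the "surname-year" ID from the question.
# The citation text is a placeholder, not a real bibliographic record.
sources = {
    "Lavitska-1949": "Lavitska (1949). <full citation goes here>",
}

# Hypothetical core records, each pointing at the source it was taken from.
core_records = [
    {"taxonID": "taxon-0001", "source_key": "Lavitska-1949"},
    {"taxonID": "taxon-0002", "source_key": "Lavitska-1949"},
]


def build_reference_rows(records, sources, core_id_field="taxonID"):
    """Build one extension row per core record: core link plus reference fields."""
    return [
        {
            core_id_field: record[core_id_field],     # link back to the core table
            "identifier": record["source_key"],       # a DOI or ISBN where one exists
            "bibliographicCitation": sources[record["source_key"]],
        }
        for record in records
    ]


reference_rows = build_reference_rows(core_records, sources)
for row in reference_rows:
    print(row)
```

The same book is repeated once per core record, which is expected: the extension is a star-schema table, so repetition of the reference is the price of linking each row back to the core.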

We suggest you publish two datasets, one for the occurrences and one for the taxon checklist.
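With two separate datasets nothing enforces the taxonID link automatically, so a quick check before publishing can help. This is a minimal sketch assuming both tables have been loaded as lists of dictionaries; the function name `missing_taxon_ids` and the sample IDs are made up for the example.

```python
def missing_taxon_ids(checklist_rows, occurrence_rows):
    """Return taxonIDs used by occurrences that are absent from the checklist."""
    checklist_ids = {row["taxonID"] for row in checklist_rows}
    occurrence_ids = {row["taxonID"] for row in occurrence_rows}
    return sorted(occurrence_ids - checklist_ids)


# Hypothetical sample data with placeholder identifiers.
checklist_rows = [{"taxonID": "taxon-0001"}, {"taxonID": "taxon-0002"}]
occurrence_rows = [
    {"occurrenceID": "occ-1", "taxonID": "taxon-0001"},
    {"occurrenceID": "occ-2", "taxonID": "taxon-0003"},  # not in the checklist
]

unmatched = missing_taxon_ids(checklist_rows, occurrence_rows)
print(unmatched)
```

An empty result means every occurrence points at a taxon that exists in the checklist.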

rukayaj commented 1 year ago

If you want some one on one help then I or @MichalTorma can do a zoom call with you, or you can paste the data here and I will show you how I would format it.

dagendresen commented 1 year ago

Reusing the Catalog of Life LSID or the GBIF taxonKey as the taxonID in your dataset is much much better than creating a new UUID.
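One possible way to collect GBIF taxonKeys in bulk is the GBIF species match web service. The sketch below assumes the `https://api.gbif.org/v1/species/match` endpoint and its `usageKey` and `matchType` response fields; the helper names are invented for this example, and the network call is defined but not executed here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GBIF_MATCH_ENDPOINT = "https://api.gbif.org/v1/species/match"


def build_match_url(scientific_name):
    """Build the species match URL for one scientific name."""
    return f"{GBIF_MATCH_ENDPOINT}?{urlencode({'name': scientific_name})}"


def taxon_id_from_match(match):
    """Return the usageKey as a string, or None when no match is reported."""
    if match.get("matchType") == "NONE" or "usageKey" not in match:
        return None
    return str(match["usageKey"])


def lookup_taxon_id(scientific_name):
    """Network call: fetch the match for one name and extract its key."""
    with urlopen(build_match_url(scientific_name), timeout=30) as response:
        return taxon_id_from_match(json.load(response))


print(build_match_url("Metatrichia vesparia"))
```

Fuzzy or higher-rank matches are worth reviewing by hand before the key is stored as the taxonID, since a wrong key is harder to spot later than a missing one.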