Dataset types - Githubissues

mdoering commented 5 years ago

Datasets differ in their focus and scope. The dataset metadata therefore currently offers a single vocabulary to classify them: https://github.com/Sp2000/colplus-backend/blob/master/colplus-api/src/main/java/org/col/api/vocab/DatasetType.java#L6

The dataset type can be important for human users to understand the scope of the dataset. It is also important for processing data correctly, especially when using the data to assemble merged catalogues such as the CoL and the provisional catalogue. For assembly purposes it is vital to know whether a list of names is just names or indeed scrutinized taxa. The ColDP format makes this apparent by separating names from taxa. Data published via DwC-A or ACEF does not make that distinction, though, so for those formats it is vital to know whether a dataset should be treated just as names or also as taxa.

The IPT BestPracticesChecklists resource already provides a good list of common dataset types: https://github.com/gbif/ipt/wiki/BestPracticesChecklists#scope

Should we simply adopt it? Is there any value in breaking down the dataset type classification into multiple properties? geographicScope could be an obvious candidate (none, regional, country, global).

mdoering commented 5 years ago

Another dataset type worth adding is the kind of dataset provided by Plazi, each of which resembles a published paper. Maybe add something like Article?

kcopas commented 5 years ago

fwiw, I think Plazi consistently refers to those as 'taxonomic treatments'.

mdoering commented 5 years ago

A taxonomic treatment is a single taxon in such a publication. A Plazi dataset bundles all treatments from the same publication. Not sure what's the best terminology for the publication, but Plazi refers to them as Articles on their home page: http://plazi.org/

From http://plazi.org/api-tools/api/#What_is_a_treatment:

Treatments are well defined parts of articles that define the particular usage of a scientific name by an author at a given time (the publication) ... There is one archive per article stored in Plazi, containing the data from all the treatments in the article.

But on an article page they have publication ID not article: http://treatment.plazi.org/GgServer/summary/7B4CE9FDCC96F4BF0D2A5AB45EAECF74

Seems to me either Article or (Academic) Publication

kcopas commented 5 years ago

I humbly stand corrected…

yroskov commented 5 years ago

In the CoL we are using the following classes for evaluation of available resources:

Global vs Regional (specify a region: N America, Germany, Thuringia, Western Caucasus, etc.).
Taxonomic vs Nomenclatural vs Thematic (i.e. resources with trustable taxa/name lists, but limited in their main subject areas; specify a subject area: marine, freshwater; list of type specimens; red list; medicine use; prohibited organisms in New Zealand, etc.).
Paper published (specify: paper in journal, book, etc.) vs Digital resource (specify: database, web HTML, PDF, etc.).
Taxonomically Complete vs Incomplete (specify completeness: 10%, 45%, 99%).
Checklist Confidence (quality): Low, Moderate, High.

The term “taxonomic treatment” cannot help much because it needs further itemization. What would be alternatives to “taxonomic treatment”? “Discussion paper”? “Non-taxonomic publication”? “Identification key”? “Checklist”? What?

Unfortunately, Plazi’s definition does not accurately match the meaning of the term as used by taxonomists over the last two centuries. A “taxonomic treatment” is a critical review of a taxon in global (in other words, a “monographic treatment”) or regional scope. For example, “Flora of North America” includes taxonomic treatments for all vascular plants in the USA & Canada; the monograph “The Genus Trifolium” includes taxonomic treatments for all species in one genus globally. A taxonomic treatment might be published in the traditional media of printed books/journals or electronically (FNA example: http://dev.semanticfna.org/wiki/Carex_laevivaginata). The term “taxonomic treatment” also covers unpublished datasets on the working desks of taxonomists inside a project. The standard dataset of a “taxonomic treatment” depends on the project goals and the rules stated for authors and editors. A “taxonomic checklist” (i.e. a name index to names in current use and previously used names) is an indispensable part of a “treatment”. Usually, treatments also include identification keys, classification above and inside species, nomenclatural and taxonomic comments, a brief review of major taxonomic publications, indication of the types, morphological descriptions, phenological and ecological information, generalized distribution, common names, uses, etc. I have never heard of the term being applied to a paper on a single species.

mdoering commented 5 years ago

@yroskov I think there's a misunderstanding. A Plazi dataset in GBIF represents an entire paper/article, which usually contains multiple taxonomic treatments. @myrmoteras do you think "article" is a good term to refer to these "papers", regardless of whether it's in a journal?

myrmoteras commented 5 years ago

a taxonomic treatment: publications or (more frequently) sections of publications [= article] documenting the features or distribution of a related group of organisms (called a “taxon”, plural “taxa”) in ways adhering to highly formalized conventions. (Catapano, 2010)

Each Linnaean binomen is part of a taxonomic treatment. The Plazi dataset "Linnaeus 1758" would be the entire Systema Naturae.

As an example: here is an article, here the treatment-centric view at Plazi's TreatmentBank, and here how it looks at GBIF. Each of the accepted names (lower left in the dashboard summary on top) has its taxonomic treatment. In a different view at GBIF, all the holotypes mentioned in a treatment can be listed.

In an ideal world, for each treatment for a new species the holotype could not only be listed but also linked so that everybody could immediately look at it (see example, from this treatment from this article)

It is rather the opposite of what @yroskov states: The Codes require a taxonomic treatment that has to include a diagnosis and a designation of a holotype and its collection where it is deposited - the tradition that Linnaeus started and is ignored by the cataloguers.

yroskov commented 5 years ago

The Codes require a taxonomic treatment that has to include a diagnosis and a designation of a holotype and its collection where it is deposited - the tradition that Linnaeus started and is ignored by the cataloguers.

Let me please disambiguate this statement (not mine). The Code requirement mentioned is valid for the protologue (in its modern sense) only, while the editorial committee of a project defines a different standard dataset for its own taxonomic treatments.

Linnaeus’ Species Plantarum and Systema Naturae very rarely listed types. Sentences like “Habitat in Asia” cannot be interpreted as a citation of a holotype and its place of deposit. The majority of so-called “Linnaeus types” are lectotypes, validated much later by other authors.

Each Linnaean binomen is part of a taxonomic treatment.

Nonsense. Any kind of publication may include scientific names. Many botanical Latin names were first published in horticultural seed lists. Later, they were validated according to the relevant versions of the Code. A taxonomic treatment may contain binomial as well as uninominal scientific names.

yroskov commented 5 years ago

At least in botanical practice, a “treatment” (as a “set” in the mathematical sense) is defined by a fixed list of assigned authors.

For example, the treatment of the genus Chrysosplenium was published by C.C. Freeman & N.D. Levsen in vol. 8, Flora of North America. This treatment includes six N American species of the genus.

Could you please translate this example into Plazi/GBIF language?

myrmoteras commented 5 years ago

@yroskov can you send me a copy of the treatment of the genus Chrysosplenium published by C.C. Freeman & N.D. Levsen in vol. 8, Flora of North America? I will convert it.

myrmoteras commented 5 years ago

Here is a Linnaeus 1758 treatment for Formica obsoleta.

At that time, it was clear that this referred to specimen(s) in Linnaeus' collection; with growing insight, it became urgent to lectotypify them and to designate holotypes.

yroskov commented 5 years ago

In this example, I can see the first publication of the species, but not a treatment.

Systema Naturae contains taxonomic treatments of the genus Formica. We may say, it contains treatments for Hymenoptera or Insecta, but application of the term “taxonomic treatment” to a single species in Systema Naturae (a monograph published by a single author in 1758) looks very “innovative”.

yroskov commented 5 years ago

About “Linnaeus types”: Linnaeus was famous for replacing specimens in his collections with better new samples throughout his life (at least, this is true for his herbarium). Often, he destroyed old specimens. Lectotypification of his species requires a lot of work with his mail archives and his biography. Nothing is clear there (at least with plants). The Linnaean Plant Name Typification Project has taken over 25 years of intensive work: http://www.nhm.ac.uk/our-science/data/linnaean-typification/index.html

myrmoteras commented 5 years ago

Again, a treatment sensu Plazi is defined here: a taxonomic treatment: publications or (more frequently) sections of publications [= article] documenting the features or distribution of a related group of organisms (called a “taxon”, plural “taxa”) in ways adhering to highly formalized conventions. (Catapano, 2010, https://www.ncbi.nlm.nih.gov/books/NBK47081/)

This definition is modeled and used in TaxPub/JATS (http://htmlpreview.github.io/?https://github.com/tcatapano/TaxPub/blob/master/documentation/tp-taxon-treatment.html), which is the XML basis for publishing semantically enhanced taxonomic articles, among others by the Pensoft journals ZooKeys and Biodiversity Data Journal (see e.g. https://zookeys.pensoft.net/article/29738/download/xml/ for this article: https://zookeys.pensoft.net/article/29738/ ).

In some monographs, like this one https://zookeys.pensoft.net/article/29738/list/7/ there can be over a hundred treatments, or this one https://checklist.pensoft.net/articles.php?id=21049 which includes 225 treatments, here listed in TreatmentBank's overview format http://treatment.plazi.org/GgServer/summary/FFA6FFD6FD73FFC3874EFFD8FFE56E14 for which in many cases no names are in GBIF and COL+ (53% overlap), as shown here in the GBIF view of the article https://www.gbif.org/dataset/243064a9-76d7-48c8-b378-31b5aacf9ad7. But all these names have a link to the treatment and from there to the publication. If there are figures cited in the treatment, then they are also linked: https://www.gbif.org/species/149668646

A scientist thus does not have to painstakingly search for the articles referenced by a name, but in the ideal case has direct access to the treatment, the type or all the materials examined. With iDigBio and DiSSCo in place we'll have millions of specimens that we could cite in this way.

At TreatmentBank we have over 240,000 treatments; for 2017 we extracted over 30% of the treatments of the expected ca. 17,000 new animal species, and we are going to expand this. This is complemented by the extraction of related figures from publications deposited at BLR (https://zenodo.org/communities/biosyslit/search?page&size=20).

mdoering commented 5 years ago

Hi @yroskov I would like to get back to this issue on how to classify the scope of datasets. I mostly agree with your CoL assessment above that there are various properties defining the overall scope:

geographic: Global vs Regional (specify a region: N America, Germany, Thuringia, Western Caucasus, etc.).
type: Taxonomic vs Nomenclatural vs Thematic (i.e. resources with trustable taxa/name lists, but limited in their main subject areas; specify a subject area: marine, freshwater; list of type specimens; red list; medicine use; prohibited organisms in New Zealand, etc.)
taxonomic: which groups are tackled - can be derived from the actual data

Then there are these, which are more ratings or metrics about the dataset:

completeness: 0-100% complete according to all scopes above
confidence: Checklist quality Low, Moderate, High

The DatasetType vocabulary currently mixes at least geographic with subject:

NOMENCLATURAL
GLOBAL: A taxonomic checklist with global coverage, a global species database (GSD).
REGIONAL: A regional or national checklist.
ARTICLE: A dataset representing taxonomic treatments of a single scientific article. Mostly published through Plazi or Pensoft at this stage.
PERSONAL: A list of names uploaded for personal use without a more specific scope given
OTU: A taxonomic checklist focussed on providing OTU identifiers backed by sequences, usually mixed with classic Linnean classifications.
OTHER

If we split off the geographic scope and make that a new property:

GLOBAL
NATIONAL: purely national lists
REGIONAL: any other regional scope

We could use the following dataset type vocabulary:

NOMENCLATURAL: A dataset focussing purely on names, not their classification.
TAXONOMIC: A taxonomic checklist with assertions about the validity of names, synonymy and their classification.
ARTICLE: A dataset representing taxonomic treatments of a single scientific article. Mostly published through Plazi or Pensoft at this stage. A subset of taxonomic.
OTU: A taxonomic checklist focussed on providing OTU identifiers backed by sequences, usually mixed with classic Linnean classifications. A subset of taxonomic.
THEMATIC: Thematic lists focussing on a specific theme, e.g. invasive species, medicinal plants, etc.
PERSONAL: A list of names uploaded for personal use without a more specific scope given.
OTHER

Having distinct OTU and ARTICLE types helps to process that data more accurately.
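For illustration, a minimal Java sketch of how the split into two properties might look. It is not the actual `DatasetType.java` from colplus-api; `GeographicScope`, `DatasetScope` and `isTaxonomic()` are hypothetical names introduced here, and the only logic encoded is the "subset of taxonomic" remark from the list above.

```java
/** Geographic scope as a property of its own, split off from the dataset type. */
enum GeographicScope { GLOBAL, NATIONAL, REGIONAL }

/** The dataset type vocabulary proposed above (sketch, not the real colplus-api enum). */
enum DatasetType {
  NOMENCLATURAL, TAXONOMIC, ARTICLE, OTU, THEMATIC, PERSONAL, OTHER;

  /** ARTICLE and OTU are described above as subsets of TAXONOMIC. */
  boolean isTaxonomic() {
    return this == TAXONOMIC || this == ARTICLE || this == OTU;
  }
}

/** Hypothetical holder that keeps the two properties side by side. */
final class DatasetScope {
  final DatasetType type;
  final GeographicScope geographicScope; // assumption: null when no geographic limit is declared

  DatasetScope(DatasetType type, GeographicScope geographicScope) {
    this.type = type;
    this.geographicScope = geographicScope;
  }
}

final class DatasetScopeDemo {
  public static void main(String[] args) {
    DatasetScope article = new DatasetScope(DatasetType.ARTICLE, null);
    DatasetScope nationalList = new DatasetScope(DatasetType.TAXONOMIC, GeographicScope.NATIONAL);
    System.out.println(article.type + " taxonomic? " + article.type.isTaxonomic());
    System.out.println(nationalList.type + " scope: " + nationalList.geographicScope);
  }
}
```

Keeping the scope in a separate enum means a dataset can be, say, both TAXONOMIC and NATIONAL without needing a combined constant for every pairing.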

yroskov commented 5 years ago

Go back to my statements above. (I am talking only about different “classes” of metadata which are important for management of the CoL). There are six classes:

Type of Data: “Taxonomic” vs “Nomenclatural” vs “Thematic”. [“Article” and “OTU” are child subsets of “Taxonomic” in the sense of CoL, but not separate entities in Type_of_Data.] In CoL we have used the term “Rich Data”. Rich Data contains subsets: description, identification key, illustration, map, sequences, etc. For example, “Rich Data (identification key, description, habitat, lifespan, biology)”.
Geographic Coverage: "Global" vs "Regional" ["National" is a subset of "Regional". The word "national" brings an unwanted political odor, and I am against it. "Regional (Brazil)", "Regional (Europe)", "Regional (Bavarian Alps)" says all that users and data managers need to know].
Taxonomic Completeness (values 1 - 100%).
Checklist Confidence (values from 1 to 5 stars/points) [we should avoid words such as “low quality”, which may sound offensive towards data providers]
Media: “On paper” vs “Digital, not-normalized data” vs “Digital, normalized data”.
Availability: “Public” vs “Private” vs “Restricted Dissemination”
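As a rough sketch only, the six classes above could be captured in one Java value type. The class name `ColResourceMetadata`, its field names and the validation rules (for instance, that a regional resource must name its region) are assumptions made for illustration, not part of any existing CoL code.

```java
/** Hypothetical value type for the six metadata classes listed above. */
final class ColResourceMetadata {
  enum DataType { TAXONOMIC, NOMENCLATURAL, THEMATIC }
  enum Coverage { GLOBAL, REGIONAL }
  enum Media { ON_PAPER, DIGITAL_NOT_NORMALIZED, DIGITAL_NORMALIZED }
  enum Availability { PUBLIC, PRIVATE, RESTRICTED_DISSEMINATION }

  final DataType dataType;
  final Coverage coverage;
  final String region;          // free text such as "Brazil"; unused for GLOBAL
  final int completeness;       // percent, 1 to 100
  final int confidence;         // stars/points, 1 to 5
  final Media media;
  final Availability availability;

  ColResourceMetadata(DataType dataType, Coverage coverage, String region,
                      int completeness, int confidence,
                      Media media, Availability availability) {
    if (completeness < 1 || completeness > 100) {
      throw new IllegalArgumentException("completeness must be 1..100, got " + completeness);
    }
    if (confidence < 1 || confidence > 5) {
      throw new IllegalArgumentException("confidence must be 1..5, got " + confidence);
    }
    // Assumption: a regional resource always names its region.
    if (coverage == Coverage.REGIONAL && (region == null || region.isEmpty())) {
      throw new IllegalArgumentException("a regional resource needs a region");
    }
    this.dataType = dataType;
    this.coverage = coverage;
    this.region = region;
    this.completeness = completeness;
    this.confidence = confidence;
    this.media = media;
    this.availability = availability;
  }

  /** Renders coverage in the style used above, e.g. "Regional (Brazil)". */
  String coverageLabel() {
    return coverage == Coverage.GLOBAL ? "Global" : "Regional (" + region + ")";
  }
}

final class ColResourceMetadataDemo {
  public static void main(String[] args) {
    ColResourceMetadata m = new ColResourceMetadata(
        ColResourceMetadata.DataType.TAXONOMIC, ColResourceMetadata.Coverage.REGIONAL, "Brazil",
        45, 4, ColResourceMetadata.Media.DIGITAL_NORMALIZED,
        ColResourceMetadata.Availability.PUBLIC);
    System.out.println(m.coverageLabel());
  }
}
```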

mdoering commented 5 years ago

I would especially like to have ARTICLE as a distinct type as it allows us to better deal with the datasets. I am much more thinking about machine use of the data than humans and it is important we understand we are dealing with a marked up scientific, taxonomic article.

Similarly, knowing about national lists is often really useful, as nations are a) funding bodies, b) classic sources of faunas and floras, and c) legislative "areas", which makes them different from a region with arbitrary limits.

I am not convinced Media is really useful, but it won't hurt if someone really manages this.

Availability I think should never be allowed to be private or restricted. See licensing.

yroskov commented 5 years ago

Markus, I have formalised the categories important for CoL management. You may have a different opinion when serving other projects/products. This is OK. However, your vision may become a real barrier to managing the CoL.

myrmoteras commented 5 years ago

Let's consider, what's in a taxonomic name, e.g. Formica rufa L. 1758?

As a trained taxonomist, I understand most of it:

1758 refers to the year of publication.
L. refers to Carolus Linnaeus.
L. 1758 is a citation of Systema Naturae ed. 10.
rufa is the species epithet.
Formica is the genus name.
Formica rufa L. 1758 is a citation of a bit of text in L. 1758, where this name was made available.
Formica rufa refers to what taxonomists call a species out there in the wild.

For a taxonomist, the problem is obvious: these highly implicit citations require a lot of experience to trace back to the original description, unless it is in one of the main works.

For a machine, this is completely unresolvable, unless there is an element that specifies that part of it as a citation.
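To illustrate the point, here is a deliberately naive Java sketch. It assumes a simple "Genus epithet Author Year" layout and uses hypothetical names (`NameCitationSketch`, `parse`); real scientific name parsing is far more involved. The pattern can split the string into its parts, but nothing in the string itself resolves "L. 1758" to a publication, which is why an explicit citation element or identifier is needed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Naive splitter for names of the form "Genus epithet Author Year" (illustrative only). */
final class NameCitationSketch {
  // Capitalised genus, lower-case epithet, free-text authorship, four-digit year.
  private static final Pattern BINOMEN =
      Pattern.compile("^([A-Z][a-z]+) ([a-z]+) (.+?),? (\\d{4})$");

  /** Returns {genus, epithet, authorship, year}, or null if the layout does not match. */
  static String[] parse(String name) {
    Matcher m = BINOMEN.matcher(name.trim());
    if (!m.matches()) {
      return null;
    }
    return new String[] {m.group(1), m.group(2), m.group(3), m.group(4)};
  }

  public static void main(String[] args) {
    String[] parts = parse("Formica rufa L. 1758");
    System.out.println("genus=" + parts[0] + " epithet=" + parts[1]
        + " authorship=" + parts[2] + " year=" + parts[3]);
    // The split succeeds, but "L." and "1758" are still only strings:
    // mapping them to Systema Naturae ed. 10 needs knowledge outside the name.
  }
}
```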

In today's digital world, articles have a unique, globally accepted identifier that allows you to find the article with one to a couple of mouse clicks. The article is the source and the basis for making a taxonomic name available.

To make a taxonomic name available, i.e. to name a new species, the Codes describe what needs to be done:

It has to be formally published.
It needs a description and/or a statement of why this is different from others, the blurb of text that a taxonomist wants to read when trying to understand a species.
It needs to have a holotype designated.
It needs to have a citation of the collection where the type is located.

Whenever you want to make an entry in the Catalogue of Life, somebody has to check this. It is standard scientific practice to cite the source, e.g. the respective publication should be cited in the COL.

At the same time, the blurb of text about the referenced taxon is one of probably a hundred million available. Each new species has one; each redescription or nomenclatural change has one. This might be called a snippet or, as we do, a taxonomic treatment. There are over 280,000 of them in TreatmentBank; they are supplementary materials in PubMed Central, automatically extracted from publications based on TaxPub XML. Analogous to articles citing other articles, snippets or treatments cite other treatments, with the addition that citations can represent a synonymy or other nomenclatural act.

All the "rich data" you use is part of a treatment. Anything in these treatments.

All the data above are included as semantic tags in an increasing number of publications, supported by Pensoft and CETAF/EJT, and can be extracted by machine. Through this process, a name can be followed to its treatment, its article, its figures, its specimen. Eventually this can also happen on the same day a species is described. You can explore this at https://www.gbif.org/dataset/5dfd4ff4-6c8b-47f4-9aa7-0f6b1b2af967 where you can see the holotype: https://www.gbif.org/occurrence/search?dataset_key=fd8ba24f-4d68-431c-81f5-e924906aed5d&type_status=HOLOTYPE

To produce a COL that makes use of today's technical possibilities, it seems obvious to include all the above data types, and furthermore to consider using (and, if not available, creating) identifiers that can be used to cite and link to COL, and thus make it a central player in today's rapidly growing knowledge infrastructure, both in taxonomy and well beyond.

CatalogueOfLife / general

Dataset types #50