CatalogueOfLife / general

The Catalogue of Life

Specimen data #36

Closed mdoering closed 6 years ago

mdoering commented 6 years ago

Should the Clearinghouse of CoL+ deal with specimen data, especially types? It will cause significant work to curate and deal with specimen information so it should be clear what is gained.

What additional use cases can we support by having specimen metadata available, as opposed to just knowing the basionym/protonym of a name, which acts as a proxy to the type? Or should we instead link dynamically to GBIF for a specimen (with image) search?

Is it important to know the catalogue number, collection, collector, type location or something else about a type?

mdoering commented 6 years ago

see also #4

rdmpage commented 6 years ago

Given that the whole taxonomic edifice ultimately rests on types (both specimen and taxonomic types) it would seem sensible to include them where available. Leaving them out reduces the value of CoL+ to taxonomists, and reduces the sorts of computation one could do over the names.

Yes, including types will be messy because there is poor standardisation of how to refer to specimens, but why not (a) let people enter free-format text strings if they have the information (IPNI has a lot of this for plants), (b) let people add a URL if one exists (knowing that most will likely break), and (c) if they have structured info, let them include that as well. Linking to GBIF dynamically could be an additional feature (linking by actual occurrence URL is likely to be fragile, unfortunately).
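
A minimal sketch of how options (a) to (c) could sit side by side in one record. The class name `TypeMaterial`, its fields and the example values are all hypothetical illustrations, not an agreed CoL+ schema:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TypeMaterial:
    """Hypothetical record for what is known about the type of one name."""
    name_id: str                                    # the name this type belongs to
    verbatim: Optional[str] = None                  # (a) free-format text as supplied
    url: Optional[str] = None                       # (b) specimen link, may break over time
    structured: dict = field(default_factory=dict)  # (c) parsed fields, if any

    def detail_level(self) -> str:
        """Report the richest tier of information available."""
        if self.structured:
            return "structured"
        if self.url:
            return "url"
        if self.verbatim:
            return "verbatim"
        return "none"


# Invented example values, for illustration only.
free_text_only = TypeMaterial(name_id="n1", verbatim="Holotype in herbarium X, leg. Smith")
parsed = TypeMaterial(
    name_id="n2",
    verbatim="Holotype, collection X, no. 123",
    structured={"typeStatus": "holotype", "collection": "X", "catalogNumber": "123"},
)
print(free_text_only.detail_level())
print(parsed.detail_level())
```

Keeping the verbatim string alongside any parsed fields means nothing is lost when parsing is incomplete.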

dremsen commented 6 years ago

My hope is that type information would be linked to the nomenclatural record and that this record is independent of any taxon record, save perhaps the nominal taxon where the type is the circumscription. As Rod asserts, types form the basis of the system. That said, it would be useful to articulate the specific users and use cases these data would support and where they sit priority-wise. I would put the literature reference before the specimen data. Without some real structure, any data provided would likely end up resembling the current CoL distribution data: useful on a per-record basis, I imagine.

mdoering commented 6 years ago

Yes Dave, I fear the same fate as with the CoL distribution data. And I wonder what the detailed type information really gains over just knowing that a set of names are all based on the same type specimen (which a relation to the same basionym pretty much abstracts). Is it just about locating specimens in the real world? Or does the catalogue number or location really matter? I do wonder about the value added.

dremsen commented 6 years ago

In this regard then, what process do we have for documenting use cases in regard to user/system requirements, actors and their priorities? Something not too technical but sufficiently structured to facilitate ordering and consistency.

olafbanki commented 6 years ago

Hi Dave, we do not have a fixed process for this. I think we need to work something out.

mdoering commented 6 years ago

I would like something simple, mostly narrative but with some concrete examples with data. We have contributed to the CSV on the Web working group's use cases, which look reasonable to me: https://www.w3.org/TR/csvw-ucr/#intro

@dremsen How about just a single markdown document in this repo for simple authoring and internal links?

ThierryBourgoin commented 6 years ago

I’m not sure we have to go there. While it would be rather easy to formulate clear and simple standards for taxonomic types (including typification forms), it would be much more difficult for specimen types: recording specimen types without the type depository would be meaningless. We still don’t have a standardized list of institutions with stabilized acronyms (their names change quite often), and we don’t know how to record, in a standardized way, the private collections that house types.

I’ve started to collect them (not only holotypes): it is really time-consuming, and the information is often incompletely documented in the publication (e.g. how many paratypes there are and how they are distributed between different collections...). I would rather leave this task to GBIF, which collects specimens that can easily be documented with their type status, if any.

As CoL deals with taxa, taxonomic types should belong to CoL+, and as GBIF deals with specimens, specimen types should be a task for GBIF.

Another issue is that a specimen type does not necessarily bear the ‘valid’ taxon name...

rdmpage commented 6 years ago

I'd argue that if we have the information it seems a shame not to make it available; someone might find it useful (e.g., for cross-checking assertions about type status made by data in GBIF, which are often suspect due to issues in matching taxonomic names).

Linking names via basionym assumes we know which name is the basionym. If we don't it's hard to assert that two names are objective synonyms, whereas it's trivial if we say they both have the same type (even if we don't actually have info on type specimen). In a RDF-world, we could think of type specimens as blank nodes ("bnodes") if we lack data on them, and real nodes if we have data (and ideally an identifier).

Like @dremsen I would prioritise literature, but if people have information on types it seems crazy not to make use of that. In the same way, the current CoL stores and displays literature information, even though it's mostly a shocking mess of inconsistent formats and often fragmentary bibliographic data. But it's still potentially useful to those of us trying to link stuff together.
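
To illustrate the blank-node point above, here is a small sketch in plain Python rather than a real RDF library. The function names, the `_:type` label format, the placeholder names and the example.org URL are assumptions made for the example:

```python
from collections import defaultdict
from itertools import count

_blank_ids = count(1)


def blank_type() -> str:
    """Mint a placeholder id for a type that exists but about which we have no data."""
    return f"_:type{next(_blank_ids)}"


def objective_synonyms(name_to_type: dict) -> list:
    """Group names that point at the same type node, blank or identified."""
    by_type = defaultdict(list)
    for name, type_id in name_to_type.items():
        by_type[type_id].append(name)
    return [sorted(names) for names in by_type.values() if len(names) > 1]


shared = blank_type()  # no specimen data, yet both names are known to share this type
names = {
    "Aus bus Smith, 1900": shared,
    "Cus bus (Smith, 1900)": shared,
    "Aus dus Jones, 1950": "https://example.org/specimen/1",  # hypothetical identifier
}
print(objective_synonyms(names))
```

The grouping works without knowing which of the names is the basionym, and a blank node can later be swapped for a real identifier once data on the specimen turns up.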

ThierryBourgoin commented 6 years ago

Agreed that this is essential information to provide; species specimens link the name domain (in GBIF's hands) to the taxon domain (in CoL+). The question is not whether to have it, but where it should be collected and managed so that it can be used efficiently. I suppose that such information attached to a GBIF specimen would be easily manageable because it is naturally better structured: Specimen Name(s), HT [= bearing type of Taxon T(s) present in CoL+] is present in Collection X (GBIF info). I strongly doubt that we would get such structured info in CoL: "Taxon T(s) has (info about its type specimen)" is probably what you would get, no? Since GSDs have never been asked for it, few of them are likely to have it.

dremsen commented 6 years ago

I don't disagree with Rod or Thierry. Specimen data is critical to have and, as factual information, should also be something we accrue with the nomenclatural elements. How granular and parsed we can get is another matter. Is GBIF positioned to identify it and provide sufficient detail to make these linkages? Are typeStatus and scientificName sufficient to make the linkage? And how persistent is access to such a record in a federated environment?
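
For reference, a lookup on just those two fields might be built as below. This is a sketch that only constructs the request URL; it assumes the GBIF occurrence search endpoint and its `scientificName`, `typeStatus` and `mediaType` filters, which should be checked against the current GBIF API documentation. The helper name `type_search_url` is made up:

```python
from urllib.parse import urlencode

# Assumed endpoint of the public GBIF occurrence search API.
GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def type_search_url(scientific_name, type_status=None, with_image=False, limit=20):
    """Build (but do not send) a search URL for type specimens of a name."""
    params = [("scientificName", scientific_name), ("limit", limit)]
    if type_status:
        params.append(("typeStatus", type_status))
    if with_image:
        params.append(("mediaType", "StillImage"))
    return f"{GBIF_OCCURRENCE_SEARCH}?{urlencode(params)}"


# "Aus bus" is a placeholder name, not a real taxon.
print(type_search_url("Aus bus", type_status="HOLOTYPE", with_image=True))
```

Such a query link is dynamic, so it sidesteps the fragility of individual occurrence URLs, but it only works as well as the name matching behind it.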

mdoering commented 6 years ago

There are over 3 million types in GBIF. Nearly 1 million are published with an image.

We extract the distinct names every month, remove a very small fraction that appears dirty, and publish the rest as this dataset, which claims to have 1,222,904 distinct names: https://www.gbif.org/dataset/6cfd67d6-4f9b-400b-8549-1933ac27936f

So there is a large basis to work with. But it needs reviewing. It could be something for the Clearinghouse for nomenclature to take on.

GBIF specimen links are not always stable, so the data would be better copied into the CoL+ system, at least once reviewed. I would also pester GBIF more to provide truly stable identifiers for at least specimens, and to allow only logical deletions so that you can always resolve ids to the same record again. Or at least provide stable ids for the datasets where this can be done, and indicate these clearly.
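
What "only logical deletions" means in practice can be sketched with a toy in-memory registry. The class `SpecimenRegistry`, its methods and the example id are hypothetical and do not describe how GBIF stores records:

```python
class SpecimenRegistry:
    """Toy registry in which an identifier, once issued, always resolves."""

    def __init__(self):
        self._records = {}

    def put(self, specimen_id, data):
        """Create or update a record and mark it as live."""
        self._records[specimen_id] = {"data": dict(data), "deleted": False}

    def delete(self, specimen_id):
        """Logical deletion: flag the record but keep the id and its last data."""
        self._records[specimen_id]["deleted"] = True

    def resolve(self, specimen_id):
        """Return the record for an id, whether or not it has been deleted."""
        return self._records[specimen_id]


registry = SpecimenRegistry()
registry.put("spec-1", {"typeStatus": "holotype"})  # invented id and data
registry.delete("spec-1")
print(registry.resolve("spec-1"))
```

A consumer such as CoL+ can still resolve the id after deletion and see that the record was withdrawn, instead of hitting a dead link.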

proceps commented 6 years ago

@mdoering I did an experiment, pulling type specimens for the group of my interest (Hemiptera: Auchenorrhyncha). I have a database of names which includes about 50,000 valid names and probably 30,000 synonyms, plus combinations, misspellings, nomina nuda, etc.; in total about 112,000 names. After our discussion, I focused specifically on type specimens only. I was able to pull about 13,500 specimens from the GBIF database. I wanted to check how easily I could map those to my database (which I consider relatively clean). I could easily map about 11,500 specimens (specimens, not names: the list included holotypes, paratypes, syntypes, etc., so there is a lot of repetition). So I ended up with about 2,000 names which I had to resolve manually. There was no simple algorithmic solution for them. Of those unmatched names, only about 70-80 were actually new names which were not included in my database (mostly published in the last 3 years). About 100 names were manuscript names (somebody put a holotype label on the specimen and never published it). The rest were all different kinds of misspellings, non-existent combinations, etc. I know that this is supposed to be the cleanest part of GBIF. But it adds about 15% of noise on top of the good names, and there is no easy way to eliminate it. If I worked harder, I could probably reduce it to 10%, but I did not want to allow any chance of introducing errors into a relatively clean data set. The question is: what is better, 10-15% of noise vs. 0.5% of added value?
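
The percentages can be roughly reproduced from the figures quoted above. This is back-of-the-envelope arithmetic only: it takes 75 as the midpoint of "70-80" and treats the specimen and name counts as interchangeable, which the post does not guarantee:

```python
pulled = 13_500     # type specimens pulled from GBIF (approximate)
matched = 11_500    # mapped automatically to the names database
new_names = 75      # assumed midpoint of the "70-80" genuinely new names
manuscript = 100    # unpublished manuscript names

unresolved = pulled - matched                        # left for manual resolution
other_errors = unresolved - new_names - manuscript   # misspellings, bad combinations, etc.

noise_share = (unresolved - new_names) / pulled      # everything unresolved that adds nothing
added_value_share = new_names / pulled

print(unresolved, other_errors)       # 2000 1825
print(f"{noise_share:.1%}")           # 14.3%
print(f"{added_value_share:.1%}")     # 0.6%
```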

mdoering commented 6 years ago

Thanks @proceps, those are interesting numbers. I am not proposing here to pull all names from GBIF types. The question is rather whether we can make use of the specimen data and incorporate it for clean names, just like you did for Auchenorrhyncha. In your case GBIF provides types for less than 10% of your names. It would be interesting to know the number of protonyms that can be linked to a specimen, excluding any recombinations. Can you say something about that?

proceps commented 6 years ago

@mdoering, incorporating type material in association with names is a great move. I fully support this. The number of names I could pull from GBIF was surprisingly low. Exact numbers: 13144 specimens, which I could match to 6672 protonyms. I have data on original combinations and current combinations, and I could check for a match on either. I also have some older combinations, and I used those to match to the protonym as well. I did not do any fuzzy matching. I do not have the exact number of names which I matched directly, but I believe it was about 11,000.
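
A rough sketch of that kind of exact matching, with no fuzzy step. It is not the procedure actually used: the record layout (`original`, `current`, `older`, `protonym_id`), the function names and the placeholder names are all assumptions for illustration:

```python
def build_index(names: list) -> dict:
    """Map every known combination string to the id of its protonym."""
    index = {}
    for record in names:
        combos = (record["original"], record["current"], *record.get("older", []))
        for combo in combos:
            index[combo] = record["protonym_id"]
    return index


def match_specimens(specimens: list, index: dict):
    """Split specimens into exact matches (grouped by protonym) and leftovers."""
    matched, unmatched = {}, []
    for spec in specimens:
        protonym = index.get(spec["scientificName"])
        if protonym is None:
            unmatched.append(spec)  # needs manual resolution
        else:
            matched.setdefault(protonym, []).append(spec)
    return matched, unmatched


# Invented placeholder data.
names_db = [{"protonym_id": "p1", "original": "Aus bus", "current": "Cus bus"}]
specimens = [
    {"scientificName": "Aus bus", "typeStatus": "holotype"},
    {"scientificName": "Cus bus", "typeStatus": "paratype"},
    {"scientificName": "Aus buss", "typeStatus": "holotype"},  # misspelling
]
matched, unmatched = match_specimens(specimens, build_index(names_db))
print(len(matched["p1"]), len(unmatched))
```

Several specimens collapse onto one protonym here, which is why specimen counts and protonym counts differ so much in the numbers above.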

wouteraddink commented 6 years ago

For creating linkages to type specimen and other information it is interesting to look at the TAXREF-LD model to represent nomenclatural and taxonomic information as Linked Data. http://ceur-ws.org/Vol-1933/paper-3.pdf

rdmpage commented 6 years ago

The paper by Michel et al. is nice, but as I pointed out at TDWG 2017, where this work was also presented, this is essentially the same ground covered by @rogerhyam a decade ago (see https://github.com/tdwg/ontology). It is depressing that we are still reinventing this stuff a decade after it was sorted out (never mind that we're STILL arguing about names...).

mdoering commented 6 years ago

Initially, specimen data is not included.