CatalogueOfLife / general

The Catalogue of Life
CoL name rendering rules #47

Closed mdoering closed 3 years ago

mdoering commented 6 years ago

How should a scientific name be rendered in the Catalogue of Life? There are different code recommendations for Botanical, Zoological, Bacterial & Virus names. There is also considerably different practice between major zoological groups and data providers.

Document the desirable format and rules that should always be applied for each code. To be considered:

mdoering commented 6 years ago

From Yuri:

Rules for infraspecific taxa in the Animalia kingdom should be: 1) all trinomials in accepted names in Animalia should have no infraspecific markers (i.e. all accepted trinomials in Animalia are subspecies and have no infraspecific markers); 2) markers "var." & "f." should appear with trinomials in synonyms if they are present in sources; 3) marker "subsp." in synonyms should be eliminated.
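The three rules above can be sketched as a small rendering function. This is only an illustration under stated assumptions: the function name, its parameters and the placeholder names "Aus bus cus" are hypothetical, and the handling of markers other than "var.", "f." and "subsp." is a guess, since the rules do not mention them.

```python
def render_zoo_trinomial(genus, species, infra, marker=None, accepted=True):
    """Render an Animalia trinomial following the three rules quoted above."""
    # Rule 1: accepted trinomials never carry an infraspecific marker.
    # Rule 3: "subsp." is eliminated in synonyms as well.
    if accepted or marker in (None, "subsp."):
        return f"{genus} {species} {infra}"
    # Rule 2: "var." and "f." stay in synonyms when the source has them.
    if marker in ("var.", "f."):
        return f"{genus} {species} {marker} {infra}"
    # Assumption: any other marker is dropped; the rules do not cover this case.
    return f"{genus} {species} {infra}"
```

For example, a synonym supplied with "var." keeps its marker, while the same epithets supplied as an accepted name are rendered as a bare trinomial.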

yroskov commented 6 years ago

The CoL rules are:

mdoering commented 6 years ago

Thanks @yroskov. With infrageneric names I meant names at an infrageneric rank in the classification or synonymy, not as part of a species name. The latter is what I was actually trying to refer to with the subgenus classification of species.

In any case I reckon there are no infrageneric names in the CoL up to now, but maybe there will be when we deal better with genera and list their synonymy? Often the basionym of a genus is a subgenus or some other infrageneric name.

In regard to the subgenus as part of the species name don't you think it would be much better to remove it in all cases instead of a patchwork of different styles? It also does not really convey much information.

What do you mean by this? A botanical variety does have authors, no?

Botanical type trinomials have no authorstrings

yroskov commented 6 years ago

In regard to the subgenus as part of the species name don't you think it would be much better to remove it in all cases instead of a patchwork of different styles?

If you remove subgenus from the species name, you’ll generate a lot of duplicated names, because almost all GSDs with subgenera in the names are repeating name with/without subgenus among synonyms, for example:

Accepted scientific name:

Synonyms:

Accepted scientific name:

Synonyms:

If CoL decides to remove newly built duplicates, it will require permission from the dataprovider, because this is significant content change.

So, it is not a way which CoL can follow. New infrastructure should accommodate infragenera in binomials, as it is defined by ICZN.

Botanical type trinomials have no authorstrings

A type trinomial is a trinomial where the infraspecific epithet is identical to the species epithet: Trifolium pratense var. pratense. A type trinomial (“autonym”) is automatically established with the publication of the first recognized infraspecies.

Bot.Code. Art 26.1. The name of any infraspecific taxon that includes the type of the adopted, legitimate name of the species to which it is assigned is to repeat the specific epithet unaltered as its final epithet, not followed by an author citation. Such names are autonyms.

So, a list of accepted varieties for Trifolium pratense should look like that:

Trifolium pratense var. pratense
Trifolium pratense var. americanum Harz
Trifolium pratense var. frigidum Gaudin
Trifolium pratense var. parviflorum Bab.
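A minimal sketch of the autonym rule from Art. 26.1 as a rendering step: when the infraspecific epithet repeats the specific epithet, the author is suppressed. The function name and parameters are hypothetical, and the author "L." passed in the usage note is only there to show that it is ignored for the autonym.

```python
def render_bot_infraspecific(genus, species, marker, infra, author=None):
    """Render a botanical infraspecific name; autonyms get no author citation."""
    name = f"{genus} {species} {marker} {infra}"
    # Art. 26.1: an autonym repeats the specific epithet unaltered
    # and is not followed by an author citation.
    if infra == species or not author:
        return name
    return f"{name} {author}"
```

Calling it with ("Trifolium", "pratense", "var.", "pratense", "L.") yields the bare autonym, while ("Trifolium", "pratense", "var.", "frigidum", "Gaudin") keeps the author.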

mdoering commented 6 years ago

Ah, I've never heard the term "type trinomial" for autonyms before. Yes, that one is clear!

As for the subgenus in the name and even adding duplicates to CoL I am not convinced at all. ICZN does not seem to encourage that: http://www.nhm.ac.uk/hosted-sites/iczn/code/index.jsp?article=6&nfv=true

yroskov commented 6 years ago

Dear Markus,

This is not about convincing you or me, this is about real data in GSDs and in the CoL. The infrastructure should be able to handle them, and the user interface should provide search by binomial, where the subgenus is optional. As it is now in the CoL: a search for Wesmaelius quadrifasciatus brings back Wesmaelius quadrifasciatus, Wesmaelius (Kimminsia) quadrifasciatus and Wesmaelius (Wesmaelius) quadrifasciatus:

http://www.catalogueoflife.org/col/search/all/key/Wesmaelius+quadrifasciatus+/fossil/1/match/1
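The search behaviour described here, where the subgenus is optional, could be approximated as below. This is a hedged sketch, not the CoL search code: `search_key` and `search` are hypothetical names, and the regular expression assumes the subgenus always appears in parentheses directly after the genus.

```python
import re

def search_key(name):
    """Drop a parenthesised subgenus that directly follows the genus."""
    return re.sub(r"^(\S+)\s+\([A-Z][a-z]+\)", r"\1", name.strip())

def search(query, names):
    """Return every stored name whose key equals the key of the query."""
    key = search_key(query)
    return [n for n in names if search_key(n) == key]

NAMES = [
    "Wesmaelius quadrifasciatus",
    "Wesmaelius (Kimminsia) quadrifasciatus",
    "Wesmaelius (Wesmaelius) quadrifasciatus",
    "Chrysopa prasina",
]
```

With this data, `search("Wesmaelius quadrifasciatus", NAMES)` returns the three Wesmaelius entries, and so does a query that includes either subgenus.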

mdoering commented 6 years ago

@yroskov I am questioning the usefulness of having these duplicate names in the CoL. As a user I personally find this awkward. The CoL assembly removes all kinds of names, especially duplicates, or manually suppresses or changes them. So it is perfectly in line with how the CoL works to discuss whether it is useful to have the same binomial with the same authorship - the same name - given twice, just in 2 different subgenera. Whether the GSDs have it or not is not the question; it is about the standards that the CoL wants to enforce as the publisher.

@dremsen @ThierryBourgoin has this ever been discussed in the Global Team or Taxonomic Group?

yroskov commented 6 years ago

Markus,

It would be nice if CoL+ did not try to change CoL rules, but as a first step delivered an IT infrastructure which helps our team to do our assembly and publishing routines.

If taxonomists keep both Wesmaelius quadrifasciatus and Wesmaelius (Kimminsia) quadrifasciatus in their databases, they have a reason for this. If the IT infrastructure is meant to be useful, it should provide a space for this, and not dictate to taxonomists what to do.

has this ever been discussed in the Global Team or Taxonomic Group?

Subgenera in the CoL were discussed by a wide range of GSDs and by the Taxonomy Group in the 4D4Life project (2009-2012). The proposal was adopted by the Global Team and a couple of years later implemented by IT developers. New software and content with subgenera were deployed in production on 23rd June 2015 (http://www.catalogueoflife.org/col/info/special ).

mdoering commented 6 years ago

Yuri, as I said, I question the usefulness. The systems do support interpolated subgenus names and you can have as many for the same binomial as you want in the CoL. But as a user of the CoL I find that unexpected.

The provisional catalogue and names index will definitely merge these variations of the same binomial into one Name instance with a single id. Here it is essential that we can automatically identify the same name in all its variations and not create tons of duplicates.
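One way to picture merging these variations into a single Name instance is to group name strings under a key that ignores the interpolated subgenus and to give each group one id. The sketch below is an assumption about the idea only; `PATTERN`, `name_key` and `build_index` are hypothetical and do not describe the actual names index.

```python
import re
from collections import defaultdict

# Hypothetical pattern: genus, optional "(Subgenus)", then the rest of the name.
PATTERN = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)(?:\s+\((?P<subgenus>[A-Z][a-z]+)\))?\s+(?P<rest>.+)$"
)

def name_key(name):
    """Key that is identical for all subgenus variations of one binomial."""
    m = PATTERN.match(name.strip())
    if not m:
        return name.strip()
    return f"{m['genus']} {m['rest']}"

def build_index(names):
    """Map a single integer id to every string variation of the same name."""
    clusters = defaultdict(list)
    for n in names:
        clusters[name_key(n)].append(n)
    return {i: v for i, (_, v) in enumerate(sorted(clusters.items()), start=1)}
```

Feeding in the three Wesmaelius quadrifasciatus (Reuter, 1894) variations from this thread produces one entry with three strings rather than three separate names.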

yroskov commented 6 years ago

OK.

Just put yourself in the shoes of a non-taxonomist user (a first-year university student, an ecologist, an EEA / IUCN officer, or a librarian): does the name “Wesmaelius (Kimminsia) quadrifasciatus (Reuter, 1894)” from a paper or report refer to the same species as “Wesmaelius quadrifasciatus (Reuter, 1894)”? Does the name “Wesmaelius (Wesmaelius) quadrifasciatus (Reuter, 1894)” have any relation to “Wesmaelius (Kimminsia) quadrifasciatus (Reuter, 1894)”?

CoL gives a simple answer: http://www.catalogueoflife.org/col/search/all/key/Wesmaelius+quadrifasciatus/fossil/1/match/1

If you remove the subgenus from the name, CoL will not be able to give such an answer.

The subgenus in the species name (as well as all kinds of combinations in use) is essential from the user's perspective.

mdoering commented 6 years ago

I can see that it is useful if you have tracked all the different subgeneric classifications for all names, at least in a given group, so one can indeed simply do a dumb string comparison. Otherwise it will raise even more questions about how your name matches. With name matching tools and an introduction to how the CoL treats names, which at least people not familiar with Latin names should read, I don't think it will be an issue if the subgenus is missing, at least from synonyms.

The different variations of the same binomial are actually very similar to chresonyms which we also remove. Or "quadrinomials", e.g. a variety given with its subspecific classification. These should be removed, correct? They are alternative classifications of the exact same bi/trinomial. It is a slippery slope. It really belongs to the question of what is a unique name, covered in #35. A joy for endless discussions.

yroskov commented 6 years ago

Dear Markus,

Let me repeat please, ICZN names with subgenera should be present in the CoL production database and visible through user interface.

Quadrinomials are another issue. CoL is primarily focused on species. It was designed as a species index from its beginning. The assembly database allows only a single rank below species - if you like, some kind of glitch in the structure.

mdoering commented 6 years ago

Subgenera understood, we leave that entirely up to the authors and CoL editor (not in the pCat though as mentioned).

Quadrinomials are an interesting subject though; you might get them through Darwin Core, e.g. in WoRMS. I haven't checked if that actually happens, but we had better be prepared.

Currently we cannot cater for quadrinomials and I would very much like to keep it that way. We can parse them, but the intermediate rank, e.g. subspecies is simply ignored.
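A rough illustration of what "parsed, but the intermediate rank is ignored" could mean for a quadrinomial written with rank markers. The marker set, the function name and the placeholder name "Aus bus subsp. cus var. dus" are assumptions made for this sketch; it does not reproduce the real name parser.

```python
RANK_MARKERS = {"subsp.", "ssp.", "var.", "f."}

def parse_lowest_rank(name):
    """Parse 'Genus species [marker epithet]...' and keep only the lowest rank."""
    tokens = name.split()
    parsed = {"genus": tokens[0], "species": tokens[1],
              "rank": None, "infraspecific": None}
    rest = tokens[2:]
    # Walk marker/epithet pairs; each later pair overwrites the earlier one,
    # so an intermediate rank such as the subspecies is dropped.
    for marker, epithet in zip(rest[0::2], rest[1::2]):
        if marker not in RANK_MARKERS:
            break
        parsed["rank"], parsed["infraspecific"] = marker, epithet
    return parsed
```

For the placeholder quadrinomial this keeps "var." and "dus" and silently discards "subsp. cus", which is the information loss discussed in the comments that follow.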

dremsen commented 6 years ago

I agree with Yuri on the overall issue, particularly his presentation of non-taxonomist users who may discover these names in literature, specimens, etc. and ask how this name relates to the COL name linked to a taxon. These are all objectively related names that can be abstracted from any particular taxonomic view and should be linked via a separate nomenclatural reference system. Then, the COL or any other can selectively decide to display or not display depending on their user needs.

yroskov commented 6 years ago

Quadrinomials.

I have played with them in different ways since 2005:

1) Cut the intermediate rank. But this creates new entities, which may significantly damage the taxonomic concept presented in the source database. I have rejected this approach.

2) Cut the third epithet and accumulate the additional information (synonymy, distribution, references, etc.) under the parent trinomial. This would be the most intelligent approach, but CoL is not capable of doing it in a persistent way.

3) Leave the entire record (i.e. the quadrinomial taxon) outside the CoL. We have accepted this way.

yroskov commented 6 years ago

In an ideal world, the more names the CoL indexes within a species concept, the better. The more objective synonyms the CoL can absorb, the better. We just need to encourage the taxonomic expert to review and adopt the names accumulated in a nomenclatural cluster. It just may have a cost.

mjy commented 6 years ago

I am somewhat surprised that the argument for trinomials is accepted, but yet quadrinomials are somehow different?

Isn't the exact same use case between the two, i.e.:

As a non-taxonomist I see the name 'Aus (Bus) cus dus' in the literature, and I want to see what species it might refer to.

Does this mean I cannot resolve a quadrinomial in CoL?

I'm (very) likely missing something, or reading this out of context.

dremsen commented 6 years ago

Matt - That's always been the primary use case IMO and one that, if we get right, has a substantial user market.

yroskov commented 6 years ago

As a non-taxonomist I see the name 'Aus (Bus) cus dus' in the literature, and I want to see what species it might refer to.

This is a case of a trinomial, and the user will see exactly this view: http://www.catalogueoflife.org/col/search/all/key/Chrysopa+prasina+zelleri/fossil/1/match/1

mjy commented 6 years ago

But that says Results for "Chrysopa prasina zelleri", which is a trinomial.

I tried "Chrysopa (Maculatae) prasina zelleri" ... and it resolves as (I) expected.

Please forget my concern, like I said it was out of context.

mdoering commented 6 years ago

Should not querying the CoL and getting good results be the focus? Rather than returning results that include the exact name string / orthographical version in the synonymy so users with no knowledge can see they match? The search/matching system should help exactly those inexperienced users.

As long as the search does a good job in understanding names and returns good matches we do not need a larger synonymy. A larger synonymy is useful when we cannot automatically derive the answer, e.g. listing all generic recombinations. Otherwise the CoL would need to index all string variations we can think of - which clearly is out of scope and not what anyone ever wanted to achieve.

mjy commented 6 years ago

It does strike me that one way to come to an agreement is to outline a specific set of unit tests (sensu code), with inputs and outputs. When a) the CoL team agrees to those use cases and b) they are implemented then the project is "done". Quickly this will illustrate to all parties that compromises are required.
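A table of inputs and expected outputs, as proposed above, might start out like this. Everything here is a placeholder for discussion: `resolve` is a hypothetical stand-in for whatever matcher would really be tested, and the expected values are examples, not decisions taken in this thread.

```python
import re

def resolve(query):
    """Hypothetical stand-in matcher: strips a subgenus following the genus."""
    return re.sub(r"^(\S+)\s+\([A-Z][a-z]+\)", r"\1", query.strip())

# Each case is (input name string, expected output). The expectations are
# placeholders for the CoL team to agree on.
CASES = [
    ("Wesmaelius (Kimminsia) quadrifasciatus", "Wesmaelius quadrifasciatus"),
    ("Wesmaelius (Wesmaelius) quadrifasciatus", "Wesmaelius quadrifasciatus"),
    ("Chrysopa (Maculatae) prasina zelleri", "Chrysopa prasina zelleri"),
    ("Chrysopa prasina zelleri", "Chrysopa prasina zelleri"),
]

def run_cases(cases):
    """Return the failing cases as (input, expected, actual) tuples."""
    failures = []
    for query, expected in cases:
        actual = resolve(query)
        if actual != expected:
            failures.append((query, expected, actual))
    return failures
```

An empty list from `run_cases(CASES)` would mean every agreed case passes; any disagreement about a name then becomes a concrete row to add or change.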

yroskov commented 6 years ago

Two different issues:

1) Fuzzy search is very important. It was completely missing from the CoL interface, despite kind offers of implementation code from Heimo Rainer and Anton Güntsch.

2) The scope of any IT protocol for identifying relationships between various forms of names is limited. There are too many deviations in taxonomic practices. If the name string is not documented with a publication, or at least “stamped” by an expert (in this sense, a GSD is a publication as well), I would not trust the query results.

mdoering commented 6 years ago

Yuri, but 1) already violates 2). A fuzzy search by nature returns something close, not exactly what was requested. The same goes for a slightly loose author comparison, or for treating subsp and ssp, or forma, form and f., as equivalent. There are many ways in which Latin name strings vary that we can deal with without knowing all variations in advance. I am not suggesting here to be clever and derive name relations. It's purely name parsing and very conservative fuzzy matching restricted to variations we know frequently happen with Latin names. Not just any generic fuzziness, which often yields wrong matches.
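The rank-marker part of this can be shown with a small normalisation table: only the spelling variants named above are folded together, and everything else must match exactly. The alias table and function names are assumptions for illustration, and the sketch deliberately does no edit-distance matching.

```python
# Hypothetical alias table covering only the marker variants mentioned above.
MARKER_ALIASES = {
    "subsp": "subsp.", "subsp.": "subsp.", "ssp": "subsp.", "ssp.": "subsp.",
    "forma": "f.", "form": "f.", "f": "f.", "f.": "f.",
    "var": "var.", "var.": "var.",
}

def canonical(name):
    """Rewrite known rank-marker variants to one spelling; leave the rest as is."""
    return " ".join(MARKER_ALIASES.get(tok, tok) for tok in name.split())

def same_name(a, b):
    """Conservative comparison: equal only after marker normalisation."""
    return canonical(a) == canonical(b)
```

So "Aus bus ssp cus" and "Aus bus subsp. cus" compare equal, while "Aus bus var. cus" and "Aus bus f. cus" do not, because no rule says those two markers are interchangeable.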

And I like @mjy's idea to list test cases. @yroskov would you like to compile some?