CatalogueOfLife / general

The Catalogue of Life

Define objective rules for taxon concept identity #6

Open mdoering opened 7 years ago

mdoering commented 7 years ago

Define rules for a stable taxonID. We need to understand when a taxon changes sufficiently to warrant an identifier change.

deepreef commented 7 years ago

To define what is meant by a "taxon" instance (to which a taxonID is assigned), we need to establish what the "core properties" of an instance of a "taxon" are, whereby if one of the core properties changes, a new taxonID must be issued. I think it's best to narrow the scope of those properties to representing the "contents" of a taxon, rather than the combination of contents and "context". "Contents" in this sense are the items contained within the circumscription of a taxon. For example, a taxon representing a genus would be defined by the set of species contained within it. Suppose two different assertions of a genus contain different sets of species: Aus sensu Smith contains (Aus bus+Aus cus), with species "dus" placed in genus "Xus", whereas Aus sensu Jones contains (Aus bus+Aus cus+Aus dus). Then Aus sensu Smith would have a different taxonID from Aus sensu Jones because they have different contents. "Context" in this sense means placement within a hierarchical classification. Changing the context of a taxon instance should not cause a change in taxonID. For example, if Smith and Jones both assert the same contents of the genus Aus (e.g., A.bus+A.cus+A.dus), but Smith places the genus in the family Aiidae, and Jones places the genus in the family Xiidae, we do not need a different taxonID to represent Aus sensu Smith and Aus sensu Jones. Logically, this means that for a species-level concept, if the circumscriptions of both Smith and Jones for the species "bus" are the same (i.e., same heterotypic synonymy), then they have the same taxonID even if Jones treats it as "Aus bus" and Smith treats it as "Xus bus". This needs to be fleshed out in a full document.
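The contents-versus-context rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `circumscription_id`, the use of a truncated SHA-1 digest, and the use of bare epithets as stand-ins for stable species keys are all hypothetical and not part of any CoL design.

```python
import hashlib

def circumscription_id(contents):
    """Derive an identifier from the contents of a taxon only.

    `contents` is a set of keys for the included children (here: bare
    epithets as hypothetical stand-ins for stable species keys).
    """
    canonical = "|".join(sorted(contents))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:12]

# Same genus name, different contents: the identifiers differ.
aus_sensu_smith = circumscription_id({"bus", "cus"})
aus_sensu_jones = circumscription_id({"bus", "cus", "dus"})

# Same contents, different family placement: the family (context) is
# never passed to the function, so it cannot influence the identifier.
aus_in_aiidae = {"family": "Aiidae", "id": circumscription_id({"bus", "cus", "dus"})}
aus_in_xiidae = {"family": "Xiidae", "id": circumscription_id({"bus", "cus", "dus"})}

print(aus_sensu_smith != aus_sensu_jones)          # contents changed
print(aus_in_aiidae["id"] == aus_in_xiidae["id"])  # only context changed
```

The design choice illustrated here is simply that context is kept out of the hashed input, so reclassification can be stored as metadata without touching the identifier.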

mdoering commented 7 years ago

I agree with excluding the position in the classification, the "context", from the taxon concept identity. As for the included species, we should find a way to allow newly described species to be added to a genus without changing its identity, as long as the species were not moved from another genus. See also

As for a final output I agree this needs to be written up somewhere else, probably as part of the API documentation. But in general I would like to get an agreement on the key points in these issues first instead of creating lots of documents that tend to consume a lot of overhead just for styling and explaining the context.

deepreef commented 7 years ago

Yes, exactly! Dave and I discussed this at some length at Woods Hole. What it boils down to is this: When a new species is described, what would its type specimen have been identified as prior to the new species description? If it would have been identified as an earlier-named species, then what we have is a case where one larger species was split into two smaller species. As such, the circumscription of the genus doesn't change. On the other hand, in the cases of "brand new" species, which would not have had ANY taxonomic identity prior to their description, the concept for the genus would need to change. Obviously, this is subjective in many cases, and not obvious. But from an informatics perspective, I think the cleanest answer involves how the links between names and their corresponding name-bearing types are made. I suggested to Dave that we could have a "retroactive identification" system, whereby it can be asserted that the type specimen of a new species would have been identified as species "X" prior to the new species description. This can be proxied through assertions of heterotypic synonymy, if we don't want to get all the way down to identifications of specimens. I will take some time this weekend to come up with a better diagram than what is shown above. I actually started one in Woods Hole, so I will finish that and then share it.

deepreef commented 7 years ago

By the way, if we can address this informatically, then we will have also created a REALLY valuable tool to distinguish "true" new species from "split" new species. An analogy of the difference is the two kinds of "Gaps" in CoL (actual taxonomic gaps, vs. synonym name gaps). This is important because it distinguishes cases of new species that increase our understanding of the scope of biodiversity, vs. cases of drawing new lines within our already-existing understanding of the scope of biodiversity. We've never been able to do this before.

mdoering commented 7 years ago

Would we want a genus concept to change when a brand new species gets added? It would mean genus identifiers change quite a lot over time and we lose stability. It might be more useful to restrict changes of genus concepts to true splits & merges of genera, ignoring the exact number of included species for the most part, and focus on the genus types as we discussed at some point. This needs real-world examples to test.

deepreef commented 7 years ago

It depends on how much you want to reflect reality, and it also depends on what you mean by stability. The unfortunate reality is that the meaning of a genus-level taxon concept DOES change when a truly new species is added. However, if we want less precise but more stable taxon identifiers for genera, then we can treat them the same way as species. That is, instead of defining them by the circumscription of all individuals, we can limit the definition to be circumscriptions of types (type species for genus concepts, and type specimens for species concepts). Unfortunately, as we discovered in our discussions at Woods Hole, we lose important information about taxa when we fail to distinguish the case of one species-level taxon that is split into two, vs. a brand new species being added (the impetus for the diagram in the photo you included above).

Also, "stability" is actually INCREASED with increased precision, because there is less subjectivity in the definition. The problem isn't a loss of stability, the problem is a proliferation of subtle variants (e.g., Aus Smith sec. Smith vs. Aus Smith sec. Jones). All of these variants are themselves stable; but they confuse matters because we have no good way to reflect the differences in meaning between two precisely-defined genus-level taxon concepts.

mdoering commented 7 years ago

Right, the genus concept changes when a new species is added when you look at the included species. But is this really useful for anyone?

It seems to me that what matters for defining the concept is rather how a genus is delimited from other genera. Merging and splitting again. For example, the genus Acacia can be referred to as the concept sensu lato, including all species nowadays in Vachellia, or sensu stricto, when you also acknowledge the existence of Vachellia.

deepreef commented 7 years ago

Personally, I'm happy with defining a taxon by the set of "types" it contains. That is, a "species" concept represents the sum of the species-group protonyms (as proxies for type specimens) assigned to it as heterotypic synonyms, and a "genus" concept is the set of genus-group protonyms (as proxies for type species) assigned to it as heterotypic synonyms. To me, that solves 80% of the problem with 20% of the effort. However, as we discussed in Woods Hole, this completely misses the ability to discern the "sensu lato/sensu stricto" cases where an existing species is split into two. That is, there is no way to distinguish "Aus bus Smith sec. Smith" (sensu lato) from "Aus bus Smith sec. Jones" (sensu stricto) -- when Jones splits Aus bus into Aus bus Smith sec. Jones and Aus dus Jones sec. Jones. The same applies to all ranks (genus and above).

Like I said, limiting it to heterotypic synonymy gets 80% of the job done with 20% of the effort. If we want to go beyond that, I think it would be better handled by a system of "RelationshipAssertions" (sensu TCS).
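A small Python sketch of the types-only definition may help show both what it captures and what it misses. The protonym labels, including the extra synonym "xus Brown", are hypothetical; `concept_key` is an illustrative name, not an existing API.

```python
def concept_key(protonyms):
    """Types-only concept identity: the set of protonyms treated as
    heterotypic synonyms of the taxon (proxies for their types)."""
    return frozenset(protonyms)

# Jones splits Aus bus into Aus bus (sensu stricto) and the new Aus dus.
bus_sec_smith = concept_key({"bus Smith"})   # sensu lato
bus_sec_jones = concept_key({"bus Smith"})   # sensu stricto
dus_sec_jones = concept_key({"dus Jones"})

# The split is invisible: both usages of "bus" carry the same protonym set.
print(bus_sec_smith == bus_sec_jones)

# A merge through heterotypic synonymy, in contrast, is detected.
bus_with_synonym = concept_key({"bus Smith", "xus Brown"})
print(bus_with_synonym != bus_sec_smith)
```

Telling the two "bus" usages apart would need extra information on top of the protonym set, for example the relationship assertions mentioned above.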

mdoering commented 7 years ago

Three implementations dealing with tracking taxon concept changes:

mdoering commented 7 years ago

Should the identity stay the same if just the name changes? E.g. one of the synonyms gets accepted, or the name changes its rank, e.g. a species is now considered a subspecies. Type- and concept-wise these are the same, so the identifier should not change, correct?

ThierryBourgoin commented 7 years ago

As we discussed already, but too briefly in Woods Hole, I think that defining a taxon (=concept) by its content is not enough or even may be useless.

A taxon (e.g. a genus) has its own definition. Adding or removing a species that fits its definition does not change the taxon definition: it remains the same while its sum has changed! In other words, trying to define a taxon by the sum of its species is not so useful: different sums could lead to the same taxon and then the same UI, which is not what we want, I suppose. I might be wrong, but I don’t see this as practicable for the issue of UIs. Additionally (even if it would probably be the best thing to do), I don’t think that we are going to suggest changing the UI each time we add a species to a genus or remove one from it.

Conversely, with its own definition a taxon carries a series of implicit characters that link it to a particular place in the hierarchy (classification or phylogeny). If you change the place where you hang this taxon, you change all these implicit characters that define the taxon = you change the full/complete definition of the taxon -> you change the taxon.

I feel that these are the changes that are really necessary to track, the ones that are important for CoL.

Not sure I’m clear here ;-)

mdoering commented 7 years ago

@ThierryBourgoin I see your point and it makes a lot of sense. There are various ways to look at what the essence of a taxon is and exactly this is why we need to agree on one definition.

We should probably step back and approach the problem from a users perspective. What does a user want from a CoL taxon and why does it need an identifier at all?

1) someone uses the catalogue at some point and wants to have a persistent reference to the exact version he was looking at that time. That would require a fully versioned CoL with every change triggering a new identifier.

2) people have identified an organism to a CoL taxon, e.g. a specimen or observation. They want access to the current view of the "same taxon" in the CoL that still represents that organism observed. But maybe with a different name, classification or other updated "metadata". This does not require a taxon concept id per se, just a way to get to the (different) identifier for the latest version of the same concept. The concept identifier basically is internal only - but the system still needs to know about concepts. This mostly applies to species- and infraspecific taxa so we probably would not need to worry about higher taxa, but maybe genera.

3) researchers want to aggregate species related information from different systems, all linked to CoL taxa. They want to be sure the different systems talk about the same taxon concept and information can safely be transferred and merged. This seems to require shared concept ids.

From the above I feel we need 2 identifiers: one for the exact version, and one for the taxon concept, to assert that a concept is the same.

The question now is how to know that a concept (as in the set of all theoretically included individuals) is the same. We can either find a way to detect that automatically or rely on experts to tell us. The problem with experts is that they will apply different judgments to what concepts are, so we would see the equality of concepts judged very inconsistently across the various groups. Something that can be asserted by a computer will be much more useful, as it's predictable and comparable across all groups.
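To make the two-identifier idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `TaxonRecord` fields, the hashing scheme, the placeholder key "protonym:aroma", and the choice of the protonym set as a computable stand-in for "the same concept".

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class TaxonRecord:
    name: str          # current accepted name
    rank: str
    parent: str        # classification (context)
    protonyms: tuple   # computable stand-in for the included organisms

def _digest(payload):
    text = json.dumps(payload, sort_keys=True)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]

def version_id(rec):
    """Changes whenever anything about the record changes (use case 1)."""
    return _digest(asdict(rec))

def concept_id(rec):
    """Changes only when the protonym set changes (use cases 2 and 3)."""
    return _digest(sorted(rec.protonyms))

before = TaxonRecord("Acacia aroma", "species", "Acacia", ("protonym:aroma",))
after = TaxonRecord("Vachellia aroma", "species", "Vachellia", ("protonym:aroma",))

print(version_id(before) != version_id(after))  # name and parent changed
print(concept_id(before) == concept_id(after))  # same concept under this proxy
```

Because both identifiers are derived by a function rather than assigned by experts, the result is predictable and comparable across groups, which is the property asked for above. Whether the protonym set is a good enough proxy is exactly what the rest of the thread debates.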

deepreef commented 7 years ago

Thanks, @ThierryBourgoin and @mdoering -- this is helpful. This conversation is touching on the same problems of communication that have plagued these discussions for several decades now (going back at least to the 1980s). Fundamentally, we have different ideas about two issues:

Issue 1 is about which "things" (conceptual entities) we care enough about to label with a persistent identity. Included within this issue is the question of how to explicitly define these "things", so we know when the properties of one thing (represented by its persistent identifier) should be changed (without changing the identifier), vs. when a new "thing" is needed (with its own distinct identifier). At the heart of this issue is which properties of a "thing" define it (i.e., collectively represent its "essence"), and which merely represent relevant metadata associated with that "thing", which may be altered without altering the essence of the "thing".

Issue 2 is about semantics, that is, which terms we use to label each class of "thing". The most problematic terms are "name" and "concept". Both have various synonyms and homonyms in our conversations. What has become clear as a result of MANY conversations almost exactly like this one is that we probably have five or six different classes of "things" that we have, over the years, tried to force-fit into two terms ("name" and "concept").

My fear is that if we do not confront these two issues now, we will make very little progress solving these problems from an informatic perspective. Having dealt with these issues (from an informatics perspective) for many years, these are the "things" that I have found useful for persistently representing conceptual objects in the biological taxonomy realm:

Thing 1: An individual human being, or an entity representing an organization created by human beings. I have used the term "Agent" to refer to this Thing.

Thing 2: A text-string label used to represent an instance of Thing 1 ("Agent"), often parsable into "Surname" and "GivenName" (for people), or a hierarchy of names (for organizations). I have used the term "AgentName" to refer to this Thing.

Thing 3: Documentation instance representing assertions made by one or more instances of Thing 1 ("Agent"), at a particular moment in time. The documentation may be a type of publication, or it may be some other form of static documentation. The word "static" here is critical, because the documentation instance represents a snapshot in time, and thus does not change. For retrieval purposes, it is best to associate each instance of Thing 3 with instances of Thing 2 (AgentName), instead of directly with instances of Thing 1 (Agent). I have used the term "Reference" to refer to this Thing.

Thing 4: A string of text characters, typically represented electronically in the form of UTF-8 encoded text, or printed in the form of glyphs rendered as ink on paper, which serves as a Linnean-style scientific name. These text strings may or may not include components representing taxonomic rank, delimiters (such as parentheses), and authorship information (various styles, formatting and with or without years). I have used the term "NameString" to refer to this Thing.

Thing 5: A specific instance of a Linnean-style taxon name represented as a conceptual entity. This applies to a particular unit of a compound name (not the full combination), which has a particular type (specimen or name) in the context of Codes, a particular rank (in the sense of Linnean ranks), and a particular authorship associated with the creation of the name. This is different from instances of Thing 4 (NameString) in that it is conceptual, not literal. The essence of an instance of Thing 5 is independent of the text string used to represent it. For example, the same instance of Thing 5 might be represented by different text strings (e.g., different genus combinations for a species, different ranks, different spellings, etc.), and more than one instance of Thing 5 might share the same text string (e.g., homonyms, homographs). I have used the term "Protonym" to refer to this Thing.

Thing 6: A particular treatment or usage of an instance of Thing 5 (Protonym) within the context of an instance of Thing 3 (Reference). Important properties of instances of Thing 6 include the exact spelling of the specific name unit (e.g., the species epithet) as it appears within the instance of Thing 3 (Reference), what taxonomic rank the instance of Thing 5 (Protonym) was asserted as within Thing 3 (Reference), whether or not the instance of Thing 5 (Protonym) was treated as a valid taxon or as a heterotypic synonym of another taxon, and a link to another instance of Thing 6 representing the immediate hierarchical taxonomic parent (e.g., the genus into which a species is placed). I have used the term "TaxonNameUsage" to refer to this Thing, but it could also be referred to as "TaxonTreatment" or just "Treatment" (following how PLAZI uses that term).

Thing 7: The set of biological organisms, including individuals that are dead, alive, and yet-to-be-born, which are explicitly or implicitly included within an asserted Taxon. THIS IS THE THING THAT WE ARE DISCUSSING. Most people I have discussed these issues with over the years have applied the terms "TaxonConcept" and "Circumscription" interchangeably to refer to this Thing. However, as per @ThierryBourgoin's comments above, perhaps we do not have universal agreement that "Concept" and "Circumscription" are synonymous terms. Therefore I propose we use the term "Circumscription" to represent this Thing, to avoid confusion going forward.

Thing 8: This is the Thing that @ThierryBourgoin refers to in his comment above as a "Concept". Basically, its properties include elements of both Thing 7 (Circumscription, or set of included child entities) and Thing 6 (TaxonNameUsage/Treatment), such as the hierarchical classification, treatment as valid or not, and how the name is spelled. Therefore, it is different from Thing 7 (Circumscription) because it is defined by more than just the child items it contains, but it's not the same as an instance of Thing 6 (TaxonNameUsage/Treatment), because there may be many instances of Thing 6 (TaxonNameUsage/Treatment) that all imply the same instance of Thing 8.

I apologize for this long post, but there is a reason we've never solved this issue as a community during the past few decades. Unfortunately, most of that reason has to do with miscommunication, and most of the miscommunication has to do with a mixture of how we define our core objects (Issue 1) and what terms we use to represent them (Issue 2; i.e., semantics).

I believe that we already have well-tested, non-contentious definitions for Things 1, 2, 3, and 4. After the dinner conversation in Woods Hole, I am confident we can fairly quickly settle on a clear definition for Thing 5. If we can achieve that, then the definition of Thing 6 is extremely easy. Therefore, the real issue for us to deal with is whether Thing 7 and Thing 8 need to be different Things, or if we can adequately accommodate them with a single Thing. Originally I thought we could get by with a single Thing, but after the comments by @ThierryBourgoin and @mdoering above, it seems we should seriously consider defining them as separate things, each with their own identifiers.

In either case, I think it's important that we understand the difference between defining what Things we need to manage in CoL-Plus, and deciding which terms to use to refer to those defined things. I think it would be a grave mistake to start defining data models and such until after we come to consensus on the Things we're managing, and the terms we're using to refer to those things.

Phew... and this is just the BEGINNING of the discussion!
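As a reading aid only, and not as a proposed data model, Things 3, 5, 6 and 7 might be sketched in Python as follows. The field names are hypothetical, the parent is simplified to a plain string instead of a link to another usage, "eus Brown" is an invented synonym, and the circumscription is approximated by heterotypic synonymy within one Reference.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Reference:                 # Thing 3: static documentation instance
    citation: str

@dataclass(frozen=True)
class Protonym:                  # Thing 5: conceptual name unit
    epithet: str
    author: str

@dataclass(frozen=True)
class TaxonNameUsage:            # Thing 6: usage of a Protonym in a Reference
    protonym: Protonym
    reference: Reference
    spelling: str
    rank: str
    parent: Optional[str] = None            # simplified classification link
    synonym_of: Optional[Protonym] = None   # None means treated as valid

def circumscription(valid, usages):
    """Proxy for Thing 7: the protonyms placed under `valid` as
    heterotypic synonyms within the same Reference, plus its own."""
    members = {valid.protonym}
    for u in usages:
        if u.reference == valid.reference and u.synonym_of == valid.protonym:
            members.add(u.protonym)
    return frozenset(members)

smith, jones = Reference("Smith"), Reference("Jones")
bus, eus = Protonym("bus", "Smith"), Protonym("eus", "Brown")
usages = [
    TaxonNameUsage(bus, smith, "bus", "species", parent="Xus"),
    TaxonNameUsage(eus, smith, "eus", "species", synonym_of=bus),
    TaxonNameUsage(bus, jones, "bus", "species", parent="Aus"),
    TaxonNameUsage(eus, jones, "eus", "species", synonym_of=bus),
]
circ_smith = circumscription(usages[0], usages)
circ_jones = circumscription(usages[2], usages)
print(circ_smith == circ_jones)  # same Thing 7 despite different genera
```

A Thing 8 in this sketch would add properties such as `parent` to the key, so the two usages above would share a Thing 7 but not a Thing 8.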

deepreef commented 7 years ago

One more point... in response to the comment by @mdoering above, "versioning" of CoL representations can be handled in several ways: 1) Internally using version histories for the same identifiers plus a date-stamp; 2) Generating new identifiers to represent each version; 3) Capturing each new version via a new instance of Thing 6 (with Reference representing CoL as the Author and the date of the change as the date, and the properties of spelling, validity, classification, etc.)

There are other ways as well, but #3 above represents the simplest in terms of coding and implementation.
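Option 3 can be pictured as an append-only log of usages. The sketch below is a guess at what that could look like, not the CoL implementation: the `ColUsage` fields, the helper names, the dates and the "protonym:aroma" key are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ColUsage:
    """One CoL assertion about a protonym at a given date (a Thing 6
    whose Reference is CoL itself)."""
    protonym: str
    asserted_on: date
    spelling: str
    parent: str
    valid: bool

log = []  # append-only: existing entries are never edited

def record_change(usage):
    log.append(usage)

def history(protonym):
    """All versions for a protonym, oldest first."""
    return sorted((u for u in log if u.protonym == protonym),
                  key=lambda u: u.asserted_on)

def current(protonym):
    """The most recent version, or None if the protonym is unknown."""
    versions = history(protonym)
    return versions[-1] if versions else None

record_change(ColUsage("protonym:aroma", date(2017, 4, 1), "aroma", "Acacia", True))
record_change(ColUsage("protonym:aroma", date(2018, 9, 1), "aroma", "Vachellia", True))

print(current("protonym:aroma").parent)   # latest classification
print(len(history("protonym:aroma")))     # earlier versions remain retrievable
```

Since each change is a new immutable record, no separate version table or identifier minting is needed, which is presumably why this option is described as the simplest to code.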

mdoering commented 6 years ago

Linking the drawing from the Woods Hole CoL meeting April 2017 illustrating changing concepts (numbers) over time with types indicated by colored dots: Concept Changes

The original single species A.bus gets split into A.bus and A.fus. A.bus s.str. is then merged with A.xus. Knowing the types alone is not enough in all cases, otherwise A.bus s.l. (1) would be the same as A.bus s.str. (2). But when you know about all the species within the genus and know that A.bus is also a pro parte synonym of A.fus, you can derive the unique concepts.
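One possible way to derive the unique concepts from types plus pro parte synonymy is sketched below in Python. The snapshot structure and the signature function are assumptions made for illustration; the years follow the reading of the figure given in the next comment.

```python
def concept_signatures(snapshot):
    """Map each accepted taxon in a snapshot to a comparable signature:
    (own types, pro parte synonyms it absorbs, types of taxa that carved
    a part out of it)."""
    signatures = {}
    for name, taxon in snapshot.items():
        carved_by = set()
        for other_name, other in snapshot.items():
            if other_name != name and other["pro_parte"] & taxon["types"]:
                carved_by |= other["types"]
        signatures[name] = (
            frozenset(taxon["types"]),
            frozenset(taxon["pro_parte"]),
            frozenset(carved_by),
        )
    return signatures

snapshots = {
    1960: {"A. bus": {"types": {"bus"}, "pro_parte": set()}},
    1970: {"A. bus": {"types": {"bus"}, "pro_parte": set()},
           "A. fus": {"types": {"fus"}, "pro_parte": {"bus"}},
           "A. xus": {"types": {"xus"}, "pro_parte": set()}},
    1980: {"A. bus": {"types": {"bus", "xus"}, "pro_parte": set()},
           "A. fus": {"types": {"fus"}, "pro_parte": {"bus"}}},
}

all_concepts = {sig for snap in snapshots.values()
                for sig in concept_signatures(snap).values()}
types_only = {sig[0] for sig in all_concepts}

print(len(all_concepts))  # concepts distinguished with pro parte knowledge
print(len(types_only))    # concepts distinguished by types alone
```

With types alone, A.bus in 1960 and A.bus in 1970 collapse into one key; adding the pro parte information separates them, while A.fus keeps the same signature in 1970 and 1980.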

ThierryBourgoin commented 6 years ago

I think we need to be precise here about the words we use… (concept, step of concept

If I read the figure correctly, we have only 3 different taxonomic concepts here: A. xus, A. bus and A. fus.

1960: taxon A. bus s.l. is described (1).

1970: taxon A. xus (4) and taxon A. fus (3) are described. Some specimens of A. bus s.l. belong to A. fus. We have 2 new concepts (3) and (4) + 1 old concept (1), more restricted BUT still the same concept.

1980: A. xus is synonymized with A. bus s.s.; A. fus remains. We have 2 concepts: (1), in still another step, and (3).

1, 2 and 5 are different stages/steps of the same taxonomic concept.

Type-bus (red dot) is the same in all stages/steps of the life of the same taxon A. bus (s.l., s.s., and including A. xus). So yes, a type does represent all the stages of the life of a taxon, but this is not what it is supposed to do: it just bears the name for this taxon. The type has nothing to do with understanding the concept; it is just the name-bearing specimen for this concept. This specimen is only one among the many others that "make" the taxon; it provides the link between nomenclature and taxonomy.

In this example the taxonomic concept for A. bus remains the same; it just evolves over time as its content (= extension) becomes more or less restrictive (different steps/numbers of the same concept): successive stages/steps 1, 2 and 5. => the same concept may have different successive names according to its extension.

However, concepts are defined 1) by their content (extension = the set of children-taxa/specimens to which the concept applies) AND 2) by intension (the list of its characters = its description), and not by the type specimen. If a taxon is transferred to another parent taxon with its set of children-taxa (a genus from one tribe to another tribe, a species from one genus to another genus), it changes by intension (its characters/description are changed). Accordingly, in that case this is no longer the same concept; we have 2 concepts: an old one and a new, different one, although it keeps the same name (except for the brackets in the case of species transferred to another genus)! => the same name (particularly in supraspecific taxa) may refer to different concepts.

This is why 1) defining taxa by their extension only remains insufficient (my issue at the Woods Hole meeting) and 2) speaking of a taxon without referring to its classification (e.g. sec. author) might introduce strong bias, if not outright errors, in any taxonomic database if we don’t take care of these very particular inferred links (my point/talk at the Xishuangbanna meeting).

mdoering commented 6 years ago

@ThierryBourgoin so you say all 163 Acacia species that have been moved to the genus Vachellia should be considered different taxa describing a different set of organisms? Identifications to Acacia aroma cannot be safely transferred to Vachellia aroma as their circumscription is different?

mdoering commented 6 years ago

@ThierryBourgoin can you explain what you have in mind when the concept is more restricted but still the same concept? That sentence to me contradicts itself. If some specimens/organisms are excluded it is clearly different.

ThierryBourgoin commented 6 years ago

Let me try an example of what I have in mind:

The taxonomic concept of the giraffe (G. camelopardalis) has recently been disputed (and still is, as far as I know), and the species concept has been "restricted" to the "Northern giraffe", while 3 other species were recognized (the reticulated, Southern and Masai giraffes)… I regard the initial taxonomic concept of what G. camelopardalis (s.l.) is as still being the same, but it has been restricted (s.s.) to the north African populations.

Let us say that new analyses conclude in the future that this is not the case for 2 of them, the Southern and the Masai taxa. These 2 separated species would then come back "inside" the taxonomic concept of G. camelopardalis, which would be more widely understood than now but still less widely than originally. These are just successive steps in the circumscription of the same concept viewed by extension.

Now let us say that a new analysis by author NNN shows that the giraffe is not a ruminant (Ruminantiamorpha) and should be moved from Giraffidae to the whales in Balaenidae ;-) Then Giraffa would be characterized by its own characters of course (the ones that allow us to recognize the set of all its included subtaxa) but also by all the characters of Balaenidae and not those of Giraffidae. For me this new definition by intension (a new list of characteristics of Giraffa, including those of Balaenidae) would make the taxonomic concept a totally different one for Giraffa sec. NNN.

I don't know if I can put it this way, but in other words I would say that changing the content of a taxon does not change it (as a taxonomic concept), but changing the characteristics that it shares with other taxa (which is what we do with taxonomic transfers) does. In your example, Acacia and Vachellia each remain the same concept, with a more restrictive or wider understanding of their taxonomic concept respectively, but Vachellia aroma and Acacia aroma are two different taxonomic concepts.

mdoering commented 6 years ago

Thanks @ThierryBourgoin. For identification purposes it is important that we capture the different opinions over time. In the terminology I propose here, this means the concept of which populations are in and which are out does change, even though the type remains. In your example of a hypothetical merge of the Southern and Masai species back into G. camelopardalis, we would actually have 3 different concepts over time, all known under G. camelopardalis. Referring to all 3 of them as the same concept would not allow us to deal with identifications accurately.

Take a look at iNaturalist to see why that is important for handling (historical) identifications: https://www.inaturalist.org/pages/curator+guide#changes Actual changes they track (unfortunately both Acacia and Giraffe are outdated): https://www.inaturalist.org/taxon_changes

A good bird example for a split based on distribution ranges: https://www.inaturalist.org/taxon_changes/32924

mdoering commented 6 years ago

I do understand your point about intension. The classification should be significant for the characters that define the taxon. But in many cases these do not alter the unit of populations that make up the taxon. The important part is that as long as the populations which make up the taxon do not change, the taxonomic concept has not changed, even if the circumscription might now include more or fewer characters. The primary anchor point is the group of populations that form a stable unit, not how exactly we characterize them. From Wikipedia:

In biology, a taxon is a group of one or more populations of an organism or organisms seen by taxonomists to form a unit. It is not uncommon, however, for taxonomists to remain at odds over what belongs to a taxon and the criteria used for inclusion.

mdoering commented 6 years ago

As in practice it is difficult to assess whether a change in characters has an actual effect on the size of the included populations it probably makes sense in some cases to track these concepts in their minute details. But this leads us again to an explosion of concepts. Every identification key will define its own concept, every change in classification yet many more. For the purpose of dealing with identifications, whether observations in GBIF or specimens in collections, we would like to have more stable identifiers though than names, not less stable ones.

ThierryBourgoin commented 6 years ago

Hi Markus,

I think we agree. The same taxon might undergo different changes in its concept (how it is understood), and tracking these changes is crucial. But not all these changes are equal.

;-), Th.

mdoering commented 6 years ago

yes. All 3 are probably best dealt with as different identifiers if you need all of them. I am just not sure if we do have users that need all of them. For number two I am sure we have.

dremsen commented 6 years ago

I'm very happy to see this thread back in action and wish to contribute constructively. I need to spend a bit more time reviewing all of this to get back in this frame of mind but I have two immediate comments.

I do not believe that the addition of a new specimen to a taxon changes the concept. The concept is not the specimen. The link between the identifier and the specimen is only through the concept. This is very clear within the famous Triangle of Reference model. In taxonomy, concepts are ideas expressed as publications (sometimes poorly) and anchored with the type. Specimens conspecific with the type are instances of the concept, not new concepts. This is why heterotypy must be the means by which concepts are expressed. The giraffe example is almost identical to the graphic example from Woods Hole (which shows five distinct concepts).

I remain unsettled regarding the higher classification being a property of the concept. Paul Kirk and Jerry Cooper were very resolute on this matter in regard to homotypic synonymy where a taxon was transferred to a different genus. No circumscription change and hence no concept change. A genus transfer is just a smaller iteration than a transfer to a higher group.

If a giraffe is transferred from the ruminants to the whales, then I can see this being a major change in what the whale group is but has the giraffe changed? I can see where a single concept might be sorted into different categories by different parties without the concept itself having to be changed.

For example, when David Patterson inserts the Choanoflagellata as a parent for all metazoa in his Union classification, does he really create all new concepts for all the fulgorids?

DR

ThierryBourgoin commented 6 years ago

Hi Dave. Yes I'm also happy to see all this back again... ;-)

In fact my point here is that

ThierryBourgoin commented 6 years ago

In fact my point here is that I would like to be sure that we don't have to redo this exercise later because the schema we are using to represent taxonomic knowledge is not complete enough. It was not necessary 20 years ago to separate names from taxa...

;-) Th.

dremsen commented 6 years ago

Thierry, certainly I agree with this last sentiment and so wish to be very careful. We need an identifier system that is tractable and has practical value while at the same time being precise enough to have meaning. My perspective is mainly as a user with a particular set of use cases and as a developer examining and trying to model concepts as presented in monographs and faunas.

mdoering commented 6 years ago

If there is no use case I don't think we should implement it. Keep things simple. It is not bad to refactor things in a few years, but to create something which is not used in the first place is wrong.

The ever-changing identifiers in the CoL have been a huge problem for its uptake; we need something far more stable. And in my opinion (based on use cases from GBIF, collections, iNaturalist and others) we need something to hold on to a stable taxon regardless of its name. Such a taxonID paired with a nameID is very powerful and would be a serious game changer.

dremsen commented 5 years ago

I saw the update came in and wanted to check in. Where do we stand on taxon concept IDs? I've been giving them a lot of thought recently. I think there are use cases for them. I think they are tractable. I think we can accommodate Thierry's interest in supporting the classification as a component of them. But, referring to a 180918 comment of Thierry's, a separation of names from taxa, or more specifically, syntax from semantics, is a requirement.

mdoering commented 5 years ago

I talked with Nico Franz about this in Leiden and he is considering looking for funds. I basically still believe the original idea we had in Woods Hole makes a lot of sense and I would like to implement that next year as an experimental feature. It came up in Leiden as a requirement for many people/projects, e.g. legal documents of the EEA.

The issue about including the classification in it needs discussion, but I am convinced that we should go for two different concept ids in that case. One that includes the classification and one that purely looks at the included set of organisms.

dremsen commented 5 years ago

Agree with you on both counts. I have some rebuttals regarding whether the classification really is a concept-by-intension component of the concept, but if they are separate IDs where the classification is distinct from the circumscription, and the circumscription is based on the sets of included protonyms, then I'm right with you. I see this as a requirement for many other types of users too, especially in eco and conservation uses.
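To make the two-identifier idea concrete, here is a minimal sketch, assuming a circumscription is represented as a set of protonym keys and the classification as an ordered list of parent names. The function names, the hashing scheme, and the protonym keys are all hypothetical illustrations (using the Aus/Aiidae/Xiidae example from the top of the thread), not anything CoL has implemented.

```python
import hashlib


def circumscription_id(protonyms):
    """Hypothetical ID derived only from the set of included protonyms."""
    # Sorting a de-duplicated set makes the ID independent of input order.
    canonical = "|".join(sorted(set(protonyms)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def classified_id(protonyms, lineage):
    """Hypothetical ID derived from the protonym set plus the higher classification."""
    # The lineage is ordered (e.g. family, order, ...), so it is not sorted.
    canonical = circumscription_id(protonyms) + "@" + ">".join(lineage)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Same contents of genus Aus, placed in different families (assumed example data).
smith = {"protonyms": ["bus", "cus", "dus"], "lineage": ["Aiidae"]}
jones = {"protonyms": ["dus", "cus", "bus"], "lineage": ["Xiidae"]}

same_contents = circumscription_id(smith["protonyms"]) == circumscription_id(jones["protonyms"])
same_context = (
    classified_id(smith["protonyms"], smith["lineage"])
    == classified_id(jones["protonyms"], jones["lineage"])
)
print(same_contents, same_context)
```

Under these assumptions the circumscription ID stays the same when only the placement changes, while the classification-aware ID changes, which is the separation being discussed here.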

mjy commented 5 years ago

I talked with Nico Franz about this in Leiden and he is considering looking for funds.

If Nico can make his engine available I think many questions are resolved.

dremsen commented 5 years ago

I've had good conversations and relations with Nico and worked hard to verify we are in congruence in our views on concepts. I think he's great.

ThierryBourgoin commented 5 years ago

Thanks Dave. Unfortunately our paper was rejected last week, as one reviewer said in 5 lines (...) that NCBI has already done that and our proposal is not practical! My only aim with this paper was to alert that separating names from taxa might not be enough to fully report taxonomic knowledge in the future ... probably this was not clear enough 😕.

Anyway, we are working on a new version with Nicolas, René and Regine (in copy), and we'll try to tackle the issue from the iUID perspective by providing some rules for when we should consider having a new taxonomic concept when the taxon definition by extension or intension changes. But in fact we are not even sure that the taxonomic triplet (name, taxon, classification) is enough for a complete, accurate formalisation of a taxon for digital purposes... we are also working on this. BW. Th.

/ Th. Bourgoin - iPhone

dremsen commented 5 years ago

Nico's system relies on articulations that assert concept relations but requires a source for them. In his demonstrations these usually are supplied from an external source. The protonym model provides the means to supply computable articulations to that system.

mdoering commented 5 years ago

Yes, one should be able to retrieve most articulations from the basionym/protonym relations as the proxy for a shared type. BUT these are not necessarily equals relations if you think about the image above and splits and merges. I would also think it is relevant to know WHEN the taxonomy was last updated, as one can then discard names that have been described since. E.g. if it's known that a split occurred later, then the use of that name might be precisely linked to the former sensu lato concept. I am not convinced we can encode all our knowledge into RCC5 relations easily. Or if so, this at least is the difficult part.
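As a rough illustration of what a protonym-derived articulation could look like, here is a minimal sketch that maps two sets of protonym keys onto the five RCC5 relations (equal, proper part, inverse proper part, partial overlap, disjoint). The function name, the short relation codes, and the example keys are assumptions for illustration only. The result is at best a candidate articulation: two concepts with identical protonym sets would come out as equal here even where one is the sensu stricto result of a later split.

```python
def rcc5(a, b):
    """Candidate RCC5 relation between two concepts given as sets of protonym keys."""
    a, b = frozenset(a), frozenset(b)
    if a == b:
        return "EQ"   # same protonym set (not proof of congruent concepts)
    if a < b:
        return "PP"   # a is a proper part of b
    if a > b:
        return "PPi"  # b is a proper part of a
    if a & b:
        return "PO"   # partial overlap
    return "DR"       # disjoint


# Assumed example data: a sensu lato concept and a narrower later one.
sensu_lato = {"bus", "cus"}
sensu_stricto = {"bus"}
print(rcc5(sensu_stricto, sensu_lato))
```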

dremsen commented 5 years ago

Date is indeed important. But this should be a component of the source properties I should think.

rdmpage commented 4 years ago

Feels like I'm gate crashing this discussion, but blame @mdoering for bringing this thread to my attention ;) Frustrated by my attempts to make sense of taxa and taxonomic names in Wikidata (which is fast becoming the broker for taxonomic identifiers, and indeed any sort of identifier) I have been revisiting taxonomic concepts, etc. When will I learn?

Reading this thread I'm overwhelmed by a sense of "here we go again"; however, I want to suggest an approach that I think would be both doable and create real value for the wider community. @deepreef https://github.com/CatalogueOfLife/general/issues/6#issuecomment-300336258 teased out eight(!) things that are being discussed, which I basically agree with, except I would junk 7 and 8. That is, I don't think defining taxa intensionally (#8) makes much sense (this is something you compute based on a tree after the fact; it conflates defining something with learning about it), and I don't actually think circumscription is something a taxonomic database is best for (#7), in the sense that the bulk of "circumscription" is happening elsewhere (e.g., iNaturalist users saying this is a photo of "x", DNA barcoders saying this sequence belongs in BIN "y", etc.). Even if a taxonomic database had circumscriptions, why would iNaturalist or BOLD or even GBIF use those rather than the circumscriptions they generate themselves? We can get higher taxon circumscriptions easily enough from a classification, but the notion that a changing set of species means the genus is somehow a different taxon seems unhelpful. And don't get me started on the bizarre approach the Atlas of Living Australia takes to changing taxon identifiers almost daily.

So, this leaves #5 and #6, namely "protonyms" and "usages" (I'm taking #1 - #4 as essentially given, maybe subject to tweaks).

So, as I sketch out in the blog post "Taxonomic concepts: a possible way forward", it seems to me that a really useful tool would be something like this:

First, every protonym gets a nice, human-readable identifier, for example a combination of species epithet, author, and year. Whatever it takes to be human readable and unique (the blog post talks about previous efforts at "uninomial" nomenclature, which is the inspiration). Linked to this identifier is every homotypic synonym of that name. This would enable a user, for example, to have a stable identifier for a species that didn't change when the species was moved to a different genus. This is essentially #5 (I think). One immediate advantage is that the sort of classification comparison that, say, eBird does becomes available to all, because there are stable identifiers for each species name (and all its variations). It would make Wikidata's life easier as it would need only one of these identifiers for each species (regardless of what particular genus and species pair it treats as accepted).

Then imagine that same identifier is linked to every "usage" (name + reference pair) that we consider to be relevant, including heterotypic synonyms. This would enable a user to generate things like the current name and all synonyms, as well as go back and generate a snapshot of what the taxonomy was in, say, 1990. I think this is basically an aggregation of #6, and is close to the notion of a taxon concept being an "according to" statement.
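A minimal sketch of how such an identifier and its linked usages might hang together, assuming the identifier is simply a lower-cased epithet-author-year slug and each usage is a name + reference pair with a year. The names `protonym_id`, `Usage`, `name_as_of`, and all of the data are hypothetical; real identifiers would need extra rules to guarantee uniqueness, and a real snapshot would need to distinguish accepted usages from synonym citations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    name: str       # name string as used in the reference
    reference: str  # citation for the usage
    year: int       # publication year of the reference


def protonym_id(epithet, author, year):
    """Hypothetical human-readable identifier: epithet-author-year."""
    return f"{epithet}-{author}-{year}".lower().replace(" ", "-")


# Assumed example data: one protonym and the usages linked to it.
USAGES = {
    "bus-smith-1900": [
        Usage("Aus bus", "Smith 1900", 1900),
        Usage("Xus bus", "Jones 1985", 1985),
        Usage("Aus bus", "Brown 2005", 2005),
    ]
}


def name_as_of(pid, year):
    """Name used in the most recent usage published in or before the given year."""
    known = [u for u in USAGES.get(pid, []) if u.year <= year]
    return max(known, key=lambda u: u.year).name if known else None


def names_ever_used(pid):
    """All distinct name strings linked to the identifier."""
    return sorted({u.name for u in USAGES.get(pid, [])})


pid = protonym_id("bus", "Smith", 1900)
print(pid, name_as_of(pid, 1990), names_ever_used(pid))
```

The identifier stays the same across the genus transfers, and the 1990 "snapshot" is just a filter on the usage list by year.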

One could imagine an interface (both web and API) a bit like:

Everything else (actual "content" of each taxon, implications for characters of taxa, etc.) are all things one could compute from the classification if you wanted, but I think these are really separate things. And I struggle to see the demand for them globally, as opposed to what may well be intense interest in specific cases.

But I think there is a global need for a stable way to refer to a "taxon", and I think this might be a way forward. It's one step beyond names in that it is expressly linked to information about the name and its use, but it's relaxed enough for someone to be able to just link to an identifier without having to determine if the "concept" exactly aligns. It avoids what feels like a black hole of defining taxa by extension or intension.

If, for example, the identifiers were DOIs, clean and human readable, I imagine this could be enormously useful, and solve genuine and tractable problems.

dremsen commented 4 years ago

Happy to read Rod’s post. The protonym model is the way to model concepts. I’ve argued for this for too many years. It was the basis for the uBio data model. It separates syntax from semantics, providing an objective basis for defining computable taxon concepts. This separation is critical and remains a fundamental problem for the long term viability of the CoL because one cannot mint taxon identifiers without it. A list of the world species without the means to properly provide species identifiers is a problem.

In uBio we had NameBank, which grouped strings into lexicons into names into protonym groups. In ClassificationBank taxa were (implicitly) groups of protonyms. Different treatments of the same name could be compared by their protonym array. This is how taxa are represented within annotated catalogs and treatments. Circumscriptions via specimens or literature are always tied to a name that is tied to a protonym or a treatment (asserting a taxon inclusive of a set of protonyms). The structure was there, I just didn’t have all the data properly mapped.

Until we have a system that cleanly separates names from concepts (i.e., syntax from semantics) we don’t have the right system. When we do, we can properly catalog objective synonyms independently from subjective synonymic assertions, we can acquire a useful objective dataset that we don’t have to toss every time new evidence changes a taxon or pulls a GSD, etc., and we can enable an inclusive and applied taxonomic infrastructure that doesn’t artificially cover up the natural flux and ferment that is taxonomy. We can also support the more granular and refined taxonomic use cases required by the Nico Franzes of the world.

The only question for the COL should be whether we are in this space once and for all or just waiting for someone else to do it. Until then, the job isn’t done.

mdoering commented 4 years ago

Thanks @rdmpage. Your design following just the protonym/type was what I had initially hoped would solve it for us too. But I think this is flawed for really important use cases. We want stable taxon ids to track splits and merges so that an occurrence of species A. bus sensu 1960 on the whiteboard is not confused with A. bus sensu 1970. These are two different taxa, with the 1970 one being a subset of the 1960 one. There are 5 concepts in those 6 usages in the diagram, which I would really like to attach 5 different ids to. That's what Avibase does.

dremsen commented 4 years ago

Is there a place where that is modeled (or laid out) so we can look at those cases? To which diagram are you referring?

mdoering commented 4 years ago

@dremsen The picture of the whiteboard at the top of the github discussion you are on, Dave

dremsen commented 4 years ago

Sorry, was using email, not GitHub. Moving now.

dremsen commented 4 years ago

Would Avibase model that with 5 concept IDs or 6 concept IDs?

dremsen commented 4 years ago

Is there a previous discussion on modeling splits?

rdmpage commented 4 years ago

Thanks @rdmpage. Your design following just the protonym/type was what I had initially hoped would solve it for us too. But I think this is flawed for really important use cases. We want stable taxon ids to track splits and merges so that an occurrence of species A. bus sensu 1960 on the whiteboard is not confused with A. bus sensu 1970. These are two different taxa, with the 1970 one being a subset of the 1960 one. There are 5 concepts in those 6 usages in the diagram, which I would really like to attach 5 different ids to. That's what Avibase does.

@mdoering There's no reason why this can't be included. In the same way you could, say, append a time stamp to the t/id, you could imagine doing the same for a specific usage, so you would have an id for a specific usage if you wanted. As an analogy, imagine a web page showing the history of a name where each usage (name + reference) has a fragment identifier, e.g. #1970. The idea of suffix identifiers comes from ARKs, which I don't particularly like as an identifier, but they do support suffixes (one could also mint DOIs with suffixes). Whatever the implementation, I think you can have what you seek. We could regard identifiers as hierarchical. By default you get the original name /n/; if the system has a list of usages then /t/ gives you that, and /t/xxx#1970 gets you a specific usage. I guess I envisage some sort of graceful degradation where you always get something.
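A minimal sketch of that graceful degradation, assuming identifiers of the form /n/<id>, /t/<id> and /t/<id>#<year>. The `resolve` function, the response shape, and the lookup tables are hypothetical; in a real web deployment the fragment is normally handled by the client rather than sent to the server, so this only illustrates the fallback logic.

```python
# Assumed example data: original names and the usages known for some of them.
NAMES = {"bus-smith-1900": "Aus bus Smith, 1900", "cus-smith-1900": "Aus cus Smith, 1900"}
USAGES = {
    "bus-smith-1900": [
        {"name": "Aus bus", "reference": "Smith 1900", "year": 1900},
        {"name": "Aus bus", "reference": "Jones 1970", "year": 1970},
    ]
}


def resolve(identifier, names, usages):
    """Resolve /n/<id>, /t/<id> or /t/<id>#<year>, degrading to the next best answer."""
    path, _, fragment = identifier.partition("#")
    parts = path.strip("/").split("/")
    if len(parts) != 2 or parts[0] not in ("n", "t"):
        raise ValueError(f"unsupported identifier: {identifier}")
    kind, pid = parts
    if pid not in names:
        return None  # nothing at all is known about this identifier
    original = {"level": "name", "value": names[pid]}
    if kind == "n":
        return original
    known = usages.get(pid)
    if not known:
        return original  # no usages recorded: fall back to the original name
    if fragment:
        for usage in known:
            if str(usage["year"]) == fragment:
                return {"level": "usage", "value": usage}
    # No fragment, or no usage matching it: fall back to the full usage list.
    return {"level": "usages", "value": known}


print(resolve("/t/bus-smith-1900#1970", NAMES, USAGES)["level"])
```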

mdoering commented 4 years ago

The seagull Larus argentatus got split into Larus argentatus and Larus armenicus. There are 3 ids in iNaturalist for them, with separate ones for Larus argentatus s.s. and Larus argentatus s.l.:

https://www.inaturalist.org/taxon_changes?taxon_id=204533

Both Larus argentatus taxa share the same name and thus surely also the same protonym. Avibase might have even more concepts, but I don't immediately understand that webpage: https://avibase.bsc-eoc.org/species.jsp?lang=EN&avibaseid=F002188E226DF09C

mdoering commented 4 years ago

@rdmpage I was thinking something similar. Like in the Plazi timeline, you nail down the concept by the timestamp. But concepts also exist in parallel and do not follow a sequential timeline.

Also the main problem here for me is: how do I know, when looking at the data, that the concept has changed? Ignoring identifiers completely, as we want to (re)assign them in the process. What warrants a taxon change and when does it remain the same? We need to find objective rules for what the algorithm has to do. It also gives a real meaning to CoL ids, as they are based on objective rules.

mjy commented 4 years ago

For the record, we use the label "Protonym" to refer to a nomenclatural concept, and OTU to refer to a biological concept. Taxa/OTUs (biological things) are not Protonyms in the example below. Talking about biological things being Protonyms seems inherently confusing to me.

Given that, do this:

For what it's worth we have 100s of thousands of taxon names, OTUs, specimens, citations, and identifiers following this approach in TaxonWorks, i.e. it's not an imagined approach.