Define rules for a stable taxonID. Understanding when a taxon changes sufficiently to warrant an identifier change

My recollection of the AviBase model (which could be wrong) was that everything got a distinct taxon id (even if their 'computable' circumscriptions were identical). Subsequent articulations would establish they were congruent.

@rdmpage I was thinking similar. Like in the Plazi timeline you nail down the concept by the timestamp. But concepts exist also in parallel and do not follow a sequential timeline.

Also the main problem here for me is how do I know, when looking at the data, that the concept has changed? Ignoring identifiers completely, as we want to (re)assign them in the process. What warrants a taxon change and when does it remain the same? We need to find objective rules for what the algorithm has to do. That also gives a real meaning to CoL ids, as they are based on objective rules.

@mdoering I think "when does it change and when is it the same?" leads to madness. And it's separate to the identifier issue, in that at one level every taxon that includes a given protonym would have the same identifier. Every usage gets an identifier, so the question to ask is not "is this the same as that?" it's what usage (if any) do I want to refer to? I don't know of any particularly useful way to say whether a taxon is the same or not (that doesn't quickly lead to absurdity) but you can ask whether the taxa share types. I guess I'm arguing that any approach that asks either "what is a taxon" or "when are two taxa the same" is digging a hole for itself.

Usages are much more friendly to work with if you think of publications. Then they are immutable and you have a limited number of them per name. But for dynamic databases it becomes a different problem. The system has to decide when to create new identifiers. In this sense every monthly release of the CoL would be a different usage and has its own set of taxon ids. I doubt this is very useful to anyone. It is much like ALA and what CoL used to do. Identifiers change all the time. Names are actually more stable.

But what we want are ids that are more stable than the name ids: the taxon identifier should stay the same as long as the concept is the same, regardless of its accepted name. But that requires either a human to make an assertion or a machine to compare taxa for equality. There is no way we can get human assertions for millions of taxa every month. And they would also be very subjective, and the rules applied to judge would differ a lot.

My goal is to start with the type specimen as the anchor for a taxon, but refine that for splits and merges by comparing adjacent taxa and their types. If you have a globally complete taxonomy and compare several versions of it (1960/70/80 in the example above), a missing protonym for A. fus tells you the A. fus you are dealing with is from 1960. And the presence of A. bus as a pro parte synonym in 1970 for 'A. fus' tells us it's a split. So we know (1) is the union of concepts (2) and (3).
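A minimal sketch of that comparison, assuming each release is available as a mapping from a taxon label to the set of protonym ids in its synonymy. The function name, the labels and the ids are all made up for illustration; this is not CoL code. It only sees changes that move protonyms between taxa, so an s.l. vs s.s. difference with an identical type set still comes out as "same".

```python
def classify_change(old_types: frozenset, new_release: dict) -> str:
    """Compare one old taxon's protonym set against every taxon in a newer release."""
    # Taxa in the newer release that share at least one protonym with the old taxon.
    overlapping = {
        label: types for label, types in new_release.items() if types & old_types
    }
    if any(types == old_types for types in overlapping.values()):
        return "same"
    # Several newer taxa that together carry exactly the old protonyms: a split.
    if len(overlapping) > 1 and frozenset().union(*overlapping.values()) == old_types:
        return "split"
    # One newer taxon that carries the old protonyms plus others: a merge.
    if len(overlapping) == 1:
        (types,) = overlapping.values()
        if old_types < types:
            return "merged"
    return "other"


# Hypothetical data: one broad taxon in the old release, two taxa in the new one.
old_taxon = frozenset({"fus", "bus"})
new_release = {"A. fus": frozenset({"fus"}), "A. bus": frozenset({"bus"})}
result = classify_change(old_taxon, new_release)
```

Anything that returns "other" (partial overlaps, protonyms that disappear) would still need a rule of its own.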

The goal is to create stable taxon ids as anchor points to link identifications to. The current name can then happily change and if a split or merge happens the id will change and the identification is still referring to the old broader or narrower concept.

Well, the base for the taxon is the set of types. In the diagram this already identifies 4 out of 5 concepts. The A. bus s.l. vs s.s. is not covered and they would both fall into the same id. But maybe that is still a good start for CoL. It's simple, straightforward to implement and is definitely a step forward from name-based identifiers, which most systems have, including the current CoL and GBIF. And it's easy to communicate and, most importantly, it can be derived from the data.
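As a rough illustration of "the set of types is the concept", here is a sketch of a registry that hands out one opaque id per distinct set of protonym ids. The class name, the "T" prefix and the name ids are assumptions made for the example only.

```python
import itertools


class TaxonIdRegistry:
    """Hands out one opaque taxon id per distinct set of protonym ids."""

    def __init__(self):
        self._ids = {}
        self._counter = itertools.count(1)

    def id_for(self, protonym_ids) -> str:
        # The accepted name plays no role: only the set of protonyms is the key.
        key = frozenset(protonym_ids)
        if key not in self._ids:
            self._ids[key] = f"T{next(self._counter)}"
        return self._ids[key]


registry = TaxonIdRegistry()
first = registry.id_for({"n1", "n2"})   # taxon as seen in one release
again = registry.id_for(["n2", "n1"])   # same protonyms in a later release
narrower = registry.id_for({"n1"})      # after a split: a different set, a new id
```

The registry would have to be persisted between releases for the ids to stay stable; that part is left out here.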

Markus, thanks for those clear statements. This is the direction I also favor.

"Fundamentally doomed indeed. :("

Types frantically for 20 minutes, then deletes everything.

Can anyone answer whether a taxon concept sensu the title of this issue has biological attributes?

I.e. do you imagine that following triples are part of the system:

my_taxon_concept_uri has_some height
my_taxon_concept_uri has_color purple
my_taxon_concept_uri eats snails

No they clearly won't. No traits and description based circumscriptions are planned to be in CoL. And when I write about types we can manage type specimens, but I doubt we ever list them for all species. So using the protonym as a type proxy is what will be done.

@mjy

Can anyone answer whether a taxon concept sensu the title of this issue has biological attributes?

I.e. do you imagine that following triples are part of the system:

my_taxon_concept_uri has_some height
my_taxon_concept_uri has_color purple
my_taxon_concept_uri eats snails

I can't answer for the thread, but I only got into this now because this is the issue that arises in Wikidata. People are adding attributes like these to Wikidata "taxa" when it seems clear that many such "taxa" are names not taxa (in the sense that homotypic synonyms may have their own Wikidata items, so clearly "taxa" aren't always "taxa").

So I guess where you are going with this is what do we hang attributes on? I was imagining that, again, we could have a hierarchy of identifiers. You could just hang it on the name, or the protonym (in the sense of the complete set of usages of the name), or on a specific usage (equivalent to "sensu Fred").

@mdoering

Well, the base for the taxon is the set of types. In the diagram this already identifies 4 out of 5 concepts. The A. bus s.l. vs s.s. is not covered and they would both fall into the same id. But maybe that is still a good start for CoL. It's simple, straightforward to implement and is definitely a step forward from name-based identifiers, which most systems have, including the current CoL and GBIF. And it's easy to communicate and, most importantly, it can be derived from the data.

I wonder if part of the problem is the notion of "concept" and that each box in the diagram needs its own identifier of the same "class". Put another way, I would have three "paths" or timelines, one for each type. Three "protonym" identifiers, one for each. Each identifier points to the entire history of each type, and events along the way are marked on those timelines. Each one of those events gets an identifier ("usage"). So you can still refer to A. bus s.l. or A. bus s.s. by referring to a given usage. Now, some of these paths will intersect in the sense that someone may say that these two things are heterotypic synonyms, so the graph would need the option of having an edge between two paths (I think this is essentially what the Australian NSL does in their model).
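To make the timeline picture a little less arm-wavy, a small sketch of the data structure it implies: one timeline per protonym, usages as events on it, and an optional edge where two paths meet. All class names, ids and references below are invented for illustration, and this makes no claim about how the NSL actually models it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Usage:
    usage_id: str    # id for this one event on the timeline
    name: str        # the name as used, e.g. "A. bus s.l."
    reference: str   # the publication the usage comes from
    year: int


@dataclass
class Timeline:
    protonym_id: str                            # stable id for the whole path
    usages: list = field(default_factory=list)

    def record(self, usage: Usage) -> None:
        self.usages.append(usage)
        self.usages.sort(key=lambda u: u.year)  # keep events in time order


@dataclass(frozen=True)
class SynonymyEdge:
    from_usage: str    # the usage that makes the assertion
    to_protonym: str   # the other path it treats as a heterotypic synonym
    asserted_in: str   # the reference carrying the assertion


bus = Timeline("p-bus")
bus.record(Usage("u2", "A. bus s.s.", "Jones 1980", 1980))
bus.record(Usage("u1", "A. bus s.l.", "Smith 1970", 1970))
edges = [SynonymyEdge("u1", "p-fus", "Smith 1970")]
```

Someone who only cares about "A. bus" points at `p-bus`; someone who cares about s.l. vs s.s. points at `u1` or `u2`.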

I'm still waving my arms around here (can you tell? ;) ) but I do wonder if part of the problem is seeing things as boxes rather than as timelines.

@mdoering

Usages are much more friendly to work with if you think of publications. Then they are immutable and you have a limited number of them per name. But for dynamic databases it becomes a different problem. The system has to decide when to create new identifiers. In this sense every monthly release of the CoL would be a different usage and has its own set of taxon ids. I doubt this is very useful to anyone. It is much like ALA and what CoL used to do. Identifiers change all the time. Names are actually more stable.

Argh, no! Firstly, why change everything, why not simply add new ids for those things that have been updated? In other words, you're doing a diff and saying these three records changed, everything else is the same. Secondly, these things would (if they are changes to existing records) simply be a new usage along the path for that "protonym". The protonym itself would be stable. Someone linking at that level of resolution (e.g., "I don't care about the details, it's Drosophila melanogaster as far as I'm concerned") wouldn't be affected.

Put another way, we're talking about versioning, and it seems to me the only sensible way to deal with that is enable people who don't care about versions (which, I'd guess, will be 99% of people) to have something stable to point to (which is more or less what names have done for us), and give those who do care something they can point to.
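A toy version of that diff, assuming each release is a mapping from a stable protonym id to whatever the record says about it (here just accepted name and parent). The release contents and the function name are hypothetical.

```python
def changed_records(previous: dict, current: dict) -> dict:
    """Return only the records that are new or differ from the previous release."""
    return {
        protonym_id: record
        for protonym_id, record in current.items()
        if previous.get(protonym_id) != record
    }


# Two made-up monthly releases keyed by protonym id: (accepted name, parent).
july = {"p1": ("Aus bus", "Aus"), "p2": ("Aus cus", "Aus"), "p3": ("Aus dus", "Aus")}
august = {"p1": ("Aus bus", "Aus"), "p2": ("Xus cus", "Xus"), "p3": ("Aus dus", "Aus")}

changed = changed_records(july, august)
```

Only the entries in `changed` would get a new usage id; links made at the protonym level (`p1`, `p2`, `p3`) are untouched by the release.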

I was imagining that, again, we could have a hierarchy of identifiers. You could just hang it on the name, or the protonym (in the sense of the complete set of usages of the name), or on a specific usage (equivalent to "sensu Fred").

Yes, we will have different ids for a name and a usage. But we want to keep the CoL usage ids stable across versions (currently monthly), so we need to figure out what's considered still the same taxonomic concept. For names it's simpler, but even there it is not obvious, as it still requires a clear definition of a ScientificName (https://github.com/CatalogueOfLife/general/issues/35).

The current version of CoL only generates stable ids for names.

Argh, no! Firstly, why change everything, why not simply add new ids for those things that have been updated? In other words, you're doing a diff and saying these three records changed, everything else is the same.

That again raises the question of which properties exactly belong to a usage. We are dealing with a graph where everything is connected. What are the boundaries of the object to version? Is the entire classification included or just the parentID as in our model? What about children? What about the structured reference of the publication? Does every single distribution record, vernacular name etc. have an impact too? If that's all in, you easily end up with new versions all the time.

Secondly, these things would (if they are changes to existing records) simply be a new usage along the path for that "protonym". The protonym itself would be stable.

Are you sure the protonym is stable? It is in theory, because it has been published on paper. But in a database world? See above for a usage's boundary. Adding a vernacular name could break it. I guess it would rather be the protonym's name id.

Put another way, we're talking about versioning, and it seems to me the only sensible way to deal with that is enable people who don't care about versions (which, I'd guess, will be 99% of people) to have something stable to point to (which is more or less what names have done for us), and give those who do care something they can point to.

Yes

@mdoering I haven’t kept up with CoL’s data structure, but naively I would have said that the “concept” is the latest name + reference combination (e.g., A. bus + DOI:10.1234/xyz), and if there’s not a more recent usage then the id for the latest “concept” would be unchanged. I put “concept” in quotes because it seems that everyone has a different idea of what that is.

It’s also not clear to me who the intended users are, and what their expectations would be. Clarifying that would presumably affect which identifiers to expose.



@mdoering

Argh, no! Firstly, why change everything, why not simply add new ids for those things that have been updated? In other words, you're doing a diff and saying these three records changed, everything else is the same.

That again raises the question of which properties exactly belong to a usage. We are dealing with a graph where everything is connected. What are the boundaries of the object to version? Is the entire classification included or just the parentID as in our model? What about children? What about the structured reference of the publication? Does every single distribution record, vernacular name etc. have an impact too? If that's all in, you easily end up with new versions all the time.

I think this is a more general question about versioning graphs, and there’s a literature on that. Naively, I think in terms of edits between graphs, especially as this seems to capture the way taxonomists describe their work (e.g., “we created a new genus, and species x and y are transferred there” is essentially an edit script for transforming one graph into another). The other things you describe (publications, distribution records - really?) can either be treated as separate nodes, or as metadata (I gather there are ways to version graphs that treat node properties separately from nodes). But just because everything is connected doesn’t mean you can’t isolate changes in pretty much the same way you can do a diff on text to isolate the insert/delete/move events.
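As a sketch of what such an edit script could look like, assume each classification is reduced to a child-to-parent mapping keyed by stable node ids. The ids below ("Aus", "Bus", "x", "y") are placeholders, and a real graph diff would need more operation types than these three.

```python
def edit_script(old: dict, new: dict) -> list:
    """List the insert/delete/move operations that turn one parent map into another."""
    ops = []
    for node in sorted(old.keys() - new.keys()):
        ops.append(("delete", node, old[node]))
    for node in sorted(new.keys() - old.keys()):
        ops.append(("insert", node, new[node]))
    for node in sorted(old.keys() & new.keys()):
        if old[node] != new[node]:
            ops.append(("move", node, old[node], new[node]))
    return ops


# "We created a new genus Bus, and species x and y are transferred there."
before = {"Aus": "Fam", "x": "Aus", "y": "Aus"}
after = {"Aus": "Fam", "Bus": "Fam", "x": "Bus", "y": "Bus"}

script = edit_script(before, after)
# script == [("insert", "Bus", "Fam"),
#            ("move", "x", "Aus", "Bus"),
#            ("move", "y", "Aus", "Bus")]
```

Everything not mentioned in the script is, by construction, unchanged. Note this only works if the node ids are stable independently of the current name.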

Secondly, these things would (if they are changes to existing records) simply be a new usage along the path for that "protonym". The protonym itself would be stable.

Are you sure the protonym is stable? It is in theory, because it has been published on paper. But in a database world? See above for a usage's boundary. Adding a vernacular name could break it. I guess it would rather be the protonym's name id.

Put another way, we're talking about versioning, and it seems to me the only sensible way to deal with that is enable people who don't care about versions (which, I'd guess, will be 99% of people) to have something stable to point to (which is more or less what names have done for us), and give those who do care something they can point to. Yes

I guess I imagine a list where the protonym is the head of the list, and you append usages (name + reference) of any name connected to the type to that list. This list, combined with lists for other protonyms, will form a graph. Snapshots at a given time will be a classification. And again, I think you can separate node properties from nodes themselves.
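A rough sketch of the "snapshot at a given time" idea, assuming each protonym id maps to an append-only list of (year, name, reference) usages. The ids, names and references are invented for the example.

```python
def snapshot(timelines: dict, year: int) -> dict:
    """For each protonym, pick the most recent usage published up to `year`."""
    result = {}
    for protonym_id, usages in timelines.items():
        known = [usage for usage in usages if usage[0] <= year]
        if known:  # protonyms not yet published by `year` are simply absent
            result[protonym_id] = max(known, key=lambda usage: usage[0])
    return result


# Made-up usage lists: (year, name as used, reference).
timelines = {
    "p-bus": [(1960, "Aus bus", "ref-1"), (1995, "Xus bus", "ref-3")],
    "p-cus": [(1992, "Aus cus", "ref-2")],
}

in_1990 = snapshot(timelines, 1990)
in_2000 = snapshot(timelines, 2000)
```

Nothing is ever edited in place: a new treatment is one more tuple on a list, and older snapshots stay reproducible.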

Yes, this is all a bit arm-wavy, but I’ve built trees for different versions of the Clements Checklist of birds using eBird ids for species, and you can clearly isolate the changes by comparing the trees using a tree-based diff. Incidentally this is only possible because eBird keeps stable ids for species independently of the current name. I’m currently trying to do the same with the reptile database where, mercifully, there are internal integer ids for species that remain stable even if the name changes.

I guess I see taxonomy as essentially a series of distributed edits on a graph, and if we capture those then we have basically captured what taxonomists generate in their work.

As an aside, here's a screenshot of a comparison between two recent classifications of snakes from the Reptile database. I use specific epithet plus internal integer id to identify each species (the id doesn't change for the species). This particular difference shows moving species from one genus to another (a move operation); taxa in light gray haven't changed. There's some added complexity that means the genus name itself has to change, but hopefully you get the idea. So I would regard things like "yunnanensis-18606" to be "protonym-style" identifiers that are linked to the complete fate of names attached to the type for that name, and for which you could recover that this species has moved from Sinonatrix to Trimerodytes (and probably other moves if we go back further in time).

Personally I would postpone any discussion of whether a taxon is the "same", because I don't think there's a unique answer (what means "same"?). But if you have the history then you enable people to determine "same" or not for their definition of "same".

What I also like is we can attach a publication to this particular edit operation (i.e., the research that led to the move), so it is linked to evidence, and to the people who did the work.

[Screenshot, 2020-08-22: tree diff of two Reptile database snake classifications]

Having a connected graph doesn't make things impossible, but it needs a definition or specification of what we want to version. Surely you can ignore vernacular names, distributions and the ancestral classification. We just need to agree on what is relevant.
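One way to pin down "what is relevant" is to write it down as the list of fields that feed a fingerprint, and treat a record as changed only when the fingerprint changes. The sketch below assumes records are plain dicts; the field names and the choice of relevant fields are placeholders for whatever gets agreed on, not a proposal.

```python
import hashlib
import json

# Placeholder: the fields agreed to be part of the versioned object.
RELEVANT_FIELDS = ("name_id", "synonym_name_ids")


def fingerprint(record: dict, fields=RELEVANT_FIELDS) -> str:
    """Hash only the agreed-upon fields; everything else is ignored."""
    relevant = {}
    for field_name in fields:
        value = record.get(field_name)
        if isinstance(value, (list, set, tuple)):
            value = sorted(value)  # order of synonyms must not matter
        relevant[field_name] = value
    payload = json.dumps(relevant, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


base = {"name_id": "n1", "synonym_name_ids": ["n2", "n3"], "vernacular": ["red gull"]}
new_vernacular = {"name_id": "n1", "synonym_name_ids": ["n3", "n2"], "vernacular": ["gull"]}
new_synonym = {"name_id": "n1", "synonym_name_ids": ["n2", "n3", "n4"], "vernacular": []}

same = fingerprint(base) == fingerprint(new_vernacular)
different = fingerprint(base) != fingerprint(new_synonym)
```

With bulk uploads of thousands of records, comparing fingerprints per record is enough to find the ones that changed in a way that matters.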

And for CoL we want to provide one definition of a concept. Even if it is not universal and there will be legitimate reasons to have others ids, I strongly believe it is very useful to have a global taxonomy with some sort of stable taxon identifiers that people can hang things like identifications on as long as the definition of what the id refers to is clear to everyone. And thus when it changes or not.

Having a stable protonym anchor is nice, but it's too imprecise for many purposes (splits/merges). And having a long list of usages is again inflationary as long as we do not have concept relations between them. My goal is to provide identifiers for linking information that lie in between the two and are more stable than name-based ones.

@mdoering where I'm going with this is that if you (or the CoL team) can't answer that question concretely/precisely then you will never get further towards answering this question.

Even if you can answer this question concretely I'm around 99% positive that you can not do what you want to do given the data you are given by the GSDs (no biological data, most without OTU ids). You have been tasked to do the impossible by the CoL. I'm serious. How, possibly, can you provide something more stable than the incoming data if those data don't have the requisite stability in the first place? This seems so obvious it's frustrating. We get data without the needed facts coming in, we mix it up, and VOILA BETTER FACTS. HUH!?

IMO @rdmpage is nailing all the key AHAs:

Don't think of the system as changing in the sense of editing one node and turning that node into another; think of it as accumulating facts over time, always adding new nodes. Names don't become valid, or invalid. They are both, at different times. Names don't split or merge, they are used in different ways with reference to a citation.
As Rod alludes, you cannot, as I mention above, figure out sameness of a biological taxon unless a) it is asserted with OTU ids by the provider (nobody does this) or b) you have an algorithm that computes across biological data. You certainly should not go there (again, what you specifically are tasked with for GSDs is an impossible task).
Nobody has implemented Franz' system to scale, which is what you'd need if you wanted to discuss biological merges etc. Most people use the proxy of cited history of protonyms with a 1:1 basis of biological entities. However, for merging across different datasets this does not hold. I.e. within curated assertions by curator(s) they by proxy to the nomenclature relationships map to biological concepts, but outside that set of curated assertions that proxy doesn't hold. IMO Rod is absolutely correct that taking the step to manage merges (using Nico's system or others) is a HUGE amount of hard work that almost nobody will do unless we have radically new software, and even then it's unlikely.
Protonyms (monomials) need IDs, and in our world (TW) they are stable, and in the world you want they need to be stable IMO. This is not "arm wavy"- it's exactly what we do in TW. They need to be citable, and their combinations need to be citable, and assertions between protonyms need to be citable. Being citable means linking them to a timeline by proxy of the year of publication linked in the Citation. So too do OTU ids need to be stable. Without curation workbenches that do this for you, you are, like I mentioned above, up *@$ creek. Even if those workbenches do it for you you still have issues with merging across GSDs that I suspect can not be resolved without specific new assertions or fun biological computation.

Perhaps this belongs in more of a blog format, but the coffee is flowing, so I'll post it here.

Reading back @rdmpage says " I was imagining that, again, we could have a hierarchy of identifiers.". I think we agree, but I look at it from a different angle. You don't even need a hierarchy bit (at the core), you just need an anonymous ID for stability. In WikiData the Q1234 is just fine, or maybe B12354 (where "B" is biological concept). Mint it, and surround it with facts. Any hierarchy can be added as an assertion, but it's not a central organizing principle. There, you have stability in as much as WikiData is stable, around which a concept can grow. The bonus- the WikiData identifier is resolvable, and the data there collapse down to computable statements. That concept has relationships to names, biological data, other "B"s. That's it. If people reference the Q12345, the concept will strengthen, if they cross-reference it to another system of identifiers, it will further strengthen. There is nothing magic here.

IMO things to avoid if you want to make it better:

Don't bother trying to enforce (or even espouse) one identifier per taxon concept. Plurality is reality.
However, attempt to bias the use of one identifier per concept, by doing good science. Build the strength of a concept (in the broad sense) by giving it rich context. With rich context comes eyes. With eyes come improvements to the data. With improvements to the data comes -> "That QID is good enough for me, I'll use it in my workbench, because it seems useful". Now that it's in my workbench (which also references many other identifiers the curator is interested in, but who cares), I've created a richer context. Iterative/cyclical improvements.
When a new Q concept emerges that seems to be a biological taxon, quickly attach biological data to bias that concept to being thought of as biological, rather than some weird Frankenstein of biology and nomenclature.
Be OK with deprecating Qs, but do that at an external organization level. Imagine Q1 -> a gull, and Q2 -> a gull almost identical to Q1; some think it is the same, some think it isn't. Both can exist, that's fine. An external org/agent etc. makes a decision to reference one or the other, or mint a 3rd if they want. When an agent/org mints a new list, they reference Qs, and that reference builds a set of data that is returned to WikiData. Now we have richer context (everyone is using Q1, but almost nobody references Q2). We can ask why, etc., and refine or mint new Qs, or we can just drink the Kool-Aid and accept Q1, because everyone else is doing it, and we trust them. IMO the only way out of this approach and its problems is to compute on the facts (Qs attached to Qs that are specimens, not Qs that seem to be biological taxa).
Never, ever, ever embed information in the identifier, even prefix "B" (biological concept) over "Q" (thing) is likely a bad idea. Years in the ID? Terrible. Hierarchy or nestedness? Ugh. Relationship belong in object properties (links between instances). Identifiers with biological names in them are the absolute worst. People have to internalize that identifiers point to concepts which are the nucleus around which data accumulate, nothing more. Note that WikiData uses Q, and a couple other prefixes, that's it. That should be a big hint to those thinking about identifiers. This approach will only be learnt by teaching upcoming generations of students/workers.

I love the idea of seeing WikiData IDs seep into all the nooks and crannies, they are so simple. We just have to build the practical interfaces to it such that curators/taxonomists/scientists can draw from those IDs, and integrate them into concepts they work with on a day-to-day basis.

Getting back to CoL. What could be done?

As a matter of policy, encourage, slowly, but ultimately more forcefully, GSD providers to provide OTU ids. The CoL is after all a list of biological species. If GSDs can't assert the circumscription of biological entities, but rather just nomenclatural relationships, they aren't doing their job.
As a matter of policy, encourage, slowly, but ultimately more forcefully, that those GSD OTU ids be WikiData Q numbers.
As a matter of policy, reject the concept that we will always have data from providers that do not have OTU ids. It is not OK for the CoL, IMO, to accept that some providers will just provide Word documents until the end of time. Make it an educational policy, with support from the CoL, to get the data out of those formats, and into one of the many possible better alternatives. I feel, to date, there is far too much complacency in this regard.

IMO any other effort by the CoL is treating the illness, rather than going for the cure.

@mjy we want an algorithm that computes concept equality on the basis of stable name ids and the homo- and heterotypic synonymy given by a GSD. TW is very different in that it is an editorial system. Versioning is simple when you can intercept record-based changes. But imagine every change is done by bulk uploading thousands of taxa and names. You need to figure out what has changed and whether it's a relevant change, unless you want to version each and every record all the time.

You can rely on stable ids from outside (WikiData, IPNI, Avibase, GSD IDs such as in WoRMS or TW, you name it) that the GSDs (re)use and then blindly trust them. But this is wishful thinking right now and we would have to drop large parts of the catalogue. The CoL is an established project that we need to continue. Even if we trusted ids from the outside they would not follow the same rules and be very different in what they mean. The CoL is an aggregation of heterogeneous sources.

That's why we decided to issue our own CoL ids (as CoL always did), based on some computable algorithm. The Taxon ID discussed here is something we have not started with, so details will only come up once we do so next year.

And really it's the same for name ids. Does any change to the record generate a new id, or do we attach ids to the idea of a published name (usage) that is fixed, but for which we can change the name record's "metadata"?

And like I said above: The basic version of such an algorithm would just look at the set of types included in the synonymy to define the concept. And in the absence of good type coverage the protonyms will be used as type proxies. Such ids might not be perfect, but have a clear definition, are stable and an improvement over pure name ids (which we also have as a different way to link to CoL).

@mdoering "then, blindly trust them. But this is wishful thinking right now" - So no identifier is good enough, so you'll mint your own, based on data that contains "identifiers/names" that are not good enough, and some algorithm that pulls new facts out of the air. Then, on top of that you are then asking others to trust your new identifiers and the decisions that come from them... but not those other ones. I see no problems there ;).

You're still thinking of "changes". There is no versioning, it's only accumulated facts, that's the principle TW uses. This is precisely the core of the data model CoL ultimately needs. How it populates that model is the real tricky bit (thus this issue). My argument, and I'll drop it, is that you can't get much farther than you do right now unless providers improve their data.

What do you mean by type? Specimens? Type specimens don't define biological concepts, that is an old, well-known fallacy. Type specimens anchor name priority, that's it. It's a different edge in the model (Specimen -> Name, it has nothing to do with Specimen -> OTU/Biological concept). Overloading their meaning will lead to nothing but pain in the long run ;).

All: I woke up this morning to an inbox full of really interesting and exciting posts within this thread. You all know me well enough to know that I cannot remain silent. So the only practical option for me was to read the thread in sequence, and comment accordingly. Apologies in advance for re-stating points already made (think of them as "+1"s), and for the length.

@rdmpage :

except I would junk 7 and 8.

That is basically the same conclusion I've come to over these past couple of years. It may eventually be possible to develop these areas (identifiers/classes for circumscriptions and broader "concepts"), and/or maybe other groups are better suited to pursue it than the usual gang of suspects (myself included) that keep repeating these conversations across many years. I think they have potential value, and I wouldn't shut the door on them completely. But I think we need to walk before we can run, and at the moment we're (still!) in the transition stage between crawling and walking. There have been a bunch of conversations along these lines in recent months among the tdwg/tnc group.

So, this leaves #5 and #6, namely "protonyms" and "usages" (I'm taking #1 - #4 as essentially given, maybe subject to tweaks).

Yup. Same here. Reaching back to the language I used in related comments, Protonyms are the content, and Usages are the context. Both are the same class of "Thing", because both have the exact same properties. However, distinguishing Protonyms (as a subset or subclass of all Usages; see tdwg/tnc discussion) is useful not because they represent a distinct "thing", but because they can serve in a special-case (and fundamentally important) kind of relationship with other Usages. This solves the issue raised in your recent iPhylo post. But please don't use the word "species" in this context (i.e., "...the importance of stable identifiers for species...", etc.). For every ten people who read your post, there will be 12 different ideas about what that word means in this context.

First, every protonym gets a nice, human-readable identifier, for example a combination of species epithet, author, and year.

Sure. Aus bus Linnaeus 1758. If you want to be really consistent and unambiguous and explicit, you would structure that identifier as Aus bus Linnaeus 1758 sec. Linnaeus 1758. There are pros and cons to qualifying protonyms that way, which I'd be happy to elaborate on in another post, if asked. In either case, we should call it a "canonical name-string" or something like that, so that it's immune to spelling variants, qualifiers, abbreviations, etc. that might have been represented on the actual page within Linnaeus 1758. But please, PLEASE don't assume that our electronic database systems will use these same human-friendly identifiers for internal identification purposes (e.g., foreign keys, or even urls). That would be a really bad mistake (see below).

Linked to this identifier is every homotypic synonym of that name ... This is essentially #5 (I think).

Yup. Exactly. See: 10.5281/zenodo.59790

Then imagine that same identifier is linked to every "usage" (name + reference pair) that we consider to be relevant, including heterotypic synonyms. This would enable a user to generate things like the current name and all synonyms, as well as go back and generate a snapshot of what the taxonomy was in, say, 1990. I think this is basically an aggregation of #6, and is close to the notion of a taxon concept being an "according to" statement.

WOW! FINALLY! Do you have any idea how long I've been waiting for someone else to write something like that? Seriously... THANK YOU!

One could imagine an interface (both web and API a bit like): ... /n/aus-fred-1909

Ugh. OK, well I can certainly imagine a service that takes those three parameters (epithet name, author, year) and finds how many matches there are. If only one match, it could function as an identifier and provide the relevant record. But based on content already in GNUB (202K Protonyms initially established as full species), about 7,000 (~3.4%) are non-unique across these three property values (original epithet orthography, authorship string, year). Granted, that's a small percentage -- but even 96.6% unique is pretty pathetic in the realm of "unique identifiers". (Fun fact: the author Malm described 24 different species with the name "linnei" in 1877; per ZooBank).
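The uniqueness check itself is simple to sketch. The rows below are invented stand-ins (the genera are placeholders, and only two of the "linnei" records are shown), and the field names are assumptions, not the GNUB schema.

```python
from collections import Counter


def ambiguous_keys(protonyms) -> dict:
    """Return every (epithet, author, year) triple shared by more than one protonym."""
    counts = Counter((p["epithet"], p["author"], p["year"]) for p in protonyms)
    return {key: n for key, n in counts.items() if n > 1}


# Made-up rows; the original genus is what tells the two "linnei" records apart.
rows = [
    {"epithet": "linnei", "author": "Malm", "year": 1877, "original_genus": "Aus"},
    {"epithet": "linnei", "author": "Malm", "year": 1877, "original_genus": "Bus"},
    {"epithet": "bus", "author": "Linnaeus", "year": 1758, "original_genus": "Aus"},
]

clashes = ambiguous_keys(rows)
```

Any key that shows up in `clashes` cannot serve as an identifier on its own, however readable it is.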

As I've pointed out many times, the amount of complexity needed to come up with an identifier for this sort of thing that is both human-friendly and unique vastly exceeds the complexity of having opaque identifiers (e.g., UUIDs) that are used by the computer for true identification, and then simply renders the results back to humans with a human-friendly label.

But that aside, yes -- we've already built and tested services of the sort you described. But the funding ran out before we were in a position to turn them into accessible APIs. That circumstance is changing (rapidly), so we may get these APIs up and running after all. Watch this space.

Everything else (actual "content" of each taxon, implications for characters of taxa, etc.) are all things one could compute from the classification if you wanted, but I think these are really separate things.

I absolutely, 100% agree!

If, for example, the identifiers were DOIs, clean and human readable

I know you love human-friendly identifiers, and I get that. But life is SO much easier if you have computer-friendly identifiers, then represent them via human-friendly labels whenever human eyeballs are in play. DOIs are WONDERFUL because of the rich dereferencing/resolution services. But they suffer the same fate as PURLs and other similar sorts of identifiers in that they conflate identification with dereferencing/resolution mechanisms. The best of all worlds can be achieved when you mint UUIDs as identifiers, then wrap them in a DOI prefix (making them dereferencable/resolvable), and then create a standard format for constructing a human-friendly label. The PLAZI/Zenodo team almost gets it right, in that they issue UUIDs to Usages (=Treatments), then Zenodo mints DOIs for them. Unfortunately, Zenodo doesn't embed the UUID within the DOI, so we have yet another identifier to track. For example: http://treatment.plazi.org/id/03EA878F-FF95-FFA5-4F81-1B00FB0E6CA9 sameAs http://doi.org/10.5281/zenodo.3806768

Sigh....so close....
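A rough sketch of what I mean by separating the three roles (identify, resolve, label). The DOI prefix below is a made-up placeholder, not a registered one, and the function names are purely illustrative.

```python
import uuid

DOI_PREFIX = "10.99999"  # hypothetical prefix, for illustration only

def mint_identifier():
    """The opaque identifier the computer uses for true identification."""
    return str(uuid.uuid4())

def as_doi(identifier):
    """Wrap the same UUID in a DOI so it becomes resolvable, without minting a second id."""
    return f"{DOI_PREFIX}/{identifier}"

def label(name, author, year, sec):
    """Human-friendly label, rendered only when human eyeballs are in play."""
    return f"{name} {author} {year} sec. {sec}"

ident = mint_identifier()
print(as_doi(ident))
print(label("Aus bus", "Linnaeus", 1758, "Linnaeus 1758"))
```

The point of the design is that the UUID is the only thing stored as a key; the DOI and the label are both derived from it.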

@mdoering :

There are 5 concepts in those 6 usages in the diagram which I would really like to attach 5 different ids to.

I believe I recognize the handwriting/chicken-scratch in the whiteboard diagram as my own (and I certainly remember the animated discussion). The problem is not that we don't have enough identifiers. The problem is that we have too many. All of the boxes on that diagram get a separate TNU identifier. The problem is that there are dozens/hundreds/thousands of potentially relevant other TNUs for congruent circumscriptions, each with their own identifiers, and it's not clear which one to use. For example, there may be many publications that all represent the same set of heterotypic protonyms as shown in box 5 of the whiteboard diagram. Which one becomes the "identifier" for that box? The chronologically first to synonymize A. xus under A. bus? The one who provided the most robust taxonomic treatment? What people seem to want is a single identifier for "Box 5", which may serve as hub for dozens of individual TNUs all asserting the same/congruent taxon. This is where "TNU as surrogate for concept/circumscription" gets messy, and requires a third party to "elevate" one of the TNUs as the surrogate/proxy for the "box". But in any case, the problem isn't a lack of identifiers to use for concepts. The problem is an overabundance of them.

Also the main problem here for me is how do I know when looking at the data that the concept has changed? Ignoring identifiers completely as we want to (re)assign them in the process. What warrants a taxon change and when does it remain the same? We need to find objective rules what the algorithm has to do. It also gives a real meaning to CoL ids as they are based on objective rules.

Here is where I think we keep getting hung up. In order for "a concept" to "change", we need to come to some agreement as to what "a concept" is. How can you know whether it has "changed" if you don't even agree on what it is? Walter Berendsohn used the term "Potential Taxon", for what I called "Assertion", and which we now refer to as [Taxonomic Name] Usages. Every TNU represents a potentially different taxon (concept/circumscription). But depending on how one defines "taxon" (i.e., my #7, which both @rdmpage and I have decided is not tractable - at least not at this time), different people would use different mappings of which individual TNU instances map to which individual "taxa". So to say that "a concept" has "changed", we first need a definition for what "a concept" is, and even after we achieve that, it's often the case that insufficient information exists (within the publications, within our databases) to even know if the concept has changed. In theory, this would be wonderful. In practice, it's going to be a while before it can be meaningfully implemented. I think @nfranz understands this realm far better than anyone else, so I would defer to him on that point -- but the sort of stuff he has done explores the potential/power/limitations of this space. Personally, I find it both exciting and scary at the same time.

Reading further down the thread, I think @rdmpage nailed it with:

I think "when does it change and when is it the same?" leads to madness.

He also nailed it with this:

Every usage gets an identifier, so the question to ask is not "is this the same as that?" it's what usage (if any) do I want to refer to?

+1 (is it possible to add a "+5"?)

@mjy:

Link your biological data to OTUs (anonymous entities linked to nomenclature)

I would say "Link your biological data to TNUs" (each of which represents an explicitly defined or implicit OTU). Are we saying essentially the same thing? The nice thing about doing it through TNUs is that's often how it happens in the real world. Someone has an organism in-hand (biological data), and assigns it to a name by referring to some (usually published) definition of the name (field guide, key, etc.). The exceptions are the expert taxonomists who just "know" what species it is. But in such cases, they simply need to point to a TNU that represents the taxon in the same way they "know" it to be.

Stack citations, however many you want on any concept (e.g. OTU, Protonym, Franz graph relationship, relationship between OTUs, relationship between Protonyms, etc.). This is your timestamp proxy.

OK, so maybe we're not the same. I've recently had very long discussions with Kevin Thiele about exactly this issue (we even refer to it as "stacks" of TNUs aligned on a single "concept"/"circumscription" instance). But see my comment to @mdoering above: coming up with a shared definition for what these name-less taxon entities are, is the real barrier.

Flesh-and-blood-and-cellulose-and-cytoplasm Organisms exist in nature. Taxa do not. Taxa exist in the minds of humans. Humans communicate information about taxa (and the mappings between their imagined taxa and actual organisms) via text-string names usually embedded within publications (or other references). The text-string names are usually what get indexed in databases. But the name-in-context (e.g., "Aus bus Linnaeus 1758 sensu Pyle 2020"; AKA a TNU) is the most effective and practical way to reference the interface between names, organisms and OTUs/taxa.

For what it's worth we have 100s of thousands of taxon names, OTUs, specimens, citations, and identifiers following this approach in TaxonWorks, i.e. it's not an imagined approach.

Substitute "GNUB" for "TaxonWorks", and I can make exactly the same assertion (and more than just specimens -- in fact, most of the organism occurrence instances are observations).

Back to @mdoering:

Usages are much more friendly to work with if you think of publications. Then they are immutable and you have a limited number of them per name. But for dynamic databases it becomes a different problem.

Yes, which is why I separate out the static TNUs from the dynamic Meta-Authority assertions. See, again, this publication, page 34, starting with the heading "Accepted status".

The system has to decide when to create new identifiers. In this sense every monthly release of the CoL would be a different usage and has its own set of taxon ids.

Not necessarily. Even if you can't stomach the Meta-Authority approach (where a new identifier is needed only when a particular perspective changes), you can just only issue a new identifier when it changes in substance (different synonymy, different classification, change in circumscription, etc.; more detail below) from one month to the next. Effectively each month's cut becomes a change log. The cut can include the full dataset, but the identifiers only change when the relevant content changes. You still need to define what properties within CoL warrant a new identifier; but I would suggest that you only change the identifier when the classification changes (including placement of a species epithet in a different genus), or when the set of heterotypic synonyms changes. If you try to get more granular than that, I think you'll be on the path to madness that @rdmpage alluded to.
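As a sketch of that rule (and only a sketch): compare each record in this month's cut with last month's and mint a new identifier only when the parent or the heterotypic synonym set differs. The record keys and field names here are assumptions, and it presumes you already have some stable key to match records across cuts.

```python
import uuid

def substance(record):
    """The parts of a record whose change warrants a new identifier."""
    return (record["parent"], frozenset(record["heterotypic_synonyms"]))

def assign_ids(previous, current):
    """previous: {key: record with "id"}; current: {key: record without "id"}.
    Returns {key: id}, reusing the old id when the substance is unchanged."""
    ids = {}
    for key, record in current.items():
        old = previous.get(key)
        if old is not None and substance(old) == substance(record):
            ids[key] = old["id"]
        else:
            ids[key] = str(uuid.uuid4())
    return ids

previous = {
    "bus": {"id": "T1", "parent": "Aus", "heterotypic_synonyms": {"xus"}},
    "cus": {"id": "T2", "parent": "Aus", "heterotypic_synonyms": set()},
}
current = {
    "bus": {"parent": "Aus", "heterotypic_synonyms": {"xus"}},  # unchanged
    "cus": {"parent": "Bus", "heterotypic_synonyms": set()},    # placed in another genus
}
ids = assign_ids(previous, current)
```

Here "bus" keeps `T1` and "cus" gets a fresh identifier, so the list of newly minted ids doubles as the month's change log.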

CRAP! I just got to the post from @rdmpage that includes:

Firstly, why change everything, why not simply add new ids for those things that have been updated? In other words, you're doing a diff and saying these three records changed, everything else is the same.

OK, replace all of my paragraph above with "+1" on that post from @rdmpage . I could have deleted it, but what the hell -- maybe it says the same thing in a slightly different way.

My goal is to start with the type specimen as the anchor for a taxon, but refine that for splits and merges by comparing adjacent taxa and their types.

Yes, this comes back to the conversation we had in the living room of @dremsen. Use heterotypic synonymy sets as your computable mapping to when a new identifier is needed (i.e., protonyms as proxies for type specimens). This is imperfect, of course, when you don't have heterotypic synonyms listed, or when you need to divine the relationship between an earlier treatment and a later treatment (in the diagram, Aus bus sec. 1960 to Aus bus sec. 1970). But honestly -- without a @nfranz -style analysis (which itself is still ultimately subjective), you can't ever know whether Aus bus sec. 1960 maps to Aus bus sec. 1970; or maps to [Aus bus sec. 1970 + Aus fus sec. 1970]. In other words, you can't know from the data we generally have at our easy disposal whether Aus bus sec. 1960 was "split" into Aus bus sec. 1970 + Aus fus sec. 1970, or whether Aus bus sec. 1960 is congruent to Aus bus sec. 1970. Someday, when the @nfranz approach has been fleshed out across all of taxonomy, then these sorts of questions will be computable. But until then, it's probably best not to go down that rabbit hole.

Ooops!! I just now read the next post:

Well, the base for the taxon is the set of types. [etc.]

I almost deleted the stuff I wrote above as redundant, but you can instead just treat it as a "+1".

@dremsen:

This is the direction I also favor.

I hope so! It was your living room, after all! :)

As to the set of posts related to this from @mjy:

Can anyone answer whether a taxon concept sensu the title of this issue has biological attributes?

@mdoering already answered exactly the same way I would, so I'll simply say +1 to his reply.

@rdmpage:

I'm still waving my arms around here (can you tell? ;) ) but I do wonder if part of the problem is seeing things as boxes rather than as timelines.

Another "WOW!" (+5). My arms are downright exhausted from years of waving in the exact same way. So, I already addressed this a bit above, but if those boxes represent specific/individual TNU instances, then I'm 100% onboard. If they represent abstract notions of name-independent taxa into which stacks of TNUs are folded, then I start to get a bit more dizzy. Again, I think the "set of heterotypic synonyms using protonym identifiers as proxies for type specimens" approach is (by far) the best path forward. Yes, some of the s.s. vs. s.l. distinctions will fall through the cracks, but those can be addressed later when we all catch up to @nfranz on this stuff. Whether we need to mint singular identifiers (of a different class) to represent sets of ProtonymIDs (vs. simply using the array of heterotypically synonymous ProtonymIDs as itself the mechanism for uniquely identifying the boxes) is, I think, an implementation question. I'd only advise exercising caution before dumping a new class of identifiers on the world, because you know it will be badly misunderstood and misused by the masses.
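To illustrate the "array of ProtonymIDs as the key" option, here is a small sketch that stacks TNUs by their set of protonyms without minting any new class of identifier. The TNU records and ids are invented for the example.

```python
from collections import defaultdict

def stack_by_protonym_set(tnus):
    """Group TNU ids by the (unordered) set of ProtonymIDs each one includes."""
    stacks = defaultdict(list)
    for tnu in tnus:
        stacks[frozenset(tnu["protonyms"])].append(tnu["id"])
    return dict(stacks)

# Hypothetical usages: two later treatments both include xus under bus.
tnus = [
    {"id": "tnu-1960", "protonyms": ["bus"]},
    {"id": "tnu-1970", "protonyms": ["bus", "xus"]},
    {"id": "tnu-1980", "protonyms": ["xus", "bus"]},
]
stacks = stack_by_protonym_set(tnus)
```

The 1970 and 1980 usages land in the same stack because the key is a set, so order and accepted name play no role; which TNU in a stack gets "elevated" remains a separate decision.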

Back to @mdoering:

Yes, we will have different ids for a name and a usage.

If by this you mean "different classes of identifiers", PLEASE consider this carefully. I went through a LOT of painful mistakes when I did the same thing back in the 1990s; and when I saw the light that Protonyms are a subset of TNUs, the implementation side got MUCH easier.

Protonyms ARE TNUs; they're just a special subclass of TNUs. They have the same properties as TNUs. In 99% of cases, the Protonym is of the form "Aus bus Linnaeus 1758 sec. Linnaeus 1758" (there are exceptions, but mostly confined to old names that were first established in a non-Code-compliant way, then made available later -- this is something that should remain within the realm of nomenclators).

If you start minting different identifiers for the "Protonym" of Aus bus Linnaeus 1758, separate from the "Usage" Aus bus Linnaeus 1758 sec. Linnaeus 1758, you will almost certainly regret it. At first glance it seems like the same identifier means different things depending on whether you're referring to the Protonym of the name "bus", or the taxon concept asserted by Linnaeus in Aus bus Linnaeus 1758 sec. Linnaeus 1758; but I promise that this distinction is just an illusion. It would require more text than I've already written above to explain why this is so. But I can share some of the LONG emails I had with Kevin Thiele, if you want.

That raises again the question which properties exactly belong to a usage.

Just to continue and expand from what I already wrote above, I have been using these four properties to represent a "change" in an objectively identifiable way:

1) Classification (i.e., immediate hierarchical parent; not the full hierarchy to the top)
2) Set of ProtonymIDs representing heterotypic synonyms
3) Rank (e.g., full species vs. subspecies) -- this is essentially redundant to #1, but not always (e.g., when you go from Aus bus subsp. cus to Aus bus var. cus)
4) Orthography (exact literal UTF-8 representation of the epithet only; not the combination)

You could also add:

5) Reference/TNU used as an anchor point/basis -- such as when a new publication comes along that doesn't change any of the four properties above, but provides a much more robust diagnosis/etc. and thus represents a "meatier" foundation. But for computational purposes, this doesn't really add anything. For end-users, it might (and that would also bring it a step closer to the Meta-Authority model).
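One way to make the four core properties computable is to fold them into a fingerprint and treat any difference in fingerprint as a "change". This is a sketch under my own assumptions (parameter names, JSON plus SHA-256), not a description of how GNUB or CoL does it.

```python
import hashlib
import json

def fingerprint(parent_id, protonym_ids, rank, epithet):
    """Hash of the four properties: parent, protonym set, rank, epithet orthography."""
    payload = json.dumps(
        [parent_id, sorted(protonym_ids), rank, epithet],  # sorted: a set has no order
        ensure_ascii=False,                                # keep the literal UTF-8 epithet
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

a = fingerprint("Aus", {"p-bus", "p-xus"}, "species", "bus")
b = fingerprint("Aus", ["p-xus", "p-bus"], "species", "bus")      # same set, other order
c = fingerprint("Aus", {"p-bus", "p-xus"}, "subspecies", "bus")   # rank differs
```

`a` and `b` are equal, `c` is not. The fingerprint is only a change detector; I would not expose it as the identifier itself.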

On the whole "versioning" thing, I think the immediate/important questions most people want to answer are:

1) What is the status right now from the perspective of my favorite/trusted Meta-Authority (e.g., CoL)?

2) What are the various perspectives in the literature for a given Protonym over its past history (including the alternative "current" treatments/views that differ from my favorite/trusted Meta-Authority)?

I think most people are a lot less concerned with "What is the history of how my favorite/trusted Meta-Authority has changed its views over time?" Sure, that information should be tracked, and is interesting in some contexts, but it seems more of an implementation thing. The "versioning" approach is one way to do it, but that requires new identifiers. The way GNUB handles it is with a robust audit trail (literally every change of every field in every record is logged with a timestamp and responsible party, so there is no "version" per se, just a timestamped change log for each record).
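For anyone who hasn't worked with that pattern, here is a toy illustration of a per-record change log. It is not GNUB's actual schema; the class and field names are invented, and the optional `when` argument exists only so the example is reproducible.

```python
from datetime import datetime, timezone

class AuditedRecord:
    """A record whose identifier never changes; every field edit is logged instead."""

    def __init__(self, record_id, **fields):
        self.record_id = record_id
        self.fields = dict(fields)
        self.log = []

    def update(self, field, value, who, when=None):
        old = self.fields.get(field)
        if old == value:
            return  # nothing changed, nothing to log
        self.log.append({
            "when": when or datetime.now(timezone.utc),
            "who": who,
            "field": field,
            "old": old,
            "new": value,
        })
        self.fields[field] = value

    def as_of(self, moment):
        """Rebuild the field values as they stood at `moment` by undoing later edits."""
        state = dict(self.fields)
        for entry in reversed(self.log):
            if entry["when"] > moment:
                state[entry["field"]] = entry["old"]
        return state

record = AuditedRecord("r1", parent="Aus")
record.update("parent", "Bus", "editor", when=datetime(2020, 6, 1, tzinfo=timezone.utc))
earlier = record.as_of(datetime(2020, 1, 1, tzinfo=timezone.utc))
```

`earlier["parent"]` is still "Aus", so any past state can be reconstructed without ever issuing a "version" identifier.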

@mjy :

You have been tasked to do the impossible by the CoL.

In some senses I agree, but there is a really, really, really simple thing that CoL can at least encourage GSDs to do, and implement itself when the content exists (e.g., content through WoRMS and other robust GSDs), which is simply to track one more piece of information for each record: "Reference we follow in making our assertion about current status". In other words, the bit after the "sensu". If you can just get that much information, it would be a quantum leap in the utility of the data CoL provides. And even if only a minority of content providers can offer this information, you can always skip that step with a place-holder "sensu somebody, but we're not sure who" approach, so at least the operational data model is functioning at the TNU level, not just the Protonym (or vague "name") level.
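In data-model terms this is about as small as a change gets. A sketch, with a made-up class name and placeholder value:

```python
from dataclasses import dataclass
from typing import Optional

UNKNOWN_SEC = "unknown"  # hypothetical place-holder for "somebody, but we're not sure who"

@dataclass(frozen=True)
class Usage:
    """A name plus the reference followed for its current status (the bit after 'sensu')."""
    name: str
    authorship: str
    sec: Optional[str] = None  # None when the provider cannot say which reference it follows

    def label(self):
        return f"{self.name} {self.authorship} sec. {self.sec or UNKNOWN_SEC}"

with_reference = Usage("Aus bus", "Linnaeus 1758", sec="Pyle 2020")
without_reference = Usage("Aus bus", "Linnaeus 1758")
```

Both records are usages, so everything downstream can operate at the TNU level even while the `sec` field is still being filled in.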

A big "+1" on all the rest of what was included in this post from @mjy (as well as several "+3"s and "+5"s!)

Also, LOTS of "+1"s, "+3"s and "+5"s (especially "Never, ever, ever embed information in the identifier...") in your follow-up pseudo-blog post.

As a matter of policy, encourage, slowly, but ultimately more forcefully, GSD providers to provide OTU ids.

I'm not sure it's the same, but I've been pushing hard (including above) for CoL to get the GSDs to provide a reference anchor point for each asserted "current status". We should move beyond the approach of "sensu GSD Year", and move towards "sensu Publication". Most GSDs are not practicing actual taxonomy within their databases; rather their databases usually serve as value-added indexes of what's happening in the literature.

that those GSD OTU ids be WikiData Q numbers.

Meh... I'm not sure that's the right choice. But I may be an outlier in that.

What do you mean by type? Specimens? Type specimens don't define biological concepts,

Individual type specimens don't, but sets of types (as proxied through ProtonymIDs expressed as a heterotypic synonymy) most certainly do! I was at a meeting held at Smithsonian back in the 1990s, where this basic topic of discussion was focused in the context of FGDC Metadata Standards (of all things). Walter Berendsohn, Stan Blum, Bob Peet and a few of the other early workers in this space were there. I outlined different levels of granularity with which one could define the boundaries of a taxon concept/circumscription:

Sets of individual organisms (e.g., explicit material examined)
Individual populations (usually proxied by geographic distributions)
Sets of individual characters (morphological and/or molecular characteristics)
Sets of type specimens, including among a heterotypic synonymy, as proxied by Protonyms (I hadn't yet coined that word in this context, but that's what I meant)

The last of these is obviously the least granular, and some might argue (therefore) the least useful. But in the 2+ decades since then, it has become more and more obvious to me that defining taxon circumscription boundaries through sets of type specimens (proxied by ProtonymIDs, as included in an asserted heterotypic synonymy) is the best path forward. As my wife once said, "It's better to be vaguely correct than precisely wrong". And sets of heterotypic synonyms (as proxies for their corresponding type specimens), while vague, are almost purely objective in nature, and as such are in the realm of "facts" (I strongly support the point by @mjy about assembling and growing sets of objective facts). Also, one can never enumerate, extrinsically, all of the individual organisms (recently dead, still alive, and yet to be born); so there is always an implied non-explicitly-enumerated set of organisms that should be included within the circumscription. I've also never been a fan of the character-based approach, because you always get the odd mutant individual that happens to lack some key diagnostic character which, technically, would fall outside the circumscription (even if both its parents fell within).

Even if no heterotypic synonyms are provided, you can still infer the scope of the circumscription as inclusive of all organisms up to but not including the most recent common ancestor of the nearest relative/protonym/type specimen that I regard as *not* within the circumscription (i.e., the other related taxa recognized as valid). For those of us who are OK with paraphyletic taxa, it's a little more complex (but not much).

Anyway, this same basic idea was fleshed out in even more detail with @mdoering and @dremsen in the latter's living room (same gathering that produced the whiteboard image posted at the top of this thread). We were close then, and we're still close now. I keep participating in these conversations (as well as the ones happening in parallel in the tdwg/tnc group, and elsewhere), because I keep hoping that maybe "this time" we'll actually have a breakthrough and reach consensus. I had almost given up all hope, but I have to say that both this thread, and the direction happening over at tdwg/tnc, has boosted my optimism that maybe -- maybe -- we're getting close to consensus on some of this stuff!

Phew, that diatribe took me from breakfast all the way to lunch! Again, sorry for the long post, but there was a lot to cover from what y'all wrote while I slept.

P.S. If I didn't quote/comment on the above, then you can pretty safely assume that I'm a "+1" on the rest of the comments in this thread.

Lots to think about here, and I've some reading to do. As a side note I wanted to comment on identifiers. There are bigger hills to die on, and I know I was just begging to be slapped for bringing up uninomials as identifiers - see also comments on Taxonomic concepts: a possible way forward, - but a few thoughts (and I don't want to derail broader discussion, feel free to completely ignore this).

By "hierarchical" identifiers I had in mind the notion of URLs as API, that is, how would someone query the data, and couldn't those queries be expressed as URLs that also serve as identifiers? This leads to a clean interface that gives people the answers they are looking for, and a way to automatically cite the identifier for that information.
I also wanted a way to emphasise that I don't think all the concepts being discussed are the same thing. For example, the whiteboard diagram could be interpreted as six things that are all of the same type, whereas I see three paths (graphs) with some points (nodes) along the way. It seemed easier to make that case if I used identifiers that explicitly identify wholes and their parts. Bit like having identifiers for journals, journal issues, articles, and parts of articles.
I'm not particularly wedded to the notion of uninomials (or some variation on them), my motivation there was to have something that is human readable and familiar (for example, so I can use them when doing diffs between trees and quickly understand what is going on). Despite what people may think, I suspect having short, friendly identifiers matters when trying to sell the idea to people, and it also means we can draw on earlier discussions in the field where people have confronted the problem of identifiers for species. There's a literature that goes back at least to the 1930's, and has been revived in the last few decades in the context of the phylocode. In other words, when presenting these ideas to other taxonomists we can say "look, this is an issue that our field has known about for a long time, it's not just the ramblings of a few computer obsessed geeks trying to make your life difficult".
I think the notion of opaque identifiers is often misunderstood. It's not that identifiers shouldn't contain information, it's that a consumer shouldn't expect to be able to interpret that information reliably. In other words, if I have an identifier such as a SICI or a DOI that contains an ISSN, it is likely that the ISSN is the ISSN of that journal containing that article, but it might be since changed. If I have an identifier that contains an integer n, it's likely that an identifier with n + 1 is more recent (e.g., Wikidata), but this need not be the case. It's not an injunction to not embed information, it's a warning not to interpret the identifier as informative.
I think some have interpreted the notion of opaque identifiers as grounds for having obfuscated identifiers, such as UUIDs. In other words, let's make damn sure people can't interpret the identifier (and there may be good reasons for that). I think arguments that identifiers are only designed to be read by machines not by people miss the point - in order for identifiers to be useful they have to be adopted by people, be they developers, users, etc. Identifiers such as DOIs have gained widespread acceptance partly because they are highly visible, and in most cases pretty easy to read. You just have to look at the number of times publications break identifiers embedded in text (e.g., UUID based LSIDs, long DOIs) by inserting line break characters in the middle to realise that the choice of identifier syntax matters.

I guess I'm arguing that it is easy to be dogmatic and say that:

Identifiers should always be opaque
Identifiers should only be designed for machines not people
Identifiers shouldn't be hierarchical

but I think things are more nuanced than that.

Anyway, back to reading the stuff that matters...

Let me just remind everyone that this issue is about what defines a taxon concept in the CoL. The definition of a unique taxon concept in the CoL defines what algorithm we need to compute equality and thus stable ids.

The problem is not that we don't have enough identifiers. The problem is that we have too many. All of the boxes on that diagram get a separate TNU identifier. The problem is that there are dozens/hundreds/thousands of potentially relevant other TNUs for congruent circumscriptions, each with their own identifiers, and it's not clear which one to use. For example, there may be many publications that all represent the same set of heterotypic protonyms as shown in box 5 of the whiteboard diagram. Which one becomes the "identifier" for that box? The chronologically first to synonymize A. xus under A. bus? The one who provided the most robust taxonomic treatment? What people seem to want is a single identifier for "Box 5", which may serve as hub for dozens of individual TNUs all asserting the same/congruent taxon. This is where "TNU as surrogate for concept/circumscription" gets messy, and requires a third party to "elevate" one of the TNUs as the surrogate/proxy for the "box". But in any case, the problem isn't a lack of identifiers to use for concepts. The problem is an overabundance of them.

Surely there are many ids and even more usages out there. But that is not what the CoL is about. Our main problem is comparing previous editions of the CoL with the latest version about to be released and assessing which id to release each taxon under.

The other use case is the Clearinghouse, where we keep many external "checklist" datasets that can act as a source for the CoL, but don't have to. These lists (mostly taxonomic trees) come with their own usage ids and we retain them (in contrast to GBIF ChecklistBank where new integer ids are issued). In order to navigate across datasets we have a names index that allows us to find the same name across datasets, even, for example, if the authorship was spelled slightly differently. Similarly we want to establish a taxon concept index that can be used to find equal concepts across datasets without requiring them to use the same accepted name. I am well aware there are many definitions for both a unique name and taxon concept. For very valid reasons. But for our implementation we need to select one definition that can be used to set up the names and concept index.

As said before, as a starter we will probably try to use the set of protonyms to build the taxon concept index. We are not trying to perfectly model the world of taxonomy and publications. We need something workable in a reasonable amount of time.

As for the style of identifiers we want to use see https://github.com/CatalogueOfLife/backend/issues/491

every usage gets an identifier, so the question to ask is not "is this the same as that?" it's what usage (if any) do I want to refer to?

+1 (is it possible to add a "+5"?)

In that case @rdmpage should be happy about CoL and ALA issuing new identifiers all the time in every release. But most people including Rod seem to hate that.

Yes, we will have different ids for a name and a usage.

If by this you mean "different classes of identifiers", PLEASE consider this carefully. I went through a LOT of painful mistakes when I did the same thing back in the 1990s; and when I saw the light that Protonyms are a subset of TNUs, the implementation side got MUCH easier.

That is one thing I would like to roll back if I could start again. Separating names and usages seems more of an idealistic thing. So far I do not see any benefits over just having NameUsage instances that have joined properties. And the implementation got way more complex with having names and usages separated.

@mdoering

every usage gets an identifier, so the question to ask is not "is this the same as that?" it's what usage (if any) do I want to refer to? +1 (is it possible to add a "+5"?)

In that case @rdmpage should be happy about CoL and ALA issuing new identifiers all the time in every release. But most people including Rod seem to hate that.

Because I think the identifier most people will want is a set of usages, not any particular one. A bit like this thread, I can point to an individual comment https://github.com/CatalogueOfLife/general/issues/6#issuecomment-678767729 or the whole thread https://github.com/CatalogueOfLife/general/issues/6. My view is that in most cases, the whole thread ("taxon") is what people will refer to, they'll refer to a comment ("usage") if they feel the need for that level of specificity.

I think this is why people like to link to names, they have enough specificity (that name) and yet enough slop (all mentions of that name). I think ideally taxon ids would have similar attributes, perhaps with more resilience as they needn't change with changes in name. Otherwise there is limited incentive to link to them (a lot of the work I did in 2018 to link to ALA is now broken because ALA doesn't value identifier stability as much as I do).

@mdoering

Let me just remind everyone that this issue is about what defines a taxon concept in the CoL. The definition of a unique taxon concept in the CoL defines what algorithm we need to compute equality and thus stable ids.

OK, we've had our fun now. Apologies for hijacking this thread.

Regarding the specific issue you ask about, can I suggest framing it slightly differently? Presumably you have a classification already (CoL-now). Based on aggregating the data, you have a new classification (CoL-future) that you want to release. You want to assign identifiers to taxa in that new release.

For example, currently you have for Opisthotropis balteata (Cope, 1895) the id http://www.catalogueoflife.org/col/details/species/id/e5b7c4081a35d451a9c187e327793765 based on the Reptile database for 2015-12-15. When you ingest the latest Reptile checklist you'll find this is now in the genus Trimerodytes. I would retain the current identifier e5b7c4081a35d451a9c187e327793765 despite the name change - it's moved genera, but in some sense is still the same thing (for various definitions of "same", other definitions are available). Likewise, in most cases like this I would NOT change the id for the genus even if it gains or loses species; as far as the edit script is concerned those nodes don't change.

So, in practical terms, I would do a tree diff between the two classifications to find the minimum number of edits required to convert one tree into another (deletes, inserts, moves). Inserts are easy, that's a new taxon, that's a new id. Moves are typically species from one genus to another, I would retain the same id. Deletes are easy, they no longer exist (kidding). Deletes are likely to be names that are newly synonymised, but I think a way to handle that is to have the synonym as a child of the accepted name (I think you've done this before when I talked about tree edits a while back).

Now I know that most of this doesn't match the "taxon concept relationship" discussion about how much does something change before it's considered new, but I think most of that is intractable (hence this thread). But I think arriving at a release where the minimum possible number of identifiers change is going to be welcomed by those who link to CoL. The tree diff approach would also enable you to explicitly generate a list of changes (i.e, release notes). In a way by framing it as an information management question (what is the minimum number of operations to convert one tree into another) you can side-step the biological arguments -thus pissing off everyone equally ;)

Hope that is more on topic.

Thanks @rdmpage, that is indeed what I am looking for. A tree comparison is rather difficult on that scale, but let's try that out.

The requirements for a solution are:

it needs to be computable based on the data we have. This guarantees a consistent approach, allows users to understand when and why ids change and also have data at hand that explains the change. The id does not change for some opaque reason that is not encoded and visible directly in the data
it should be more stable than a name based id. The use case is to provide an identifier that moves along when the accepted name changes.

Solutions that come to my mind:

1) name based ids - the baseline. This is what we will start out with this year
2) protonym based - stick the id to the protonym and use it for its currently accepted name. This seems to be the same as @rdmpage describes in the tree diff. It requires knowing the basionym, see below
3) protonym set based on analysing the entire synonymy - requires knowing the basionym
4) name with direct parent taxon or even the entire classification. This leads to less stable ids than the name alone. But maybe it is important for users to have a different id if the classification has changed?

As CoL traditionally has not asked for the basionym of a name, it will take a while until we get that information for the majority of names. It is unlikely we will ever know it for all names. But we can augment the GSD information with nomenclators or even other datasets? It is also often rather obvious from the authorship and can be (provisionally) inferred in a large number of cases.
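The authorship-based inference mentioned above could be sketched roughly as follows, assuming zoological-style author strings in which parentheses mark a recombination. The `parse_name`, `crude_stem` and `guess_basionym` helpers are invented for illustration, the candidate names are made up (only the Opisthotropis name comes from this thread), and gender endings are handled by a very crude stem comparison, so any result is provisional at best.

```python
def parse_name(name):
    """Split 'Genus epithet Authorship' into its parts."""
    genus, epithet, authorship = name.split(" ", 2)
    recombined = authorship.startswith("(") and authorship.endswith(")")
    author = authorship[1:-1] if recombined else authorship
    return {
        "genus": genus,
        "epithet": epithet,
        "author": author,
        "recombined": recombined,
    }


def crude_stem(epithet):
    """Strip a few common Latin gender endings (deliberately naive)."""
    for ending in ("us", "um", "a"):
        if epithet.endswith(ending) and len(epithet) > len(ending) + 2:
            return epithet[: -len(ending)]
    return epithet


def guess_basionym(name, candidates):
    """Return the single plausible basionym, or None if ambiguous."""
    target = parse_name(name)
    if not target["recombined"]:
        return name  # no parentheses: presumably the original combination
    matches = []
    for candidate in candidates:
        parsed = parse_name(candidate)
        if (
            not parsed["recombined"]
            and parsed["author"] == target["author"]
            and crude_stem(parsed["epithet"]) == crude_stem(target["epithet"])
        ):
            matches.append(candidate)
    return matches[0] if len(matches) == 1 else None


# Made-up candidate names, for illustration only.
candidates = [
    "Aus balteatus Cope, 1895",
    "Aus balteatus Smith, 1950",
    "Bus balteatus (Cope, 1895)",
]
print(guess_basionym("Opisthotropis balteata (Cope, 1895)", candidates))
```

Returning `None` when zero or several candidates match keeps the guess provisional, leaving those cases for a nomenclator or a human to resolve.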

@mdoering Makes sense to me. Getting basionyms will be a hurdle in some cases, but they are often guessable from the names (as you've been doing for the GBIF taxonomy), and some databases (IPNI and IndexFungorum) explicitly link to basionyms.

If I understand the tree diff approach correctly, then really the only new ids would come from adding nodes (taxa). Moving nodes doesn't change ids, only their relationships change. This makes life simple, but is unlikely to please those who regard taxa as defined, for example, by extension (set of descendants). Perhaps a solution is to store the edits made, so that you can retrieve each node affected by an edit (e.g., a species moving from one genus to another is a deletion from one genus and an addition to another). People could then subscribe to that series of edits and update their own definitions of taxa accordingly.

But back to the topic, regarding scalability, I've not investigated the performance of the code I wrote with Gabriel Valiente forest, but I presume it would be straightforward to partition the CoL classification by major taxonomic group in a divide and conquer approach. Of course, there may also be other/better algorithms and/or tools available.

I am often wrong and never entirely right but will use a made-up story to illustrate the key points in my understanding of what should and should not count as properties of a taxon concept when minting and changing identifiers for them for the COL. My story involves three of us, variously fictionalized. It assumes Rich maintains the COL fish GSD and Markus and I are fish biologists of dubious reputation.

I caught a fish. It's a specimen.

Rich and Markus and I all assess my specimen.

Rich looks at it and says "I don't know how you got this snorkeling in Woods Hole but this is a specimen of Chromis abyssus." When Rich does this, and to paraphrase a past Rich Pyle, he is saying "The specimen you caught is conspecific to the type specimen that I collected and is thus congruent with the concept I had when I caught my specimen and formally described it as a new taxon according to the rules of the code." So my specimen, according to Rich, is an instance of that concept. The concept itself didn't change.

Markus looks at the specimen and sees it a bit differently. He insists Rich has misidentified my specimen and that it is actually a different species, Chromis margaritifer. I don't know why Markus thinks this. But Rich's concept of abyssus still has not changed.

Remsen says "yeah, but look at the tail! Rich said nothing about the tail having a spot" and insists that it's a new species. Rich says "Pfft, not sure that's a spot. I have seen them before." and his next revision of his GSD makes no mention of me and my delusions. My concept doesn't count. He does make a notation of Markus' observation when he updates his GSD.

Chromis abyssus, Pyle 2001 (accepted name)
Chromis margaritifer Fowler, 1946, sec Döring 2020, (chresonym) misidentification.

In doing so, he is saying that the fish Markus identified as margaritifer is really just another abyssus. It's not a synonym because the specimen was not a type. So it's a misidentification, according to Rich, and the citation is a so-called chresonym.

But I'm not done. I do some research, some DNA barcoding, and make a bunch of fancy drawings. I write it all up. I put my specimen (holotype) in a jar and publish my paper in the journal, Calodema, carefully following the rules of the Code. According to those rules, Remsen's concept has now entered the realm of taxonomy and the taxon "Chromis hawkeswoodii" becomes a real (short-lived) species.

During his next revision, Pyle's annotated checklist, published through Aphia, begrudgingly contains some new entries.

Chromis abyssus, Pyle 2001
Chromis hawkeswoodii, Remsen 2020 (heterotypic synonym)
Chromis margaritifer Fowler, 1946, sec Döring 2020, (chresonym) misidentification. #this is removed by the COL because it is not a 'real' synonym and should not be used to improve recall in search.

Pyle's concept of abyssus hasn't really changed. Pyle would assert that the holotype of hawkeswoodii is conspecific to abyssus. As we saw, he did that originally when my fish was just a specimen, prior to me doing all this work and turning it into a holotype. But since I went to the effort to enter my concept and associated specimen into the pantheon of taxonomy by following the nomenclatural rules, his concept has changed. It includes the original 'protonym', a nomenclatural term which is really a proxy for Rich's novel concept. It also now contains Remsen's concept of hawkeswoodii and its associated protonym and holotype. Protonym count went from one to two. Concept changed.

This is essentially how I interpreted the litany of taxonomic publications I reviewed when trying to develop an inclusive taxonomic model with computable concepts. I'm not saying it's right. But I will say it was useful for:

Improving recall in search by providing a list of synonyms that will reduce false negative returns in search.
Improving precision by:
2a. establishing an index of 'concepts' with distinct 'protonym circumscriptions' (remember these are proxies for concepts that are always imperfectly described in publications)
2b. establishing an identifier that could be applied to a data object and that would distinguish one concept from another, separately identified concept labeled with the same name.
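To make the "protonym count went from one to two" step concrete, here is a small sketch, assuming a concept is approximated by the set of protonyms it includes and that misidentifications (chresonyms) are left out of that set. The record layout and the `protonym_circumscription` helper are invented for illustration.

```python
# Statuses that contribute a protonym (and hence a type) to the concept.
# Which statuses belong here is an assumption of this sketch.
CONTRIBUTING = {"accepted", "homotypic synonym", "heterotypic synonym"}


def protonym_circumscription(entries):
    """Return the set of protonyms that circumscribe a checklist entry."""
    return frozenset(
        e["protonym"] for e in entries if e["status"] in CONTRIBUTING
    )


first_revision = [
    {"protonym": "Chromis abyssus", "status": "accepted"},
    {"protonym": "Chromis margaritifer", "status": "misidentification"},
]
next_revision = first_revision + [
    {"protonym": "Chromis hawkeswoodii", "status": "heterotypic synonym"},
]

before = protonym_circumscription(first_revision)
after = protonym_circumscription(next_revision)

print(sorted(before))   # one protonym
print(sorted(after))    # two protonyms
print(before != after)  # the computable concept has changed
```

Under this reading, recording the misidentification leaves the circumscription untouched, while the new heterotypic synonym changes it.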

@rdmpage we will always keep the history, so you can use the taxon id and go back in time to see what it looked like in the CoL in a specific edition. So you can get the entire history for a concept as it appeared in the CoL. That allows people to link to just the id, which takes them to the most recent version of it. Or they link to a specific edition of the CoL, for which results will be immutable. I think that should give users enough freedom to select the kind of id they need for their purpose.

@rdmpage :

By "hierarchical" identifiers I had in mind the notion of URLs as API, that is, how would someone query the data, and couldn't those queries be expressed as URLs that also serve as identifiers? This leads to a clean interface that gives people the answers they are looking for, and a way to automatically cite the identifier for that information.

Yes, I could definitely get on board with that. I guess whenever I see the word "identifier", I immediately jump to a notion that places most emphasis on "globally unique". Among the things I like about databases are precision and a lack of ambiguity. Part of my infatuation with UUIDs is that when I throw something like 8bdc0735-fea4-4298-83fa-d04f67c3fbec into a resolver engine (Google, ZooBank), there is no ambiguity on a global scale exactly what I'm interested in. Another part is opacity, along the lines of the point made earlier by @mjy

However, more in line with your point, I agree with you that URLs as API also function as identifiers of sort. For example, when I emulated your proposed identifier system in ZooBank:

http://zoobank.org/Search?search_term=abyssus+Pyle+Earle+Greene+2008

Sure enough, I got only one result. In fact, the same is true when I limited it to only the first author: http://zoobank.org/Search?search_term=abyssus+Pyle+2008 [Incidentally, I checked for uniqueness in GNUB using only the first author name, instead of all author names, and I ended up with a nearly identical result of 96.6% uniques; so first author is just as good for this purpose as all authors.]

With a little bit of alteration to the website code, I could make ZooBank follow the "I'm Feeling Lucky" principle and go directly to the record if there is only one result. I could also tweak the code to eliminate the explicit (and unnecessary) "Search?search_term=" bit, so the URL could just be zoobank.org/abyssus+Pyle+2008. [NOTE: I stripped the http prefix on non-functional URLs, so GitHub wouldn't create hyperlinks out of them.]

In that sense, the identifier "zoobank.org/abyssus+Pyle+2008" would indeed be functionally equivalent to http://zoobank.org/8bdc0735-fea4-4298-83fa-d04f67c3fbec. I don't think I would go so far as to index "[abyssus+Pyle+2008] sameAs [8bdc0735-fea4-4298-83fa-d04f67c3fbec]" in bioguid.org; but that doesn't mean your point about URL-APIs as human-friendly identifiers that work 96.6% of the time isn't useful. And sure, I could relax my own idea of the word "identifier" to even think of this as an identifier.

As for "hierarchical", I'm not entirely sure I understand what you mean in that sense, but perhaps what you mean is that instead of "abyssus+Pyle+2008", you could start with just "abyssus" (as in, "zoobank.org/abyssus"). In ZooBank, you'd get four results:

Chromis abyssus Pyle, Earle & Greene, 2008
Derolathrus abyssus Yamamoto & Parker in Yamamoto, Takahashi & Parker, 2017
Parabaeus abyssus Austin, 1990
Rhinecanthus abyssus Matsuura & Shiobara, 1989

So then you'd need to go to the next level, with something like: zoobank.org/abyssus/Pyle That would get you down to one result, and a likely winner.

So, having no idea what you meant by "hierarchical", I'm imagining my own version of a "hierarchical" API/Identifier system that starts with the first tier of only the epithet, which by itself would (remarkably) get you only one result about 75% of the time. In the 25% of cases where it's ambiguous from the epithet only, going to the next tier and adding the first author name only will get you a single result about 93% of the time. And, as already mentioned, adding the year expands that to 96.6% singletons. Just out of curiosity, using the year as the second tier (instead of author) yields almost exactly the same result as only the author (93% singletons).
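A rough sketch of that tiered lookup, assuming a path of the form `epithet/first-author/year` and a tiny in-memory list built from the four ZooBank results listed above. The `resolve` function and the record layout are hypothetical; nothing here reflects how ZooBank itself is implemented.

```python
NAMES = [
    {"genus": "Chromis", "epithet": "abyssus",
     "first_author": "Pyle", "year": 2008},
    {"genus": "Derolathrus", "epithet": "abyssus",
     "first_author": "Yamamoto", "year": 2017},
    {"genus": "Parabaeus", "epithet": "abyssus",
     "first_author": "Austin", "year": 1990},
    {"genus": "Rhinecanthus", "epithet": "abyssus",
     "first_author": "Matsuura", "year": 1989},
]

# Order of the tiers in the path: each extra segment narrows the match.
TIERS = ("epithet", "first_author", "year")


def resolve(path, names=NAMES):
    """Resolve 'epithet[/author[/year]]' to the list of matching names."""
    segments = path.strip("/").split("/")
    matches = list(names)
    for field, value in zip(TIERS, segments):
        matches = [
            n for n in matches if str(n[field]).lower() == value.lower()
        ]
    return matches


for path in ("abyssus", "abyssus/Pyle", "abyssus/Pyle/2008"):
    hits = resolve(path)
    print(path, "->", [h["genus"] + " " + h["epithet"] for h in hits])
```

A resolver built this way could redirect straight to the record whenever exactly one match remains, and otherwise list the candidates for the next tier.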

OK, I'm rambling now, and so far have only responded to the first point of the first response to my post, and I see there's a lot more yet to read. And it's not even within the scope of this particular thread, as noted by @mdoering

So I'll stop now, as I need to get ready to go out for a dive with my son; but when I come back I'll read through all of the new posts, and will strive to come up with a MUCH more concise and coherent reply.

OK, I lied. One more reply before I go diving.

@mdoering :

Surely there are many ids and even more usages out there. But that is not what the CoL is about. Our main problem is comparing previous editions of the CoL with the latest version about to be released and to assess under which id to release the taxon under.

This is why I've been pushing so hard for CoL to move to a TNU model, rather than some sort of fuzzy "name" model. Like all Meta-Authorities (including all the GSDs that provide content to CoL), it should not be in the business of making statements along the lines of "Aus bus is a valid species" and "Aus xus is a synonym of Aus bus". Instead, it should be making statements along the lines of "We follow Jones 2019 for Aus bus". Because Jones 2019 treated Aus xus as a junior synonym of Aus bus, the synonymy is automatically inherited from the statement.

On a more technical level, here's how it should work: CoL (via GSDs) should anchor all names of valid species to Protonyms. You already have the content to do this, even if you don't have the full literature citation details of the original description. GNUB can provide the UUIDs to every Protonym in CoL -- I can accomplish that in a weekend or two. As long as the GSDs have their own unique identifier, they don't need to incorporate the Protonym UUIDs because CoL (or better yet, BioGUID.org) can maintain the cross-link index. If GSDs don't have persistent unique identifiers... well, then perhaps it's time to retire those GSDs from CoL (or focus on upgrading those GSDs).

So, CoL then becomes an index of all the world's Protonyms that represent valid species. This Index then needs to have only one other piece of information attached to each Protonym record: The TNU for the treatment that "gets it right" for this taxon.

Yes, I know that GSDs don't provide this information, and it's impractical to get them to do so anytime soon. But my point is that the ProtonymID + AcceptedTNUID model should be the defined endpoint for where CoL should be heading. It will never get there if you don't start exploring the actual mechanism to do so. I agree: it's not at all feasible to apply this to all names across all taxa (and all GSDs). But there is a non-trivial amount of content that it could be applied to. All fishes, for example. At the very least, you could explore this as a "Proof of Concept" approach embedded within a more generalized approach, for the subset of records where ProtonymID+AcceptedTNUID are available; while still maintaining the less effective method for recognizing changes based on the combination of [Immediate Parent]+[Heterotypic synonymy]+[Rank]+[Orthography] approach.

Ultimately, CoL should not be in the business of minting its own identifiers. Instead, it should be a broker of TNU identifiers, putting a "gold star" on selected TNUs that serve as surrogates/proxies for the "box", in which all other TNUs sharing identical [Immediate Parent]+[Heterotypic synonymy]+[Rank]+[Orthography] patterns are placed.
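The fallback comparison on [Immediate Parent]+[Heterotypic synonymy]+[Rank]+[Orthography] might be sketched as below, assuming each taxon record exposes those four properties as plain fields. The field names, the `fingerprint` and `changed_properties` helpers, and the use of SHA-256 (purely as a compact way to compare records) are all assumptions of this example.

```python
import hashlib
import json

PROPERTIES = ("parent", "heterotypic_synonyms", "rank", "name")


def normalise(taxon):
    """Keep only the four compared properties, in a canonical form."""
    return {
        "parent": taxon["parent"],
        # Sort so that the order of synonyms never affects the result.
        "heterotypic_synonyms": sorted(taxon["heterotypic_synonyms"]),
        "rank": taxon["rank"],
        "name": taxon["name"],
    }


def fingerprint(taxon):
    """Hash of the four properties; equal hashes go in the same 'box'."""
    blob = json.dumps(normalise(taxon), sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def changed_properties(old, new):
    """List which of the four properties differ between two records."""
    a, b = normalise(old), normalise(new)
    return [p for p in PROPERTIES if a[p] != b[p]]


before = {
    "parent": "Aus",
    "heterotypic_synonyms": [],
    "rank": "species",
    "name": "Aus bus",
}
after = dict(before, heterotypic_synonyms=["Aus xus"])

print(fingerprint(before) == fingerprint(after))
print(changed_properties(before, after))
```

Reporting which property changed, rather than only that the hash differs, gives users a visible reason for any identifier change.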

I know that's a long way into the future, but if that is defined as the end point now, it will make the road to get there all the smoother.

Offtopic!

@deepreef TNUs != OTUs. The former are handled in TW by NOMEN+Citations IIRC. OTUs are what WikiData are doing, just an anonymous QID + data, some names, some not, I think. Requiring nomenclature to define biological concepts doesn't universally work (bacteria, genetic species concepts), so why not abandon this approach from the get go (don't answer here). In TW we embrace OTUs. Users define a list of OTUs to export to their GSD. We crawl the list of OTUs to find out what nomenclature should come with the list.

On topic, but not very constructive

Good luck with the tree diff approach. Note that AFAIK CoL doesn't really manage a classification as I think @rdmpage envisions they do. Until very recently they didn't even return some of the commonly used ranks. The classification that does exist is human constructed based on the Editor appending sectors onto a tree.

I assume that a more complete classification for the purposes here will be built by algorithm. I assume it will have all the same issues GBIF's does. So take that into account when you assume stability of identifiers embedding information derived from algorithms. For example, one species of tenebrionid appearing in 4 kingdoms by the time it gets to GBIF collapses the consensus, to use another tree-based concept.

Oh, you'll also need to embed versioning into the whole system, as the algorithm will clearly evolve as you struggle to find any use for it. Each commit to the algorithm will render past identifiers for concepts meaningless, as it will no longer have the same rules, and trying to figure out what changed between versions with respect to species concepts will only be useful as a sadistic test for graduate students taking computer science prelims. ;)

@mjy I'm not quite so pessimistic, but don't have data to argue the point. The tree diffs needn't operate on CoL itself, they could be applied to the input classifications from the source databases (e.g., the reptile database mentioned above).

@rdmpage Right, point taken, larger classification not needed. I think given this fact there is nothing preventing the experiment to start right now:

Grab a GSD for each of a small, medium, and large clade. Repeat 10 times, drawing data from the 10 years of GSD submissions (these are variously archived). @gdower can likely help get the datasets if someone wants to try.
Run the experiment. Visualize the results.
Tweak the metric, optimize for maximum stability, repeat.

I.e. there are no bottlenecks beyond time to this experiment.
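One possible stability metric for such an experiment, sketched under the assumption that each archived release has been reduced to the set of identifiers it contains. The `id_stability` function, the choice of metric (fraction of the previous release's ids that survive into the next one), and the toy data are illustrative only.

```python
def id_stability(previous_ids, current_ids):
    """Fraction of ids in the previous release that are still present."""
    if not previous_ids:
        return 1.0  # nothing to lose; treat as fully stable
    retained = previous_ids & current_ids
    return len(retained) / len(previous_ids)


def stability_series(releases):
    """Stability between each pair of consecutive releases."""
    return [
        id_stability(prev, curr)
        for prev, curr in zip(releases, releases[1:])
    ]


# Toy data standing in for three yearly GSD submissions.
releases = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"a", "b", "f", "g"},
]

print(stability_series(releases))
```

Running the same series under different identifier rules (name based, protonym based, and so on) would give directly comparable curves to visualize.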

Back from diving, and lots to think about/discuss. But quick for right now to @mjy "Off Topic", which I actually think is very much "On-topic", because IIUC (not sure if that's a thing, = "If I Understand Correctly"), @mdoering is trying to answer the broad question "When do I mint a new CoL Identifier, vs. when do I modify properties associated with an existing identifier?" (CMIIW, @mdoering ). The simple answer to that question is, "When the concept/circumscription is different!" But that's not a very useful answer, because we haven't yet answered the prerequisite questions, "What is a concept?", "Is it the same as a circumscription, or different?", and more to the point, "What are the core properties of a concept/circumscription such that a change in one of these properties results in an implied different concept/circumscription?"

So, in that context, the clarification that "TNU != OTU" is both very helpful and very relevant to these prerequisite questions.

To start, a bit of clarification of my own. Although the "N" part of TNU is often assumed to be a Linnean-style scientific name (and that's where most of our focus has been), that's not necessarily the only context in which the "N" part applies. There's been some discussion of this over at tdwg/tnc, but I would certainly include some classes of non-Linnean names (and some advocate for opening it to all text-string labels, including vernaculars/etc.) But the point is, Linnean-style nomenclature is absolutely not required for TNUs to work either. But I'm pretty sure that the "T" part of TNU is the same as an OTU (if not, then CMIIW).

So here are some questions about OTUs in this context (i.e., the WikiData notion of it, as adopted by TW):

Do they always have some sort of text-string label associated with them? I'm assuming the QID at least, but is that the only way to cite them?
What properties of the "Data" part help you determine whether you're dealing with a new instance of an existing QID-branded OTU, vs. an OTU that requires the minting of a new QID?

I think these questions are on-topic for the issue raised by @mdoering, because if a CoL "thing" is the same as a WikiData/TW OTU "thing", then understanding the logic behind when a new QID is minted for an OTU vs. when an existing one is amended might directly address the same question in the CoL context.

PS, Before I wrote the above, I didn't know that "CMIIW" was a thing, but evidently it is. I also just now learned that IIUC is a thing too.

Your simple answer is useful, Rich, because it's a good start. You mint a new identifier if, and only if, the concept changes. Anything else and your identifier must be referring to something else. A concept changes when something is added to it or removed from it. What is that something? It's clear that one answer, at least, is that a concept changes when you add other taxa to it or split taxa (new or previously included) from it.

Warning, off topic sensu my take on requirements for #6, includes themes repetitive with previous spewing by me

@deepreef we all have ideas about how identifiers should be minted for OTUs, @dremsen's ideas are perfectly fine. We know that we need new IDs for new concepts.

Any aggregating system in place to track all concepts on Earth must be set up to handle the simplest of all cases: the curator of the GSD provides a unique identifier that they assert will only change if their curated concept changes. If you don't have a data model for this basic use case in place, then you're not handling the best case scenario.

To me, the OTUid (QID for example here, but really could be a big UUID, whatever - just no meaning plz) coming from the curator of a GSD (these species concepts don't just come from nowhere, they come from blessed lists of various quality as curated by a human) is the single best way to track differences. If the curator changes the id, they understand that they are asserting a new taxon concept. The way we teach them to think about this is that if you had concept A, and you did science 1, and then concept B, and science 1, you hypothesize that you might get a different answer. We force curators to think of list of OTUs, not list of names because the CoL is a list of OTUs, and the names we can use to get near to them.

I wish I had my philosophy of science notes from undergrad back in front of me. The course so elegantly pointed out all the problems trying to uniquely identify things. Definitions based on sets, expanding and contracting definitions, all chairs and not chairs, etc. etc. All of them failed in some cases. This is extremely well understood philosophically. The exercise here would fit right into one of those bodies of thought. What to do then? At the end of the day, what you need are meaningful units. What is a meaningful unit in our case? The thing you can do science with. What thing? A species concept, something "real". That unit, gets a single, anonymous ID, Q, or other meaningless URI, etc.

To your questions:

Do they always have some sort of text-string label associated with them? I'm assuming the QID at least, but is that the only way to cite them?
- The QID is how to uniquely identify them. How we localize to that concept (localizing to information being a very useful concept that should be embraced IMO) can happen in many different ways, names, "things that are red", hyperlinks, printing the QID, it doesn't matter as long as one has a reasonable path that works in most cases (not even all, I understand biology is vast and tricky).
What properties of the "Data" part help you determine whether you're dealing with a new instance of an existing QID-branded OTU, vs. an OTU that requires the minting of a new QID?
- A) In part this doesn't matter presently, because we're not asking our GSD providers to actually provide data, just localizers to some concept that isn't uniquely identified. So, given the data in CoL alone I can't do anything to determine if the concept has changed (thus my rants that @mdoering is being asked to do the impossible). B) If they were providing actual data defining the concept then as a scientist I would look to see if that data is suitable for the hypothesis I am testing. For example I am not selecting morphological species concepts to hypothesize about rates of molecular evolution. To assume we can do good science in the absence of understanding (for example by proxy of an algorithm defined identifiers) is foolhardy IMO. This is particularly critical at scale, i.e. across all species on Earth. If the CoL starts minting ids under one namespace for all its aggregated data from its data-sources of various quality, then what's going to happen? Those who are doing "robust" science are going to assume CoL has done due-diligence and things with similarly namespaced IDs identify things with similar meaning. They don't, of course we'd never claim this. So, the basic thing I'd like to see is to have the curators, the people that make GSDs possible, put their money where their mouth is and say- "This is a species concept you can do work with, and you can do similar work with the rest of things on my list (your list may differ). You can also assume that I'll be damn sure to provide a different QID if I enumerate the list of species and come up with something different. I sure hope your global list has a basic spot to track my unique assertions that isn't some name, we know that nomenclature is nuts." This is after all the job of GSD providers (currently the sole source of data to CoL, if that changes then elements of this argument change). How do they do this? They provide a QID (or UUID, etc., some globally unique ID) to uniquely identify their concept.
QID stays the same, and names change, @mdoering knows that's just nomenclatural mumbo jumbo. QIDs differ, and names stay the same- similar mumbo jumbo. Different concepts identified by different systems of IDs? Well, get your science on, localize to those concepts and figure it out, that's about as best you can do. Any taxonomist does this naturally without thinking, it seems very very strange to me to pretend this isn't necessary. It also seems very very strange to me to hide this hard work with algorithm based identifiers so ecologists can do bad science. We can do better, we can trust GSDs and give them simpler ways to uniquely identify their concepts. TW is trying to do this, and there have been many cases where because we've linked data to the right concepts, things (data curation actions) have become trivially simplified. For example specimens are determined as OTUs. When nomenclature facts are added the specimens don't MOVE. They don't change OTUs. The facts are added to the nomenclature, and the OTU is pointed to its current valid nomenclatural name. You can stack as many OTU determinations on specimens as you want, each can reference the nomenclatural facts as they were presented. At the end of the day a curator has specimens under some current OTU, and they have the history of determination separate from the history of nomenclature, or inter-twined in a timeline display if needed. Exactly how nomenclature is supposed to work IMO. Now, if you linked specimens to names instead of the proxy OTU concept (which too many systems do presently), and you had to split or merge the names, you'd have to somehow decide which name keeps the links to the specimens, and which doesn't. Ugh! This is just one example of the nice division of labour we get when we have an OTU table, and manage nomenclature as nomenclature.
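A toy sketch of that division of labour, assuming an OTU table keyed by an opaque id, specimen determinations that point at OTUs, and a separate nomenclature log. None of this reflects TaxonWorks' actual schema; the table layout, the `record_name_change` and `current_name` helpers, and the names used are made up for illustration.

```python
# OTU table: opaque id -> pointer to the currently valid name.
otus = {"otu-1": {"valid_name": "Aus bus Smith 1950"}}

# Determinations link specimens to OTUs, never directly to names.
determinations = [("specimen-42", "otu-1")]

# Nomenclatural history is kept separately from determination history.
nomenclature_log = []


def record_name_change(otu_id, new_name, source):
    """Add a nomenclatural fact and repoint the OTU; specimens stay put."""
    old_name = otus[otu_id]["valid_name"]
    nomenclature_log.append((otu_id, old_name, new_name, source))
    otus[otu_id]["valid_name"] = new_name


def current_name(specimen_id):
    """Name reached via the specimen's most recent OTU determination."""
    otu_ids = [o for s, o in determinations if s == specimen_id]
    return otus[otu_ids[-1]]["valid_name"]


name_before = current_name("specimen-42")
record_name_change("otu-1", "Cus bus (Smith 1950)", source="Pyle 2015")
name_after = current_name("specimen-42")

print(name_before, "->", name_after)
print(determinations)  # unchanged: the specimen never moved
```

The determination list is untouched by the name change, which is the point of keeping the OTU as the proxy between specimens and nomenclature.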

TLDR - I don't believe we can do better without a different data model at the core ("anonymous" nomenclature free concepts), and better tools and processes for GSD curators.

@dremsen 👍 I think that's exactly right! But when you say "add other taxa", at the species level what that means is that you are adding another heterotypic synonym, which means you're adding a new type specimen to the concept. However, it's not that simple. First, there are all the OTUs that don't have Linnean-style names. I fully agree with @mjy that requiring [Linnean-style] nomenclature to define biological concepts doesn't universally work. So the "type specimens as boundary markers for concept circumscriptions" can only go so far (i.e., can only really work in the context of taxa signified with Linnean-style names anchored to name-bearing types).

Second, consider this scenario:

Smith 1950 describes the new species Aus bus from specimens in Hawaii (TNU: Aus bus Smith 1950 sensu Smith 1950)
Jones 2010 describes the new species Aus xus from specimens in Palau, and declares its closest relative to be Aus bus Smith 1950 (TNU: Aus xus Jones 2010 sensu Jones 2010; TNU: Aus bus Smith 1950 sensu Jones 2010)
Pyle 2015 treats Aus xus as a synonym of Aus bus (TNU: Aus xus Jones 2010 sensu Pyle 2015; with TNU: Aus bus Smith 1950 sensu Pyle 2015 as a heterotypic synonym)

We've got five TNUs here, four of which represent taxa asserted to be valid. The fifth TNU is Pyle's assertion that the type specimen of Aus xus is conspecific with the type specimen of Aus bus, and because Aus bus has priority, his (Pyle's) concept is labelled as "Aus bus", but it includes both Jones' concept of Aus bus and Jones' concept of Aus xus (not always the case, but for sake of simplicity, let's say it's true in this case).

So, suppose the 2009 CoL has ID1234 associated with Aus bus, which we'll infer to be Aus bus Smith 1950 sensu Smith 1950.

Now Jones comes along in 2010 and names Aus xus, so CoL mints a new ID9876 for Aus xus Jones 2010 sensu Jones 2010 to include in its 2011 Catalogue.

Here's the kicker: Does CoL issue a new ID for Aus bus? If so, why? How would CoL ever know whether this is a case of Aus bus being "split" into two species by Jones, or it's just a new discovery of a new sister-species (Aus xus) to the already established Aus bus?

The problem is that Smith 1950 didn't examine any specimens from Palau, so we have no idea whether he would have included specimens from Palau within his circumscription of A. bus, or if he would have agreed with Jones that the Palauan species is different. So at this stage, CoL can't decide, based on the information it has, whether its representation of Aus bus needs a new ID, or can keep using the same ID.

However, suppose that CoL has a TNU-based model, and for its 2009 catalogue it anchored the record for Aus bus to the treatment of Remsen 2005 (TNU: Aus bus Smith 1950 sensu Remsen 2005). With a little bit of @nfranz - style sleuthing, we discover that Remsen examined specimens from Palau and declared them to be Aus bus. Now we have a good idea that CoL had defined its record for Aus bus s.l., so by recognizing a portion of this circumscription in the form of Aus xus Jones 2010, we know that a new s.s. circumscription is needed for the CoL record of Aus bus, and thus a new ID is created for Aus bus s.s. to distinguish it from the earlier CoL record with ID1234.

Of course, it's rarely the case that there are only two alternatives, so "s.l." vs. "s.s." is kind of useless. A MUCH better approach is to, instead of "sensu lato" and "sensu stricto", CoL explicitly uses "sensu Remsen 2005" and "sensu Jones 2010" (respectively).

The problem, though, is that it takes a bit of @nfranz - style sleuthing to make this determination, and CoL can't incorporate that information into its records. However, it can make something of Aus bus Smith 1950 sensu Pyle 2015, because this TNU also reveals the second type-specimen-by-proxy of the protonym link embedded within Aus xus Jones 2010 sensu Pyle 2015 pointing to Aus xus Jones 2010 sensu Jones 2010.

If anyone actually followed that, I'm deeply impressed (I had to re-read it several times myself, and I still probably screwed something up). But here's the short summary point: With a TNU model, you can do a pretty powerful job reasoning/computing backwards through time (e.g., comparing Aus bus Smith 1950 sensu Pyle 2015 to Aus bus Smith 1950 sensu Jones 2010), but it's much harder to reason forward in time (e.g., comparing Aus bus Smith 1950 sensu Smith 1950 to Aus bus Smith 1950 sensu Jones 2010).
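A small sketch of that asymmetry, assuming a TNU is modelled as a (protonym, sec) pair and that each treatment's synonymy is stored as explicit synonym-to-accepted links. The `TNU` tuple, `SYNONYM_OF` map and `compare` helper are invented for this example; only the five TNUs themselves come from the scenario above.

```python
from collections import namedtuple

TNU = namedtuple("TNU", ["protonym", "sec"])

bus_smith = TNU("Aus bus Smith 1950", "Smith 1950")
xus_jones = TNU("Aus xus Jones 2010", "Jones 2010")
bus_jones = TNU("Aus bus Smith 1950", "Jones 2010")
xus_pyle = TNU("Aus xus Jones 2010", "Pyle 2015")
bus_pyle = TNU("Aus bus Smith 1950", "Pyle 2015")

# Synonym TNU -> accepted TNU, as asserted within one treatment.
SYNONYM_OF = {xus_pyle: bus_pyle}


def included_protonyms(tnu, synonym_of=SYNONYM_OF):
    """Protonyms (type proxies) that fall inside a TNU's circumscription."""
    protonyms = {tnu.protonym}
    protonyms.update(
        syn.protonym for syn, acc in synonym_of.items() if acc == tnu
    )
    return frozenset(protonyms)


def compare(first, second):
    """Relate two TNUs by the protonym sets they include."""
    a, b = included_protonyms(first), included_protonyms(second)
    if a == b:
        return "same protonym set"
    if a > b:
        return "first includes second"
    if a < b:
        return "second includes first"
    return "overlapping or disjoint"


print(compare(bus_pyle, bus_jones))   # backwards in time: informative
print(compare(bus_smith, bus_jones))  # forwards in time: uninformative
```

The second comparison reports identical protonym sets even though Smith's and Jones's circumscriptions may differ (the Palau question), which is exactly the information protonyms alone cannot supply.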

OK, more to come, but I'm approaching this one point at a time.

And it seemed we were getting so close to a resolution of the issue...

My sense from this discussion is that there are (at least) two different approaches to the topic.

1. ids should reflect our knowledge of taxa, and two taxa have the same id only if they are the same. If taxa change, they get a new id. I note that agreement on the meaning of same and change seems, um, elusive (cue numerous "A. us, A. bus" discussions), but I digress. Hence with each iteration you want ids that faithfully reflect current taxonomic understanding, and hence reflect changes in taxa (however defined). One consequence is that downstream users of these ids (e.g. people linking to them in their own databases) will be faced with regular changes to some (most?) ids.
2. ids should be as stable as possible so that they provide a reliable basis for external linking (e.g., by downstream users, Wikidata, etc.). Hence with each iteration, the goal is to minimise changes in ids. Downstream users will be able to link with confidence that the id is likely to be stable, with the proviso that what the id represents may itself have changed in ways that some users would consider meaningful (e.g., a genus has acquired additional species from another genus).

I am not sure we can do both, so I think the real question is which outcome best reflects CoL's goals? I'm guessing it's no surprise that I value identifier stability (2) more than fidelity to particular taxonomy (1), so I would vote for 2. This also means that I regard the "A. us, A. bus" discussions as essentially beside the point. The likelihood of me using CoL identifiers is mostly a function of their stability and how interconnected they are with other identifiers.

Obviously, if faithful representation of what the ids point to matters more (e.g., you can't accept that a genus with the same name but different component species can have the same id), then you will favour 1, and then the crucial issue is defining a set of criteria for determining identity of taxa (@mdoering's original question before the rowdy neighbours turned up with alcohol and music).

In a sense these aren't completely different positions (obviously 2 still depends on some notion of same, in this case similarity of edges in the graph i.e., parent → child pairs having the same labels) but it seems to me that 1 is effectively blocked in the absence of agreement on the operational meaning of same. Likewise as @mjy has pointed out, advocates of 2 (e.g., me) have argued for a simple tree diff approach without demonstrating a working system.
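To make "similarity of edges in the graph" concrete, here is a minimal sketch of comparing two releases by their labelled parent → child pairs. It is not a working CoL system; the release data, the family name "Aidae" and the function names are invented for illustration.

```python
def edges(tree):
    """Return the set of (parent_label, child_label) pairs in a release.

    `tree` maps each child label to its parent label (None for a root).
    """
    return {(parent, child) for child, parent in tree.items() if parent is not None}


def diff_releases(old, new):
    """Compare two releases by their labelled parent -> child edges."""
    old_edges, new_edges = edges(old), edges(new)
    return {
        "kept": old_edges & new_edges,      # identical edge in both releases
        "removed": old_edges - new_edges,   # edge only in the old release
        "added": new_edges - old_edges,     # edge only in the new release
    }


# Hypothetical releases: "Aus xus" is moved to a new genus "Cus".
release_old = {"Aidae": None, "Aus": "Aidae", "Aus bus": "Aus", "Aus xus": "Aus"}
release_new = {
    "Aidae": None,
    "Aus": "Aidae",
    "Cus": "Aidae",
    "Aus bus": "Aus",
    "Cus xus": "Cus",
}

result = diff_releases(release_old, release_new)
```

Under approach 2, only nodes touched by "removed" or "added" edges would even be candidates for an id change; whether the genus Aus keeps its id after losing a species is exactly the policy question being debated.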

So, in summary, it seems to me that there are two separate goals here: fidelity to changing concepts vs stability of identifiers in the face of change.

@mjy

Any aggregating system in place to track all concepts on Earth must be set up to handle the simplest of all cases: the curator of the GSD provides a unique identifier that they assert will only change if their curated concept changes. If you don't have a data model for this basic use case in place, then you're not handling the best case scenario.

The CoL model does handle that, obviously. But CoL deals with very heterogeneous data from a wide range of sources (we prefer to avoid the term GSD as the sources are often not "global" and also not limited to "species"). Some do have ids, some do not have any at all. And what they represent we hardly ever know. It might be database records that change their id by some evil algorithm. It might be name identifiers, it might be "OTU" identifiers. We do not know. But even if we did have ids for OTUs from each and every source, they would never apply the same methods or rules for defining a concept. It differs between larger taxonomic groups: it might be more molecular driven, it could be more or less phylogeny driven, it could follow more of a splitter or lumper philosophy. It surely is never consistent. You could argue we do refer to the original source and can just forward the responsibility for the idea of a concept to them. But for an end user the CoL becomes even more heterogeneous, and they would have a hard time understanding what that id means and whether they can trust it for their purpose. My main reasons for having a genuine, consistent CoL identifier based on some agreed method are:

consistent identifiers which share the same meaning and behavior on inserts/updates/deletions that can be predicted
missing source ids: we will have to cater for missing source identifiers in any case
edited concepts: even though we borrow from the sources, the CoL is still an edited product, and classifications and even to some degree the exact name and status can deviate from the source. For the extended catalogue that we are building we will even aggregate information from several sources for the same taxon, e.g. add a missing reference, more synonyms or vernacular names.
missing evidence: the CoL data model, and most often also the sources, lack information on the exact concept description. If a change happens, the data seen by the end user might be exactly the same! What good is that to a user if he cannot tell apart the concepts from the data he gets? A changed id for a bytewise identical record hardly makes any sense.
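One way to read the "missing evidence" point is as a guard: never change an id when the record the user sees is identical. The sketch below assumes records are plain dictionaries of user-visible fields; the field names and example records are hypothetical, and it says nothing about when a visible change should trigger a new id.

```python
import hashlib
import json


def fingerprint(record):
    """Hash a record's user-visible fields in a canonical key order."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def visibly_changed(old_record, new_record):
    """True only if an end user could tell the two records apart."""
    return fingerprint(old_record) != fingerprint(new_record)


# Hypothetical records: `after` has the same content in a different key order.
before = {"name": "Aus bus", "authorship": "Smith 1950", "synonyms": []}
after = {"synonyms": [], "authorship": "Smith 1950", "name": "Aus bus"}
revised = {"name": "Aus bus", "authorship": "Smith 1950", "synonyms": ["Aus cus"]}
```

Here `visibly_changed(before, after)` is false, so an id change between those two releases would be invisible to users, while `visibly_changed(before, revised)` is true and the usual rules would then decide what happens to the id.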

It would be an option to use a hybrid approach and treat sources differently. We could mark manually selected sources as having properly curated taxon identifiers and blindly follow their changes while others fall back to the default CoL provided ids.
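A rough sketch of that hybrid option, under stated assumptions: the source names, the id formats and the fallback rule (reuse a CoL id for the same name plus authorship) are all invented here, and the real default CoL algorithm is precisely what this issue is trying to define.

```python
from itertools import count


class IdAssigner:
    """Follow curated source ids where trusted, else fall back to CoL ids."""

    def __init__(self, trusted_sources):
        self.trusted_sources = set(trusted_sources)
        self._next = count(1)
        self._by_key = {}  # (name, authorship) -> CoL-minted id

    def assign(self, source, source_id, name, authorship):
        # Trusted source: adopt its identifier, including any change it makes.
        if source in self.trusted_sources and source_id is not None:
            return f"{source}:{source_id}"
        # Fallback (hypothetical rule): one stable CoL id per name + authorship.
        key = (name, authorship)
        if key not in self._by_key:
            self._by_key[key] = f"col:{next(self._next)}"
        return self._by_key[key]


assigner = IdAssigner(trusted_sources={"sourceA"})
curated = assigner.assign("sourceA", "1234", "Aus bus", "Smith 1950")
first = assigner.assign("sourceB", None, "Aus xus", "Jones 2010")
again = assigner.assign("sourceB", None, "Aus xus", "Jones 2010")
```

The point of the design is that users can tell from the id prefix which behaviour to expect: source-prefixed ids change whenever the curator says the concept changed, while `col:` ids follow whatever consistent rule CoL agrees on.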

Reptile DB btw does not even expose their identifiers but instead recommends to link to their species pages by their search by name URL: http://reptile-database.reptarium.cz/species?genus=Trimerodytes&species=yunnanensis

All: This is probably the most useful discussion I've had in months (if not years), because it actually feels like we're getting somewhere on a topic where wheels have been spinning and spinning. So fair warning and apology, much more to come.

Here I want to "see" the hypothetical from @dremsen and "raise" it into an actual, real-world example from my dive today. But first, a nit-pick:

to paraphrase a past Rich Pyle, he is saying "The specimen you caught is conspecific to the type specimen that I collected and is thus congruent with the concept I had when I caught my specimen and formally described it as a new taxon according to the rules of the code."

The first part of that is right, but the "congruent" part is a bit off. I would actually phrase it as:

"In my taxonomic opinion, t~~T~~he specimen you caught is conspecific to the type specimen that I collected, and ~~is thus congruent with~~ thus falls within the species-level concept I had when I ~~caught my specimen and~~ formally described ~~it as~~ a new taxon and established my specimen as the name-bearing type according to the rules of the code. Because no other earlier-established name-bearing type falls within my concept, then the correct name for my concept, and thus your specimen, is Chromis abyssus."

Just because I include your specimen within the same circumscription that I have in my head for C. abyssus doesn't mean that my concept is necessarily congruent with any other concept.

Anyway, getting back to the real-world example. This is a cropped frame grab from a video I took today:

[Image: Chromis_pacifica_Hawaii]

It's in the same genus as the one in the @dremsen hypothetical (Chromis), but this one lives shallow and is probably the most common species of its genus in many places where it lives.

As an Ichthyologist born and raised on Hawaiian reefs, I have no trouble identifying this as Chromis agilis, described by Smith, 1960 (see Protonym in ZooBank). Don't take my word for it, check it out yourself.

Here it is in CoL.

CoL cites FishBase as the source database (GSD), where the online resource is. Going to that link reveals a distribution map showing broad distribution across the Indo-Pacific, and cites Allen 1991 as the "Main Reference". The record in WoRMS is derived from the same source.

Here is the record in ITIS. And here it is in Catalog of Fishes.

This is about as stable as taxonomy gets. At least it was... until last week, when this was published.

You can read the PDF if you want, but the short story is that Allen & Erdmann came to the conclusion that the Pacific populations represent a different species from those in the Indian Ocean. The type specimen of C. agilis is from the Seychelles, and it turns out that the taxonomy has been so stable since 1960, that no synonyms have ever been described from anywhere else (including the Pacific). So Allen & Erdmann decided to describe the new species [Chromis pacifica], based on a type specimen collected in the Coral Sea.

So... we have 33 TNUs in GNUB hooked into the Protonym for C. agilis:

Here's the challenge: How many OTUs are there? Is this the same as the number of CoLID values there should be? What additional information would you need to determine how many OTUs?

In my proposed pathway to salvation, I would have CoL harvest one more piece of information from the GSD source record for C. agilis: the TNU for FishBase's "Main Reference". It's in the list above as Chromis agilis Smith, 1960 sensu Allen 1991. It would take me about one weekend to hook all the existing CoLID values derived from FishBase into the corresponding GNUB Protonyms and FishBase "accepted" TNU values.

In the next cut of FishBase that is imported into CoL, you would note two things: 1) The addition of a new Protonym for Chromis pacifica Allen & Erdmann, 2020 sensu Allen & Erdmann 2020 2) A new "Main Reference" from FishBase in their record for C. agilis, pointing to Chromis agilis Smith, 1960 sensu Allen & Erdmann 2020

Thus, CoL would mint a new ID for C. pacifica (because it's a new name not previously imported into CoL), and would mint a new ID for C. agilis (because the "accepted" TNU from the source GSD changed).
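The minting rule just described can be sketched as a comparison between the previous import and the next one. This is only an illustration: the `col:` ids are made up, the data structures are assumptions, and the unchanged "Aus bus" row is added purely to show the no-change case.

```python
def plan_ids(previous, incoming):
    """Decide, per protonym, whether to keep the CoL id or mint a new one.

    previous: protonym -> (col_id, accepted TNU at the last import)
    incoming: protonym -> accepted TNU in the new source release
    """
    actions = {}
    for protonym, accepted_tnu in incoming.items():
        if protonym not in previous:
            actions[protonym] = "mint: new name"
        elif previous[protonym][1] != accepted_tnu:
            actions[protonym] = "mint: accepted TNU changed"
        else:
            actions[protonym] = "keep " + previous[protonym][0]
    return actions


# Hypothetical ids; the TNU labels follow the example in the text.
previous = {
    "Chromis agilis Smith, 1960": ("col:1", "sensu Allen 1991"),
    "Aus bus Smith 1950": ("col:2", "sensu Remsen 2005"),
}
incoming = {
    "Chromis agilis Smith, 1960": "sensu Allen & Erdmann 2020",
    "Chromis pacifica Allen & Erdmann, 2020": "sensu Allen & Erdmann 2020",
    "Aus bus Smith 1950": "sensu Remsen 2005",
}

plan = plan_ids(previous, incoming)
```

Note that the rule never inspects the circumscription itself; it trusts the source's choice of "Main Reference" as the signal that the concept changed.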

In the long run, CoL would stop minting IDs altogether, and simply make statements along the lines of:

"With regard to Protonym Chromis agilis Smith, 1960 sensu Smith 1960, we defer to FishBase, who follows Chromis agilis Smith, 1960 sensu Allen & Erdmann 2020."

You could cache a bunch of other metadata, of course, but the core service provided by CoL would be an endorsement of Chromis agilis Smith, 1960 sensu Allen & Erdmann 2020, determined via FishBase.

OK, more on the rest of @dremsen 's hypothetical post in a moment.

@mdoering

Reptile DB btw does not even expose their identifiers but instead recommends to link to their species pages by their search by name URL: http://reptile-database.reptarium.cz/species?genus=Trimerodytes&species=yunnanensis

Yes, which strikes me as bad design, made worse by the result if you use the URL that would have applied to this species before: http://reptile-database.reptarium.cz/species?genus=Sinonatrix&species=yunnanensis :

Species Sinonatrix yunnanensis was not found! You can try find it as synonym, or use advanced search for searching it other way.

Given that Reptile DB knows that Sinonatrix yunnanensis is a synonym of Trimerodytes yunnanensis I don't understand why they can't just take you to Trimerodytes yunnanensis !?
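For illustration only, the redirect I'd expect looks something like the sketch below. This is not how Reptile DB is implemented; the lookup tables and function name are assumptions, and only the two names from the example are used.

```python
def resolve(name, accepted, synonyms):
    """Follow synonym links until an accepted name is reached.

    Returns None if the name is unknown or the links form a cycle.
    """
    seen = set()
    while name not in accepted:
        if name in seen or name not in synonyms:
            return None
        seen.add(name)
        name = synonyms[name]
    return name


# Hypothetical lookup tables built from a checklist dump.
accepted = {"Trimerodytes yunnanensis"}
synonyms = {"Sinonatrix yunnanensis": "Trimerodytes yunnanensis"}

target = resolve("Sinonatrix yunnanensis", accepted, synonyms)
```

A name-based URL handler could call `resolve` first and redirect to the page for `target` instead of reporting "not found".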

Anyway, the integer ids I use are in the database dumps, and on first glance seem stable across releases of the Reptile DB checklist.

@dremsen :

Chromis abyssus, Pyle 2001
Chromis hawkeswoodii, Remsen 2020 (heterotypic synonym)
Chromis margaritifer Fowler, 1946, sec Döring 2020, (chresonym) misidentification. #this is removed by the COL because it is not a 'real' synonym and should not be used to improve recall in search.

The only way that third one has any place in this discussion about circumscriptions/concepts is if you're going for the extrinsic approach of defining concepts/circumscriptions by enumerating lots and lots of individual organisms. We must have different interpretations of the meaning of "chresonym" (a term I've never liked, or used); because I do not see that third one as a chresonym. I don't even see it as playing a role in taxonomy. It's a dispute about the identification of a particular organism, which is a whole different thing from reasoning across taxon concepts.

Pyle's concept of abyssus hasn't really changed. Pyle would assert that the holotype of hawkeswoodii is conspecific to abyssus. As we saw, he did that originally when my fish was just a specimen, prior to me doing all this work and turning it into a holotype. But since I went to the effort to enter my concept and associated specimen into the pantheon of taxonomy by following the nomenclatural rules, his concept has changed. It includes the original 'protonym:' a nomenclatural term which is really a proxy for Rich's novel concept. It also now contains Remsen's concept of hawkeswoodii and its associated protonym and holotype. Protonym count went from one to two. Concept changed.

Exactly! This is what I was trying to get at. We don't know the relationship between Chromis abyssus, Pyle 2001 sec Pyle 2001 and Chromis abyssus, Pyle 2001 sec Pyle 2020. But we do know the relationship between Chromis abyssus, Pyle 2001 sec Pyle 2020 and Chromis abyssus, Pyle 2001 sec Remsen 2020 (assuming Remsen regarded C. abyssus as a valid and distinct species). That's because both Protonyms are referenced in both publications, so there is computable logic here. We don't know how Chromis abyssus, Pyle 2001 sec Pyle 2001 relates to the others unless we do some @nfranz -level sleuthing.
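Here is a minimal sketch of that "computable logic", assuming each treatment is reduced to a map from every protonym it mentions to the protonym it is placed under. The representation, function names and relationship labels are my own simplification, not the GNUB data model.

```python
def circumscription(treatment, accepted):
    """Protonyms that a treatment places under the given accepted protonym."""
    return {p for p, placed in treatment.items() if placed == accepted}


def relate(t1, t2, accepted):
    """Relate two usages of `accepted`, judged only by the protonyms cited."""
    c1, c2 = circumscription(t1, accepted), circumscription(t2, accepted)
    # Only computable if both treatments address every protonym involved.
    if not (c1 | c2) <= (t1.keys() & t2.keys()):
        return "unknown"
    if c1 == c2:
        return "same protonyms"
    if c1 > c2:
        return "first is broader"
    if c1 < c2:
        return "first is narrower"
    return "overlapping"


# Treatments from the hypothetical: protonym -> protonym it is placed under.
pyle_2001 = {"Chromis abyssus": "Chromis abyssus"}
pyle_2020 = {
    "Chromis abyssus": "Chromis abyssus",
    "Chromis hawkeswoodii": "Chromis abyssus",
}
remsen_2020 = {
    "Chromis abyssus": "Chromis abyssus",
    "Chromis hawkeswoodii": "Chromis hawkeswoodii",
}

backward = relate(pyle_2020, remsen_2020, "Chromis abyssus")
forward = relate(pyle_2001, pyle_2020, "Chromis abyssus")
```

With this data `backward` comes out as "first is broader", because both 2020 treatments address both protonyms, while `forward` is "unknown", because Pyle 2001 never addressed hawkeswoodii. That asymmetry is the backwards-versus-forwards point in code form.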

OK, I'll stop replying until I'm caught up reading.

Dear all, What an interesting discussion! But difficult to follow, getting into it today after 60 emails at my counter… A few quick thoughts, even if I'm not sure they are relevant to this discussion...

Taxonomy knowledge versus taxonomy usage. I think we need to separate taxonomic knowledge from the focal usages/practices of taxonomy (meeting specific needs). In the digital sphere the first needs a complete formalisation of what a taxon is; the latter tolerates some ambiguity because it serves local/focal purposes. If we could achieve the first, we could easily take from it what is needed to operate the second, but if we start by focusing on the second, we'll have to reinvent the wheel each time from the local/focal objectives it is meant to serve. And thus we get the current landscape, with lots of ways (tools, identifiers, practices …) to address taxonomy according to specific interests. I already mentioned this at CoL's 2017 Woods Hole meeting and discussed it with Rich, and it is the spirit of my paper (still not published) about taxon formalisation, which several of you have already read (or reviewed): the goal should first be how to transfer/translate taxonomic knowledge into the digital sphere, even if trying to meet the needs is of course necessary. This whole discussion shows well that a complete formalisation of what a taxon is, and how to represent it digitally, is still a pending issue.
Approximation in terms. If we agree that a concept has (at least) 3 major properties (Name (N), Taxon defined by circumscription (Tc) and Taxon defined in intension (Ti)), then the taxon concept we use in the new CoL represents only Tc, not the complete taxon (T). This is a semantic shortcut we need to be aware of when looking for taxonomic identifiers beside CoL. The best the new CoL can address is identifiers for N, Tc, N+Tc, but not for T, which CoL does not address completely! There is no taxon concept in CoL. These 3 properties are, I think, clear to everyone, but there is probably a fourth one: its dynamic component (biological nature, conceptual perception).
Identifying what? Taxon identifiers are needed for the practice of taxonomy itself and for external usage of it. However, having them fixes the taxon as a static entity, while it is an evolving concept in both its biological nature and its conceptual perception. I know almost nothing about identifiers, but at least such an identifier should be able to address this paradox. Addressing the issue by any subset of N, Tc and Ti would fail to fully identify a taxon (but some subsets might be enough to answer specific needs). Using names only has been shown to be inappropriate. Using circumscription (Tc) only (or with names) remains incomplete and addresses the concept of the taxon (not the taxon!), and only part of the concept. It takes into account its usage (children taxa) and is approached by capturing the taxonomic literature. However, circumscription is not only about children taxa: each time a new reference is added, the concept of the taxon addressed also changes, because it encompasses all the biological attributes (i.e. taxon properties) associated with the specimens it groups: what Drosophila Fallén encompassed in 1830 is totally different from today in terms of children taxa of course, but also in terms of its distribution, ecology, … We are referring to the same biological entity (the taxon) but no longer to the same concept. Tracking taxon name usage is not sufficient to completely formalize the taxon as a biological entity. Similarly, each time a taxon is moved in the classification, its definition (intension) changes (the topic of my paper): we have the same biological entity (the taxon) but not the same concept. - [and by the way: tracking all changes by circumscription (= tracking all occurrences) is no less an enormous task than tracking all changes by intension (= tracking all classification changes): if we agreed to do the first (= GBIF) we could also do the second] -.

In other terms, for the usage of taxonomy we want/need a taxon identifier for the taxon as a biological entity, which is neither its name nor its concept in any of its definitions. But all of them are useful to represent part of the taxonomic knowledge in the digital sphere. As put in my paper, the Berendsohn notation "Aus bus Author, Date, sec. Author Date" (is Rich's 'sensu' a similar one?) remains for me the best way to clearly identify a taxon, even if 'sec Author Date' itself represents a concept (a concept of classification) that is also not static, is also hierarchic (sec Author Date in a higher system of classification sec Author2, Date), and evolves with progress in taxonomic/phylogenetic knowledge. A taxon identifier focusing on such a statement is probably the best solution we could have for the moment.

Species and higher levels. To meet the needs of the users of taxonomy, the focus is mainly put on the species level, which explains why the taxon concept by circumscription has been favored. Semantic shortcuts were made even in this discussion, with 'species' used instead of 'taxon', and even Dave's taxonomic story deals with species. The higher in the taxonomy changes occur, the greater the consequences are for the species taxon. Even if species is just a rank like any other rank in taxonomy, and in theory no differences should occur, species-level-only reasoning might introduce some biases, and I think that the consequences of these higher changes need to be better investigated: I'm not fully sure whether new unexpected issues might occur.

BW, Th.

CatalogueOfLife / general

Define objective rules for taxon concept identity #6

Offtopic!

On topic, but not very constructive

Warning, off topic sensu my take on requirements for #6, includes themes repetitive with previous spewing by me