Open mdoering opened 7 years ago
My recollection of the AviBase model (which could be wrong) was that everything got a distinct taxon id (even if their 'computable' circumscriptions were identical). Subsequent articulations would establish they were congruent.
@rdmpage I was thinking similar. Like in the Plazi timeline you nail down the concept by the timestamp. But concepts exist also in parallel and do not follow a sequential timeline.
Also, the main problem for me is: how do I know, when looking at the data, that the concept has changed? We are ignoring identifiers completely here, as we want to (re)assign them in the process. What warrants a taxon change, and when does it remain the same? We need to find objective rules for what the algorithm has to do. That would also give real meaning to CoL ids, as they would be based on objective rules.
@mdoering I think "when does it change and when is it the same?" leads to madness. And it's separate to the identifier issue, in that at one level every taxon that includes a given protonym would have the same identifier. Every usage gets an identifier, so the question to ask is not "is this the same as that?" it's what usage (if any) do I want to refer to? I don't know of any particularly useful way to say whether a taxon is the same or not (that doesn't quickly lead to absurdity) but you can ask whether the taxa share types. I guess I'm arguing that any approach that asks either "what is a taxon" or "when are two taxa the same" is digging a hole for itself.
Usages are much more friendly to work with if you think of publications. Then they are immutable and you have a limited number of them per name. But for dynamic databases it becomes a different problem. The system has to decide when to create new identifiers. In this sense every monthly release of the CoL would be a different usage and has its own set of taxon ids. I doubt this is very useful to anyone. It is much like ALA and what CoL used to do. Identifiers change all the time. Names are actually more stable.
But what we want are ids that are more stable than the name ids: the taxon identifier should stay the same if the concept is still the same, regardless of its accepted name. But that requires either a human to make an assertion or a machine to compare taxa for equality. There is no way we can get human assertions for millions of taxa every month. They would also be very subjective, and the rules applied to judge would differ a lot.
My goal is to start with the type specimen as the anchor for a taxon, but refine that for splits and merges by comparing adjacent taxa and their types. If you have a globally complete taxonomy and compare several versions of it (1960/70/80 in the above example), a missing protonym for A. fus tells you the A. fus you are dealing with is from 1960. And the presence of A. bus as a pro parte synonym in 1970 for 'A. fus' tells us it's a split. So we know (1) is the union of concepts (2) and (3).
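A minimal sketch of that comparison, assuming each concept in a release is reduced to the set of protonyms it includes. The release data, the labels and the `classify` helper are hypothetical illustrations, not part of CoL, and pro parte synonyms (the s.l. vs s.s. case) are not modelled.

```python
from typing import Dict, FrozenSet, List

Concept = FrozenSet[str]  # a concept approximated by the protonyms it includes


def successors(old: Concept, new_release: Dict[str, Concept]) -> List[str]:
    """Labels of concepts in the new release that share a protonym with `old`."""
    return sorted(label for label, prots in new_release.items() if old & prots)


def classify(old: Concept, new_release: Dict[str, Concept]) -> str:
    """Describe what happened to `old` between two releases."""
    hits = successors(old, new_release)
    if not hits:
        return "dropped"
    if len(hits) == 1:
        new = new_release[hits[0]]
        if new == old:
            return "unchanged"
        if old < new:
            return "broadened"  # e.g. a merge pulled in more protonyms
        if new < old:
            return "narrowed"
        return "overlapping"
    # Several successors: a split if together they cover the old concept.
    union = frozenset().union(*(new_release[h] for h in hits))
    return "split" if old <= union else "overlapping"


# Hypothetical releases: a broad concept is later divided in two.
older = frozenset({"fus", "bus"})
newer = {"A. fus": frozenset({"fus"}), "A. bus": frozenset({"bus"})}
print(classify(older, newer))
```

The only input is the synonymy of each release, so the result can be derived from the data without human assertions.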
The goal is to create stable taxon ids as anchor points to link identifications to. The current name can then happily change and if a split or merge happens the id will change and the identification is still referring to the old broader or narrower concept.
Well, the base for the taxon is the set of types. In the diagram this already identifies 4 out of 5 concepts. The A. bus s.l. vs s.s. is not covered and they would both fall into the same id. But maybe that is still a good start for CoL. It's simple, straightforward to implement and definitely a step forward from name-based identifiers, which most systems have, including the current CoL and GBIF. And it's easy to communicate and, most importantly, it can be derived from the data.
Markus, thanks for those clear statements. This is the direction I also favor.
"Fundementally doomed indeed. :("
Types frantically for 20 minutes, then deletes everything.
same
Can anyone answer whether a taxon concept sensu the title of this issue has biological attributes?
I.e. do you imagine that following triples are part of the system:
my_taxon_concept_uri has_some height
my_taxon_concept_uri has_color purple
my_taxon_concept_uri eats snails
No they clearly won't. No traits and description based circumscriptions are planned to be in CoL. And when I write about types we can manage type specimens, but I doubt we ever list them for all species. So using the protonym as a type proxy is what will be done.
@mjy
Can anyone answer whether a taxon concept sensu the title of this issue has biological attributes?
I.e. do you imagine that following triples are part of the system:
my_taxon_concept_uri has_some height
my_taxon_concept_uri has_color purple
my_taxon_concept_uri eats snails
I can't answer for the thread, but I only got into this now because this is the issue that arises in Wikidata. People are adding attributes like these to Wikidata "taxa" when it seems clear that many such "taxa" are names not taxa (in the sense that homotypic synonyms may have their own Wikidata items, so clearly "taxa" aren't always "taxa").
So I guess where you are going with this is what do we hang attributes on? I was imagining that, again, we could have a hierarchy of identifiers. You could just hang it on the name, or the protonym (in the sense of the complete set of usages of the name), or on a specific usage (equivalent to "sensu Fred").
@mdoering
Well, the base for the taxon is the set of types. In the diagram this already identifies 4 out of 5 concepts. The A. bus s.l. vs s.s. is not covered and they would both fall into the same id. But maybe that is still a good start for CoL. It's simple, straightforward to implement and definitely a step forward from name-based identifiers, which most systems have, including the current CoL and GBIF. And it's easy to communicate and, most importantly, it can be derived from the data.
I wonder if part of the problem is the notion of "concept" and that each box in the diagram needs its own identifier of the same "class". Put another way, I would have three "paths" or timelines, one for each type. Three "protonym" identifiers, one for each. Each identifier points to the entire history of each type, and events along the way are marked on those timelines. Each one of those events gets an identifier ("usage"). So you can still refer to A. bus s.l. or A. bus s.s. by referring to a given usage. Now, some of these paths will intersect in the sense that someone may say that these two things are heterotypic synonyms, so the graph would need the option of having an edge between two paths (I think this is essentially what the Australian NSL does in their model).
I'm still waving my arms around here (can you tell? ;) ) but I do wonder if part of the problem is seeing things as boxes rather than as timelines.
@mdoering
Usages are much more friendly to work with if you think of publications. Then they are immutable and you have a limited number of them per name. But for dynamic databases it becomes a different problem. The system has to decide when to create new identifiers. In this sense every monthly release of the CoL would be a different usage and has its own set of taxon ids. I doubt this is very useful to anyone. It is much like ALA and what CoL used to do. Identifiers change all the time. Names are actually more stable.
Argh, no! Firstly, why change everything, why not simply add new ids for those things that have been updated? In other words, you're doing a diff and saying these three records changed, everything else is the same. Secondly, these things would (if they are changes to existing records) simply be a new usage along the path for that "protonym". The protonym itself would be stable. Someone linking at that level of resolution (e.g., "I don't care about the details, it's Drosophila melanogaster as far as I'm concerned") wouldn't be affected.
Put another way, we're talking about versioning, and it seems to me the only sensible way to deal with that is enable people who don't care about versions (which, I'd guess, will be 99% of people) to have something stable to point to (which is more or less what names have done for us), and give those who do care something they can point to.
I was imagining that, again, we could have a hierarchy of identifiers. You could just hang it on the name, or the protonym (in the sense of the complete set of usages of the name), or on a specific usage (equivalent to "sensu Fred").
Yes, we will have different ids for a name and a usage. But we want to keep the CoL usage ids stable across versions (currently monthly), so we need to figure out what is still considered the same taxonomic concept. For names it's simpler, but even there it is not obvious, as it still requires a clear definition of a ScientificName.
The current version of CoL only generates stable ids for names.
Argh, no! Firstly, why change everything, why not simply add new ids for those things that have been updated? In other words, you're doing a diff and saying these three records changed, everything else is the same.
That again raises the question of exactly which properties belong to a usage. We are dealing with a graph where everything is connected. What are the boundaries of the object to version? Is the entire classification included or just the parentID as in our model? What about children? What about the structured reference of the publication? Does every single distribution record, vernacular name etc. have an impact too? If that's all in, you easily end up having new versions all the time.
Secondly, these things would (if they are changes to existing records) simply be a new usage along the path for that "protonym". The protonym itself would be stable.
Are you sure the protonym is stable? It is in theory, and because it has been published on paper. But in a database world? See above for a usage's boundary. Adding a vernacular name could break it. I guess it would rather be the protonym's name id.
Put another way, we're talking about versioning, and it seems to me the only sensible way to deal with that is enable people who don't care about versions (which, I'd guess, will be 99% of people) to have something stable to point to (which is more or less what names have done for us), and give those who do care something they can point to.
Yes
@mdoering I haven’t kept up with CoL’s data structure, but naively I would have said that the “concept” is the latest name + reference combination (e.g., A. bus + DOI:10.1234/xyz), and if there’s not a more recent usage then the id for the latest “concept” would be unchanged. I put “concept” in quotes because it seems that everyone has a different idea of what that is.
It’s also not clear to me who the intended users are, and what their expectations would be. Clarifying that presumably would affect what identifiers to expose.
On 22 Aug 2020, at 09:59, Markus Döring notifications@github.com wrote:
I was imagining that, again, we could have a hierarchy of identifiers. You could just hang it on the name, or the protonym (in the sense of the complete set of usages of the name), or on a specific usage (equivalent to "sensu Fred").
Yes, we will have different ids for a name and a usage. But we want to keep the CoL usage ids stable across versions (currently monthly), so we need to figure out what is still considered the same taxonomic concept. For names it's simpler, but even there it is not obvious, as it still requires a clear definition of a ScientificName https://github.com/CatalogueOfLife/general/issues/35.
The current version of CoL only generates stable ids for names.
@mdoering
Argh, no! Firstly, why change everything, why not simply add new ids for those things that have been updated? In other words, you're doing a diff and saying these three records changed, everything else is the same.
That again raises the question of exactly which properties belong to a usage. We are dealing with a graph where everything is connected. What are the boundaries of the object to version? Is the entire classification included or just the parentID as in our model? What about children? What about the structured reference of the publication? Does every single distribution record, vernacular name etc. have an impact too? If that's all in, you easily end up having new versions all the time.
I think this is a more general question about versioning graphs, and there’s a literature on that. Naively, I think in terms of edits between graphs, especially as this seems to capture the way taxonomists describe their work (e.g., “we created a new genus, and species x and y are transferred there” is essentially an edit script for transforming one graph into another). The other things you describe (publications, distribution records - really?) can either be treated as separate nodes, or as metadata (I gather there are ways to version graphs that treat node properties separately from nodes). But just because everything is connected doesn’t mean you can’t isolate changes in pretty much the same way you can do a diff on text to isolate the insert/delete/move events.
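To make the "edit script" idea concrete, here is a small sketch, assuming a classification is stored as a child-to-parent map keyed by stable ids. The node ids, the operation names and `apply_edits` are invented for illustration and do not come from any existing system.

```python
from typing import Dict, List, Tuple

Edit = Tuple[str, str, str]  # (operation, node id, target parent id)


def apply_edits(parents: Dict[str, str], script: List[Edit]) -> Dict[str, str]:
    """Return a new child -> parent map with the edit script applied."""
    result = dict(parents)  # leave the input release untouched
    for op, node, target in script:
        if op == "add":
            if node in result:
                raise ValueError(f"{node} already exists")
            result[node] = target
        elif op == "move":
            if node not in result:
                raise KeyError(f"{node} is not in the classification")
            result[node] = target
        else:
            raise ValueError(f"unknown operation: {op}")
    return result


# "We created a new genus, and species x and y are transferred there."
before = {"genus-A": "family-F", "sp-x": "genus-A", "sp-y": "genus-A", "sp-z": "genus-A"}
script = [
    ("add", "genus-B", "family-F"),
    ("move", "sp-x", "genus-B"),
    ("move", "sp-y", "genus-B"),
]
after = apply_edits(before, script)
```

Only the three touched nodes differ between `before` and `after`; everything else keeps its parent, which is the sense in which the changes can be isolated.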
Secondly, these things would (if they are changes to existing records) simply be a new usage along the path for that "protonym". The protonym itself would be stable.
Are you sure the protonym is stable? It is in theory, and because it has been published on paper. But in a database world? See above for a usage's boundary. Adding a vernacular name could break it. I guess it would rather be the protonym's name id.
Put another way, we're talking about versioning, and it seems to me the only sensible way to deal with that is enable people who don't care about versions (which, I'd guess, will be 99% of people) to have something stable to point to (which is more or less what names have done for us), and give those who do care something they can point to.
Yes
I guess I imagine a list where the protonym is the head of the list, and you append usages (name + reference) of any name connected to the type to that list. This list, combined with lists for other protonyms, will form a graph. Snapshots at a given time will be a classification. And again, I think you can separate node properties from nodes themselves.
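One way to picture those lists is the sketch below. It assumes each usage carries a year and, when it treats the name as a heterotypic synonym, a pointer to the protonym it is placed under; the class names, fields and sample data are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Usage:
    usage_id: str
    name: str                             # name as used in the reference
    reference: str                        # citation or DOI
    year: int
    accepted_under: Optional[str] = None  # protonym id this usage is sunk into


@dataclass
class Timeline:
    protonym_id: str                      # head of the list
    usages: List[Usage] = field(default_factory=list)

    def append(self, usage: Usage) -> None:
        self.usages.append(usage)

    def latest(self, year: int) -> Optional[Usage]:
        """Most recent usage published in or before `year`."""
        known = [u for u in self.usages if u.year <= year]
        return max(known, key=lambda u: u.year) if known else None


def snapshot(timelines: List[Timeline], year: int) -> Dict[str, dict]:
    """Accepted names and their synonyms as of `year`."""
    current = {t.protonym_id: t.latest(year) for t in timelines}
    result: Dict[str, dict] = {}
    for pid, usage in current.items():
        if usage is not None and usage.accepted_under is None:
            result[pid] = {"name": usage.name, "synonyms": []}
    for usage in current.values():
        if usage is not None and usage.accepted_under in result:
            result[usage.accepted_under]["synonyms"].append(usage.name)
    return result


bus = Timeline("bus")
bus.append(Usage("u1", "Aus bus", "Ref A", 1900))
xus = Timeline("xus")
xus.append(Usage("u2", "Aus xus", "Ref B", 1950))
xus.append(Usage("u3", "Aus xus", "Ref C", 1990, accepted_under="bus"))
```

Here `snapshot([bus, xus], 1960)` lists two accepted names, while `snapshot([bus, xus], 2000)` lists one with a synonym. The protonym ids stay the same in both views.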
Yes, this is all a bit arm-wavy, but I’ve built trees for different versions of the Clements Checklist of birds using eBird ids for species, and you can clearly isolate the changes by comparing the trees using a tree-based diff. Incidentally this is only possible because eBird keeps stable ids for species independently of the current name. I’m currently trying to do the same with the reptile database where, mercifully, there are internal integer ids for species that remain stable even if the name changes.
I guess I see taxonomy as essentially a series of distributed edits on a graph, and if we capture those then we have basically captured what taxonomists generate in their work.
As an aside, here's a screenshot of a comparison between two recent classifications of snakes from the Reptile database. I use specific epithet plus internal integer id to identify each species (the id doesn't change for the species). This particular difference shows moving species from one genus to another (a move operation), taxa in light gray haven't changed. There's some added complexity that means the genus name itself has to change, but hopefully you get the idea. So I would regard things like "yunnanensis-18606" to be "protonym-style" identifiers that are linked to the complete fate of names attached to the type for that name, and for which you could recover that this species has moved from Sinonatrix to Trimerodytes (and probably other moves if we go back further in time).
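A rough sketch of how such a move can be recovered, assuming each release is reduced to a map from stable species id to genus. Only "yunnanensis-18606" and the two genus names come from the example above; the other ids and `diff_classifications` are made up, and a real tree diff would handle more than one rank.

```python
from typing import Dict, List, Tuple


def diff_classifications(old: Dict[str, str], new: Dict[str, str]) -> List[Tuple[str, ...]]:
    """List insert/delete/move operations that turn `old` into `new`."""
    ops: List[Tuple[str, ...]] = []
    for sid in sorted(old.keys() | new.keys()):
        if sid not in new:
            ops.append(("delete", sid, old[sid]))
        elif sid not in old:
            ops.append(("insert", sid, new[sid]))
        elif old[sid] != new[sid]:
            ops.append(("move", sid, old[sid], new[sid]))
    return ops


old_release = {
    "yunnanensis-18606": "Sinonatrix",
    "example-1": "Natrix",            # hypothetical unchanged species
}
new_release = {
    "yunnanensis-18606": "Trimerodytes",
    "example-1": "Natrix",
    "example-2": "Natrix",            # hypothetical newly added species
}
for op in diff_classifications(old_release, new_release):
    print(op)
```

The comparison only works because the key stays fixed while the genus changes, which is the point about ids that are independent of the current name.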
Personally I would postpone any discussion of whether a taxon is the "same", because I don't think there's a unique answer (what does "same" mean?). But if you have the history then you enable people to determine "same" or not for their definition of "same".
What I also like is we can attach a publication to this particular edit operation (i.e., the research that led to the move), so it is linked to evidence, and to the people who did the work.
Having a connected graph doesn't make things impossible, but it needs a definition or specification of what we want to version. Surely you can ignore vernacular names, distributions and the ancestral classification. We just need to agree on what is relevant.
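As a sketch of what "agreeing on what is relevant" could look like in code: fingerprint only the agreed fields of each record and compare fingerprints between releases. The field names (`name_id`, `status`, `synonym_name_ids`, `vernacular`) are assumptions for illustration, not the CoL data model.

```python
import hashlib
import json
from typing import Dict, Set


def fingerprint(record: dict) -> str:
    """Hash only the fields agreed to define a usage; ignore everything else."""
    relevant = {
        "name_id": record.get("name_id"),
        "status": record.get("status"),
        "synonym_name_ids": sorted(record.get("synonym_name_ids", [])),
    }
    payload = json.dumps(relevant, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def changed_records(old: Dict[str, dict], new: Dict[str, dict]) -> Set[str]:
    """Ids present in both releases whose relevant fields differ."""
    shared = old.keys() & new.keys()
    return {rid for rid in shared if fingerprint(old[rid]) != fingerprint(new[rid])}


old = {
    "t1": {"name_id": "n1", "status": "accepted", "synonym_name_ids": ["n2"], "vernacular": []},
    "t2": {"name_id": "n3", "status": "accepted", "synonym_name_ids": ["n4"]},
}
new = {
    "t1": {"name_id": "n1", "status": "accepted", "synonym_name_ids": ["n2"], "vernacular": ["x"]},
    "t2": {"name_id": "n3", "status": "accepted", "synonym_name_ids": ["n5", "n4"]},
}
print(changed_records(old, new))
```

Adding a vernacular name to `t1` leaves its fingerprint alone, while the extra synonym on `t2` changes it. Which fields belong in `relevant` is exactly the open question, so the list here is a placeholder.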
And for CoL we want to provide one definition of a concept. Even if it is not universal and there will be legitimate reasons to have other ids, I strongly believe it is very useful to have a global taxonomy with some sort of stable taxon identifiers that people can hang things like identifications on, as long as the definition of what the id refers to is clear to everyone, and thus also when it changes and when it does not.
Having a stable protonym anchor is nice, but it's too imprecise for many purposes (splits/merges). And having a long list of usages is again inflationary as long as we do not have concept relations between them. My goal is to provide identifiers for linking information that lie in between the two and are more stable than name-based ones.
@mdoering where I'm going with this is that if you (or the CoL team) can't answer that question concretely/precisely then you will never get further towards answering this question.
Even if you can answer this question concretely I'm around 99% positive that you can not do what you want to do given the data you are given by the GSDs (no biological data, most without OTU ids). You have been tasked to do the impossible by the CoL. I'm serious. How, possibly, can you provide something more stable than the incoming data if those data don't have the requisite stability in the first place? This seems so obvious it's frustrating. We get data without the needed facts coming in, we mix it up, and VOILA BETTER FACTS. HUH!?
IMO @rdmpage is nailing all the key AHAs:
Perhaps this belongs in more of a blog format, but the coffee is flowing, so I'll post it here.
Reading back @rdmpage says " I was imagining that, again, we could have a hierarchy of identifiers.". I think we agree, but I look at it from a different angle. You don't even need a hierarchy bit (at the core), you just need an anonymous ID for stability. In WikiData the Q1234 is just fine, or maybe B12354 (where "B" is biological concept). Mint it, and surround it with facts. Any hierarchy can be added as an assertion, but it's not a central organizing principle. There, you have stability in as much as WikiData is stable, around which a concept can grow. The bonus- the WikiData identifier is resolvable, and the data there collapse down to computable statements. That concept has relationships to names, biological data, other "B"s. That's it. If people reference the Q12345, the concept will strengthen, if they cross-reference it to another system of identifiers, it will further strengthen. There is nothing magic here.
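A toy sketch of "mint it, and surround it with facts", assuming an append-only list of assertions keyed by an opaque id. The `B` prefix follows the example above; `FactStore`, the predicate names and the sample values are invented, and this is not how Wikidata is implemented.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Assertion:
    subject: str                  # opaque concept id
    predicate: str                # e.g. "has_name", "child_of"
    obj: str
    source: Optional[str] = None  # citation backing the statement


class FactStore:
    """Append-only store: facts accumulate, nothing is overwritten."""

    def __init__(self) -> None:
        self._facts: List[Assertion] = []

    def mint(self) -> str:
        # The id carries no meaning; hierarchy and names arrive as assertions.
        return "B" + uuid.uuid4().hex

    def add(self, subject: str, predicate: str, obj: str, source: Optional[str] = None) -> None:
        self._facts.append(Assertion(subject, predicate, obj, source))

    def about(self, subject: str) -> List[Assertion]:
        return [f for f in self._facts if f.subject == subject]


store = FactStore()
concept = store.mint()
parent = store.mint()
store.add(concept, "has_name", "Aus bus", source="Ref A")
store.add(concept, "child_of", parent, source="Ref B")
```

Because the hierarchy is just another assertion, a later reclassification adds a fact with its own source instead of changing the id.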
IMO things to avoid if you want to make it better:
I love the idea of seeing WikiData IDs seep into all the nooks and crannies, they are so simple. We just have to build the practical interfaces to it such that curators/taxonomists/scientists can draw from those IDs, and integrate them into concepts they work with on a day-to-day basis.
Getting back to CoL. What could be done?
IMO any other effort by the CoL is treating the illness, rather than going for the cure.
@mjy we want an algorithm that computes concept equality on the basis of stable name ids and the homo- and heterotypic synonymy given by a GSD. TW is very different in that it is an editorial system. Versioning is simple when you can intercept record-based changes. But imagine every change is done by bulk uploading thousands of taxa and names. You need to figure out what has changed and if it's a relevant change, unless you want to version each and every record all the time.
You can rely on stable ids from outside (WikiData, IPNI, Avibase, GSD IDs such as in WoRMS or TW, you name it) that the GSDs (re)use and then blindly trust them. But this is wishful thinking right now and we would have to drop large parts of the catalogue. The CoL is an established project that we need to continue. Even if we trusted ids from the outside they would not follow the same rules and be very different in what they mean. The CoL is an aggregation of heterogeneous sources.
That's why we decided to issue our own CoL ids (as CoL always did), based on some computable algorithm. The Taxon ID discussed here is something we have not started on, so details will only come up once we do so next year.
And really it's the same for name ids. Does any change to the record generate a new id, or do we attach ids to the idea of a published name (usage) that is fixed, but for which we can change the name record's "metadata"?
And like I said above: The basic version of such an algorithm would just look at the set of types included in the synonymy to define the concept. And in the absence of good type coverage the protonyms will be used as type proxies. Such ids might not be perfect, but have a clear definition, are stable and an improvement over pure name ids (which we also have as a different way to link to CoL).
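For illustration only, one way such an id could be computed: hash the sorted set of protonym ids found in the synonymy. This is a sketch under that assumption, not CoL's actual algorithm; `concept_id`, the separator and the id length are arbitrary choices.

```python
import hashlib
from typing import Iterable


def concept_id(protonym_ids: Iterable[str], length: int = 12) -> str:
    """Deterministic id derived from the set of protonyms in a synonymy."""
    canonical = "|".join(sorted(set(protonym_ids)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


# Same set of protonyms, different order or accepted name: same id.
assert concept_id(["p-bus", "p-xus"]) == concept_id(["p-xus", "p-bus"])

# A split or merge changes the set, so the id changes too.
assert concept_id(["p-bus"]) != concept_id(["p-bus", "p-xus"])
```

The id depends only on which protonyms are included, so changing the accepted name leaves it alone, while the s.l. vs s.s. distinction stays invisible, as noted earlier in the thread.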
@mdoering "then, blindly trust them. But this is wishful thinking right now" - So no identifier is good enough, so you'll mint your own, based on data that contains "identifiers/names" that are not good enough, and some algorithm that pulls new facts out of the air. Then, on top of that you are then asking others to trust your new identifiers and the decisions that come from them... but not those other ones. I see no problems there ;).
You're still thinking of "changes". There is no versioning, it's only accumulated facts, that's the principle TW uses. This is precisely the core of a data model CoL needs ultimately. How it populates that model is the real tricky bit (thus this issue). My argument, and I'll drop it, is that you can't get much farther than you do right now unless providers improve their data.
What do you mean by type? Specimens? Type specimens don't define biological concepts, that is an old, well-known fallacy. Type specimens anchor name priority, that's it. It's a different edge in the model (Specimen -> Name, it has nothing to do with Specimen -> OTU/Biological concept). Overloading their meaning will lead to nothing but pain in the long run ;).
All: I woke up this morning to an inbox full of really interesting and exciting posts within this thread. You all know me well enough to know that I cannot remain silent. So the only practical option for me was to read the thread in sequence, and comment accordingly. Apologies in advance for re-stating points already made (think of them as "+1"s), and for the length.
@rdmpage :
except I would junk 7 and 8.
That is basically the same conclusion I've come to over these past couple of years. It may eventually be possible to develop these areas (identifiers/classes for circumscriptions and broader "concepts"), and/or maybe other groups are better suited to pursue it than the usual gang of suspects (myself included) that keep repeating these conversations across many years. I think they have potential value, and I wouldn't shut the door on them completely. But I think we need to walk before we can run, and at the moment we're (still!) in the transition stage between crawling and walking. There have been a bunch of conversations along these lines in recent months among the tdwg/tnc group.
So, this leaves #5 and #6, namely "protonyms" and "usages" (I'm taking #1 - #4 as essentially given, maybe subject to tweaks).
Yup. Same here. Reaching back to the language I used in related comments, Protonyms are the content, and Usages are the context. Both are the same class of "Thing", because both have the exact same properties. However, distinguishing Protonyms (as a subset or subclass of all Usages; see tdwg/tnc discussion) is useful not because they represent a distinct "thing", but because they can serve in a special-case (and fundamentally important) kind of relationship with other Usages. This solves the issue raised in your recent iPhylo post. But please don't use the word "species" in this context (i.e., "...the importance of stable identifiers for species...", etc.). For every ten people who read your post, there will be 12 different ideas about what that word means in this context.
First, every protonym gets a nice, human-readable identifier, for example a combination of species epithet, author, and year.
Sure. Aus bus Linnaeus 1758. If you want to be really consistent and unambiguous and explicit, you would structure that identifier as Aus bus Linnaeus 1758 sec. Linnaeus 1758. There are pros and cons to qualifying protonyms that way, which I'd be happy to elaborate on in another post, if asked. In either case, we should call it a "canonical name-string" or something like that, so that it's immune to spelling variants, qualifiers, abbreviations, etc. that might have been represented on the actual page within Linnaeus 1758. But please, PLEASE don't assume that our electronic database systems will use these same human-friendly identifiers for internal identification purposes (e.g., foreign keys, or even urls). That would be a really bad mistake (see below).
Linked to this identifier is every homotypic synonym of that name ... This is essentially #5 (I think).
Yup. Exactly. See: 10.5281/zenodo.59790
Then imagine that same identifier is linked to every "usage" (name + reference pair) that we consider to be relevant, including heterotypic synonyms. This would enable a user to generate things like the current name and all synonyms, as well as go back and generate a snapshot of what the taxonomy was in, say, 1990. I think this is basically an aggregation of #6, and is close to the notion of a taxon concept being an "according to" statement.
WOW! FINALLY! Do you have any idea how long I've been waiting for someone else to write something like that? Seriously... THANK YOU!
One could imagine an interface (both web and API a bit like): ... /n/aus-fred-1909
Ugh. OK, well I can certainly imagine a service that takes those three parameters (epithet name, author, year) and finds how many matches there are. If only one match, it could function as an identifier and provide the relevant record. But based on content already in GNUB (202K Protonyms initially established as full species), about 7,000 (~3.4%) are non-unique across these three property values (original epithet orthography, authorship string, year). Granted, that's a small percentage -- but even 96.6% unique is pretty pathetic in the realm of "unique identifiers". (Fun fact: the author Malm described 24 different species with the name "linnei" in 1877; per ZooBank).
As I've pointed out many times, the amount of complexity needed to come up with an identifier for this sort of thing that is both human-friendly and unique vastly exceeds the complexity of having opaque identifiers (e.g., UUIDs) that are used by the computer for true identification, and then simply renders the results back to humans with a human-friendly label.
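The contrast can be sketched in a few lines, assuming protonym records with genus, epithet, author and year. The `Protonym` class and the sample records are hypothetical (the genera are invented and this is not GNUB or ZooBank data); the point is only that the name-derived key collides while the opaque id does not.

```python
import uuid
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Protonym:
    genus: str
    epithet: str
    author: str
    year: int
    # Opaque identifier used by the machine; it encodes nothing about the name.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def key(self) -> Tuple[str, str, int]:
        """Human-friendly key of the kind discussed above."""
        return (self.epithet.lower(), self.author, self.year)

    def label(self) -> str:
        """Label rendered for human eyes; never used as a key."""
        return f"{self.genus} {self.epithet} {self.author} {self.year}"


def ambiguous_share(records: List[Protonym]) -> float:
    """Fraction of records whose (epithet, author, year) key is not unique."""
    if not records:
        return 0.0
    counts = Counter(r.key() for r in records)
    return sum(1 for r in records if counts[r.key()] > 1) / len(records)


records = [
    Protonym("Aus", "linnei", "Malm", 1877),
    Protonym("Bus", "linnei", "Malm", 1877),
    Protonym("Cus", "bus", "Fred", 1909),
]
print(ambiguous_share(records), len({r.id for r in records}))
```

Two of the three sample records share a key, yet all three have distinct ids, and `label()` still gives people something readable.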
But that aside, yes -- we've already built and tested services of the sort you described. But the funding ran out before we were in a position to turn them into accessible APIs. That circumstance is changing (rapidly), so we may get these APIs up and running after all. Watch this space.
Everything else (actual "content" of each taxon, implications for characters of taxa, etc.) are all things one could compute from the classification if you wanted, but I think these are really separate things.
I absolutely, 100% agree!
If, for example, the identifiers were DOIs, clean and human readable
I know you love human-friendly identifiers, and I get that. But life is SO much easier if you have computer-friendly identifiers, then represent them via human-friendly labels whenever human eyeballs are in play. DOIs are WONDERFUL because of the rich dereferencing/resolution services. But they suffer the same fate as PURLs and other similar sorts of identifiers in that they conflate identification with dereferencing/resolution mechanisms. The best of all worlds can be achieved when you mint UUIDs as identifiers, then wrap them in a DOI prefix (making them dereferenceable/resolvable), and then create a standard format for constructing a human-friendly label. The PLAZI/Zenodo team almost gets it right, in that they issue UUIDs to Usages (=Treatments), then Zenodo mints DOIs for them. Unfortunately, Zenodo doesn't embed the UUID within the DOI, so we have yet another identifier to track. For example: http://treatment.plazi.org/id/03EA878F-FF95-FFA5-4F81-1B00FB0E6CA9 sameAs http://doi.org/10.5281/zenodo.3806768
Sigh....so close....
@mdoering :
There are 5 concepts in those 6 usages in the diagram which I would really like to attach 5 different ids to.
I believe I recognize the handwriting/chicken-scratch in the whiteboard diagram as my own (and I certainly remember the animated discussion). The problem is not that we don't have enough identifiers. The problem is that we have too many. All of the boxes on that diagram get a separate TNU identifier. The problem is that there are dozens/hundreds/thousands of potentially relevant other TNUs for congruent circumscriptions, each with their own identifiers, and it's not clear which one to use. For example, there may be many publications that all represent the same set of heterotypic protonyms as shown in box 5 of the whiteboard diagram. Which one becomes the "identifier" for that box? The chronologically first to synonymize A. xus under A. bus? The one who provided the most robust taxonomic treatment? What people seem to want is a single identifier for "Box 5", which may serve as hub for dozens of individual TNUs all asserting the same/congruent taxon. This is where "TNU as surrogate for concept/circumscription" gets messy, and requires a third party to "elevate" one of the TNUs as the surrogate/proxy for the "box". But in any case, the problem isn't a lack of identifiers to use for concepts. The problem is an overabundance of them.
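The overabundance can be shown with a small sketch, assuming each TNU is reduced to an id, a year and the set of protonyms it includes. The TNU ids and years are invented, and picking the earliest member as the representative is just one of the options raised above, not a recommendation.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

TNU = Tuple[str, int, Set[str]]  # (tnu id, year, included protonyms)


def group_by_protonym_set(usages: Iterable[TNU]) -> Dict[FrozenSet[str], List[str]]:
    """Group TNUs that include the same protonyms, ordered by year."""
    hubs: Dict[FrozenSet[str], List[Tuple[int, str]]] = defaultdict(list)
    for tnu_id, year, protonyms in usages:
        hubs[frozenset(protonyms)].append((year, tnu_id))
    return {key: [tnu for _, tnu in sorted(members)] for key, members in hubs.items()}


usages = [
    ("tnu-3", 1995, {"bus", "xus"}),
    ("tnu-1", 1960, {"bus"}),
    ("tnu-2", 1980, {"xus", "bus"}),  # first to sink A. xus under A. bus
]
hubs = group_by_protonym_set(usages)
box = hubs[frozenset({"bus", "xus"})]
print(box, "earliest:", box[0])
```

Both `tnu-2` and `tnu-3` land in the same group, so some rule (or third party) still has to decide which one stands for the box.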
Also, the main problem for me is: how do I know, when looking at the data, that the concept has changed? We are ignoring identifiers completely here, as we want to (re)assign them in the process. What warrants a taxon change, and when does it remain the same? We need to find objective rules for what the algorithm has to do. That would also give real meaning to CoL ids, as they would be based on objective rules.
Here is where I think we keep getting hung up. In order for "a concept" to "change", we need to come to some agreement as to what "a concept" is. How can you know whether it has "changed" if you don't even agree on what it is? Walter Berendsohn used the term "Potential Taxon", for what I called "Assertion", and which we now refer to as [Taxonomic Name] Usages. Every TNU represents a potentially different taxon (concept/circumscription). But depending on how one defines "taxon" (i.e., my #7, which both @rdmpage and I have decided is not tractable - at least not at this time), different people would use different mappings of which individual TNU instances map to which individual "taxa". So to say that "a concept" has "changed", we first need a definition for what "a concept" is, and even after we achieve that, it's often the case that insufficient information exists (within the publications, within our databases) to even know if the concept has changed. In theory, this would be wonderful. In practice, it's going to be a while before it can be meaningfully implemented. I think @nfranz understands this realm far better than anyone else, so I would defer to him on that point -- but the sort of stuff he has done explores the potential/power/limitations of this space. Personally, I find it both exciting and scary at the same time.
Reading further down the thread, I think @rdmpage nailed it with:
I think "when does it change and when is it the same?" leads to madness.
He also nailed it with this:
Every usage gets an identifier, so the question to ask is not "is this the same as that?" it's what usage (if any) do I want to refer to?
+1 (is it possible to add a "+5"?)
@mjy:
Link your biological data to OTUs (anonymous entities linked to nomenclature)
I would say "Link your biological data to TNUs" (each of which represents an explicitly defined or implicit OTU). Are we saying essentially the same thing? The nice thing about doing it through TNUs is that's often how it happens in the real world. Someone has an organism in-hand (biological data), and assigns it to a name by referring to some (usually published) definition of the name (field guide, key, etc.). The exceptions are the expert taxonomists who just "know" what species it is. But in such cases, they simply need to point to a TNU that represents the taxon in the same way they "know" it to be.
Stack citations, however many you want on any concept (e.g. OTU, Protonym, Franz graph relationship, relationship between OTUs, relationship between Protonyms, etc.). This is your timestamp proxy.
OK, so maybe we're not the same. I've recently had very long discussions with Kevin Thiele about exactly this issue (we even refer to it as "stacks" of TNUs aligned on a single "concept"/"circumscription" instance). But see my comment to @mdoering above: coming up with a shared definition for what these name-less taxon entities are, is the real barrier.
Flesh-and-blood-and-cellulose-and-cytoplasm Organisms exist in nature. Taxa do not. Taxa exist in the minds of humans. Humans communicate information about taxa (and the mappings between their imagined taxa and actual organisms) via text-string names usually embedded within publications (or other references). The text-string names are usually what get indexed in databases. But the name-in-context (e.g., "Aus bus Linnaeus 1758 sensu Pyle 2020"; AKA a TNU) is the most effective and practical way to reference the interface between names, organisms and OTUs/taxa.
For what it's worth we have 100s of thousands of taxon names, OTUs, specimens, citations, and identifiers following this approach in TaxonWorks, i.e. it's not an imagined approach.
Substitute "GNUB" for "TaxonWorks", and I can make exactly the same assertion (and more than just specimens -- in fact, most of the organism occurrence instances are observations).
Back to @mdoering:
Usages are much more friendly to work with if you think of publications. Then they are immutable and you have a limited number of them per name. But for dynamic databases it becomes a different problem.
Yes, which is why I separate out the static TNUs from the dynamic Meta-Authority assertions. See, again, this publication, page 34, starting with the heading "Accepted status".
The system has to decide when to create new identifiers. In this sense every monthly release of the CoL would be a different usage and has its own set of taxon ids.
Not necessarily. Even if you can't stomach the Meta-Authority approach (where a new identifier is needed only when a particular perspective changes), you can issue a new identifier only when a record changes in substance (different synonymy, different classification, change in circumscription, etc.; more detail below) from one month to the next. Effectively each month's cut becomes a change log. The cut can include the full dataset, but the identifiers only change when the relevant content changes. You still need to define what properties within CoL warrant a new identifier; but I would suggest that you only change the identifier when the classification changes (including placement of a species epithet in a different genus), or when the set of heterotypic synonyms changes. If you try to get more granular than that, I think you'll be on the path to madness that @rdmpage alluded to.
CRAP! I just got to the post from @rdmpage that includes:
Firstly, why change everything, why not simply add new ids for those things that have been updated? In other words, you're doing a diff and saying these three records changed, everything else is the same.
OK, replace all of my paragraph above with "+1" on that post from @rdmpage . I could have deleted it, but what the hell -- maybe it says the same thing in a slightly different way.
My goal is to start with the type specimen as the anchor for a taxon, but refine that for splits and merges by comparing adjacent taxa and their types.
Yes, this comes back to the conversation we had in the living room of @dremsen. Use heterotypic synonymy sets as your computable mapping to when a new identifier is needed (i.e., protonyms as proxies for type specimens). This is imperfect, of course, when you don't have heterotypic synonyms listed, or when you need to divine the relationship between an earlier treatment and a later treatment (in the diagram, Aus bus sec. 1960 to Aus bus sec. 1970). But honestly -- without a @nfranz-style analysis (which itself is still ultimately subjective), you can't ever know whether Aus bus sec. 1960 maps to Aus bus sec. 1970, or maps to [Aus bus sec. 1970 + Aus fus sec. 1970]. In other words, you can't know from the data we generally have at our easy disposal whether Aus bus sec. 1960 was "split" into Aus bus sec. 1970 + Aus fus sec. 1970, or whether Aus bus sec. 1960 is congruent to Aus bus sec. 1970. Someday, when the @nfranz approach has been fleshed out across all of taxonomy, then these sorts of questions will be computable. But until then, it's probably best not to go down that rabbit hole.
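Where both treatments do list their heterotypic synonyms, the comparison boils down to set operations on ProtonymIDs. Here is a minimal sketch of that idea, assuming each treatment is reduced to a set of protonym identifiers. The function name and the identifier strings are made up for illustration, and a missing synonymy is reported as "unknown" rather than guessed at.

```python
from typing import Optional, Set


def compare_treatments(earlier: Optional[Set[str]], later: Optional[Set[str]]) -> str:
    """Compare two treatments by their sets of ProtonymIDs (hypothetical helper)."""
    if not earlier or not later:
        return "unknown"            # no synonymy listed, so nothing to compute
    if earlier == later:
        return "congruent"          # same set of protonyms (types)
    if later < earlier:
        return "later is narrower"  # some protonyms were removed
    if later > earlier:
        return "later is broader"   # some protonyms were added
    if earlier & later:
        return "overlapping"
    return "disjoint"


# Toy identifiers, not real ProtonymIDs.
bus_1960 = {"protonym:bus", "protonym:xus"}
bus_1970 = {"protonym:bus"}
print(compare_treatments(bus_1960, bus_1970))  # later is narrower
print(compare_treatments(None, bus_1970))      # unknown
```

Note the limitation already described: if the 1960 treatment lists only "bus" and the 1970 treatment also lists only "bus", this reports "congruent" even if a new species was carved out of it in 1970.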
Ooops!! I just now read the next post:
Well, the base for the taxon is the set of types. [etc.]
I almost deleted the stuff I wrote above as redundant, but you can instead just treat it as a "+1".
@dremsen:
This is the direction I also favor.
I hope so! It was your living room, after all! :)
As to the set of posts related to this from @mjy:
Can anyone answer whether a taxon concept sensu the title of this issue has biological attributes?
@mdoering already answered exactly the same way I would, so I'll simply say +1 to his reply.
@rdmpage:
I'm still waving my arms around here (can you tell? ;) ) but I do wonder if part of the problem is seeing things as boxes rather than as timelines.
Another "WOW!" (+5). My arms are downright exhausted from years of waving in the exact same way. So, I already addressed this a bit above, but if those boxes represent specific/individual TNU instances, then I'm 100% onboard. If they represent abstract notions of name-independent taxa into which stacks of TNUs are folded, then I start to get a bit more dizzy. Again, I think the "set of heterotypic synonyms using protonym identifiers as proxies for type specimens" approach is (by far) the best path forward. Yes, some of the s.s. vs. s.l. distinctions will fall through the cracks, but those can be addressed later when we all catch up to @nfranz on this stuff. Whether we need to mint singular identifiers (of a different class) to represent sets of ProtonymIDs (vs. simply using the array of heterotypically synonymous ProtonymIDs as itself the mechanism for uniquely identifying the boxes) is, I think, an implementation question. I'd only advise exercising caution before dumping a new class of identifiers on the world, because you know it will be badly misunderstood and misused by the masses.
Back to @mdoering:
Yes, we will have different ids for a name and a usage.
If by this you mean "different classes of identifiers", PLEASE consider this carefully. I went through a LOT of painful mistakes when I did the same thing back in the 1990s; and when I saw the light that Protonyms are a subset of TNUs, the implementation side got MUCH easier.
Protonyms ARE TNUs; they're just a special subclass of TNUs. They have the same properties as TNUs. In 99% of cases, the Protonym is of the form "Aus bus Linnaeus 1758 sec. Linnaeus 1758" (there are exceptions, but mostly confined to old names that were first established in a non-Code-compliant way, then made available later -- this is something that should remain within the realm of nomenclators).
If you start minting different identifiers for the "Protonym" of Aus bus Linnaeus 1758, separate from the "Usage" Aus bus Linnaeus 1758 sec. Linnaeus 1758, you will almost certainly regret it. At first glance it seems like the same identifier means different things depending on whether you're referring to the Protonym of the name "bus", or the taxon concept asserted by Linnaeus in Aus bus Linnaeus 1758 sec. Linnaeus 1758; but I promise that this distinction is just an illusion. It would require more text than I've already written above to explain why this is so. But I can share some of the LONG emails I had with Kevin Thiele, if you want.
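One way to picture "Protonyms are a subclass of TNUs" is a single record type in which the protonym is simply the usage that points to itself. This is only a sketch under that assumption, not a description of how GNUB or CoL is actually implemented; the class, field names and ids are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TNU:
    """A taxon name usage: a name as treated in one reference (hypothetical model)."""
    tnu_id: str
    name: str         # e.g. "Aus bus Linnaeus 1758"
    reference: str    # the "sec." part, e.g. "Linnaeus 1758"
    protonym_id: str  # tnu_id of the usage that first established the name

    @property
    def is_protonym(self) -> bool:
        # A protonym is just the usage whose protonym_id is its own id.
        return self.protonym_id == self.tnu_id


original = TNU("tnu-1", "Aus bus Linnaeus 1758", "Linnaeus 1758", "tnu-1")
later = TNU("tnu-2", "Aus bus Linnaeus 1758", "Pyle 2020", "tnu-1")
print(original.is_protonym, later.is_protonym)  # True False
```

With this shape there is one class of identifier: "the protonym of bus" and "Aus bus sec. Linnaeus 1758" are the same record.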
That raises again the question which properties exactly belong to a usage.
Just to continue and expand from what I already wrote above, I have been using these four properties to represent a "change" in an objectively identifiable way:
1) Classification (i.e., immediate hierarchical parent; not the full hierarchy to the top)
2) Set of ProtonymIDs representing heterotypic synonyms
3) Rank (e.g., full species vs. subspecies) -- this is essentially redundant to #1, but not always (e.g., when you go from Aus bus subsp. cus to Aus bus var. cus)
4) Orthography (exact literal UTF-8 representation of the epithet only; not the combination)
You could also add:
5) Reference/TNU used as an anchor point/basis -- such as when a new publication comes along that doesn't change any of the four properties above, but provides a much more robust diagnosis/etc. and thus represents a "meatier" foundation. But for computational purposes, this doesn't really add anything. For end-users, it might (and that would also bring it a step closer to the Meta-Authority model).
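The first four properties can be bundled into a comparable "signature" per record, so that a release only needs a new identifier where the signature differs. A minimal sketch, assuming those four properties are available for every record; the class and function names are made up, and property 5 is deliberately left out.

```python
from dataclasses import dataclass, fields
from typing import FrozenSet, List


@dataclass(frozen=True)
class Signature:
    parent: str                   # 1) immediate hierarchical parent only
    protonym_ids: FrozenSet[str]  # 2) heterotypic synonymy as ProtonymIDs
    rank: str                     # 3) e.g. "subspecies" vs. "variety"
    epithet: str                  # 4) exact spelling of the epithet only


def needs_new_id(previous: Signature, current: Signature) -> bool:
    """True when any of the four tracked properties differs."""
    return previous != current


def changed_properties(previous: Signature, current: Signature) -> List[str]:
    """Names of the properties that differ, in declaration order."""
    return [f.name for f in fields(Signature)
            if getattr(previous, f.name) != getattr(current, f.name)]


# Aus bus subsp. cus -> Aus bus var. cus: same parent, different rank.
before = Signature("Aus bus", frozenset({"p-cus"}), "subspecies", "cus")
after = Signature("Aus bus", frozenset({"p-cus"}), "variety", "cus")
print(needs_new_id(before, after), changed_properties(before, after))  # True ['rank']
```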
On the whole "versioning" thing, I think the immediate/important questions most people want to answer are:
1) What is the status right now from the perspective of my favorite/trusted Meta-Authority (e.g., CoL)?
2) What are the various perspectives in the literature for a given Protonym over its past history (including the alternative "current" treatments/views that differ from my favorite/trusted Meta-Authority)?
I think most people are a lot less concerned with "What is the history of how my favorite/trusted Meta-Authority has changed its views over time?" Sure, that information should be tracked, and is interesting in some contexts, but it seems more of an implementation thing. The "versioning" approach is one way to do it, but that requires new identifiers. The way GNUB handles it is with a robust audit trail (literally every change of every field in every record is logged with a timestamp and responsible party, so there is no "version" per se, just a timestamped change log for each record).
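For anyone who hasn't worked with an audit trail of this kind, the general pattern looks roughly like the sketch below. It is a generic illustration of field-level change logging, not GNUB's actual schema; all class, method and field names are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass(frozen=True)
class Change:
    record_id: str
    field: str
    old: Any
    new: Any
    changed_by: str
    changed_at: datetime


class AuditedStore:
    """Keeps only the current state of each record plus a log of every field change."""

    def __init__(self) -> None:
        self.records: Dict[str, Dict[str, Any]] = {}
        self.log: List[Change] = []

    def update_field(self, record_id: str, field: str, value: Any, changed_by: str) -> None:
        record = self.records.setdefault(record_id, {})
        old = record.get(field)
        if old == value:
            return  # nothing changed, so nothing is logged
        record[field] = value
        self.log.append(
            Change(record_id, field, old, value, changed_by, datetime.now(timezone.utc))
        )

    def history(self, record_id: str) -> List[Change]:
        return [c for c in self.log if c.record_id == record_id]


store = AuditedStore()
store.update_field("tnu-1", "parent", "Aus", "editor-1")
store.update_field("tnu-1", "parent", "Cus", "editor-2")
print(len(store.history("tnu-1")))  # 2
```

The record id never changes here; the "versions" are whatever you can reconstruct from the log.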
@mjy :
You have been tasked to do the impossible by the CoL.
In some senses I agree, but there is a really, really, really simple thing that CoL can at least encourage GSDs to do, and implement itself when the content exists (e.g., content through WoRMS and other robust GSDs), which is simply to track one more piece of information for each record: "Reference we follow in making our assertion about current status". In other words, the bit after the "sensu". If you can just get that much information, it would be a quantum leap in the utility of the data CoL provides. And even if only a minority of content providers can offer this information, you can always skip that step with a place-holder "sensu somebody but we're not sure who" approach, so at least the operational data model is functioning at the TNU level, not just the Protonym (or vague "name") level.
A big "+1" on all the rest of what was included in this post from @mjy (as well as several "+3"s and "+5"s!)
Also, LOTS of "+1"s, "+3"s and "+5"s (especially "Never, ever, ever embed information in the identifier...") in your follow-up pseudo-blog post.
As a matter of policy, encourage, slowly, but ultimately more forcefully, GSD providers to provide OTU ids.
I'm not sure it's the same, but I've been pushing hard (including above) for CoL to get the GSDs to provide a reference anchor point for each asserted "current status". We should move beyond the approach of "sensu GSD Year", and move towards "sensu Publication". Most GSDs are not practicing actual taxonomy within their databases; rather their databases usually serve as value-added indexes of what's happening in the literature.
that those GSD OTU ids be WikiData Q numbers.
Meh... I'm not sure that's the right choice. But I may be an outlier in that.
What do you mean by type? Specimens? Type specimens don't define biological concepts,
Individual type specimens don't, but sets of types (as proxied through ProtonymIDs expressed as a heterotypic synonymy) most certainly do! I was at a meeting held at Smithsonian back in the 1990s, where this basic topic was discussed in the context of FGDC Metadata Standards (of all things). Walter Berendsohn, Stan Blum, Bob Peet and a few of the other early workers in this space were there. I outlined different levels of granularity with which one could define the boundaries of a taxon concept/circumscription:
The last of these is obviously the least granular, and some might argue (therefore) the least useful. But in the 2+ decades since then, it has become more and more obvious to me that defining taxon circumscription boundaries through sets of type specimens (proxied by ProtonymIDs, as included in an asserted heterotypic synonymy) is the best path forward. As my wife once said, "It's better to be vaguely correct than precisely wrong". Sets of heterotypic synonyms (as proxies for their corresponding type specimens), while vague, are almost purely objective in nature, and as such are in the realm of "facts" (I strongly support the point by @mjy about assembling and growing sets of objective facts). Also, one can never enumerate, extrinsically, all of the individual organisms (recently dead, still alive, and yet to be born); so there is always an implied non-explicitly-enumerated set of organisms that should be included within the circumscription. I've also never been a fan of the character-based approach, because you always get the odd mutant individual that happens to lack some key diagnostic character which, technically, would fall outside the circumscription (even if both its parents fell within).
Even if no heterotypic synonyms are provided, you can still infer the scope of the circumscription as inclusive of all organisms up to but not including the most recent common ancestor of the nearest relative/protonym/type specimen that I regard as not within the circumscription (i.e., the other related taxa recognized as valid). For those of us who are OK with paraphyletic taxa, it's a little more complex (but not much).
Anyway, this same basic idea was fleshed out in even more detail with @mdoering and @dremsen in the latter's living room (same gathering that produced the whiteboard image posted at the top of this thread). We were close then, and we're still close now. I keep participating in these conversations (as well as the ones happening in parallel in the tdwg/tnc group, and elsewhere), because I keep hoping that maybe "this time" we'll actually have a breakthrough and reach consensus. I had almost given up all hope, but I have to say that both this thread, and the direction happening over at tdwg/tnc, has boosted my optimism that maybe -- maybe -- we're getting close to consensus on some of this stuff!
Phew, that diatribe took me from breakfast all the way to lunch! Again, sorry for the long post, but there was a lot to cover from what y'all wrote while I slept.
P.S. If I didn't quote/comment on the above, then you can pretty safely assume that I'm a "+1" on the rest of the comments in this thread.
Lots to think about here, and I've some reading to do. As a side note I wanted to comment on identifiers. There are bigger hills to die on, and I know I was just begging to be slapped for bringing up uninomials as identifiers - see also comments on Taxonomic concepts: a possible way forward, - but a few thoughts (and I don't want to derail broader discussion, feel free to completely ignore this).
I guess I'm arguing that it is easy to be dogmatic and say that:
but I think things are more nuanced than that.
Anyway, back to reading the stuff that matters...
Let me just remind everyone that this issue is about what defines a taxon concept in the CoL. The definition of a unique taxon concept in the CoL defines what algorithm we need to compute equality and thus stable ids.
The problem is not that we don't have enough identifiers. The problem is that we have too many. All of the boxes on that diagram get a separate TNU identifier. The problem is that there are dozens/hundreds/thousands of potentially relevant other TNUs for congruent circumscriptions, each with their own identifiers, and it's not clear which one to use. For example, there may be many publications that all represent the same set of heterotypic protonyms as shown in box 5 of the whiteboard diagram. Which one becomes the "identifier" for that box? The chronologically first to synonymize A. xus under A. bus? The one who provided the most robust taxonomic treatment? What people seem to want is a single identifier for "Box 5", which may serve as hub for dozens of individual TNUs all asserting the same/congruent taxon. This is where "TNU as surrogate for concept/circumscription" gets messy, and requires a third party to "elevate" one of the TNUs as the surrogate/proxy for the "box". But in any case, the problem isn't a lack of identifiers to use for concepts. The problem is an overabundance of them.
Surely there are many ids and even more usages out there. But that is not what the CoL is about. Our main problem is comparing previous editions of the CoL with the latest version about to be released and assessing under which id to release each taxon.
The other use case is the Clearinghouse, where we keep many external "checklist" datasets that can act as a source for the CoL, but don't have to. These lists (mostly taxonomic trees) come with their own usage ids and we retain them (in contrast to GBIF ChecklistBank where new integer ids are issued). In order to navigate across datasets we have a names index that lets us find the same name across datasets, even, for example, if the authorship was spelled slightly differently. Similarly we want to establish a taxon concept index that can be used to find equal concepts across datasets without requiring them to use the same accepted name. I am well aware there are many definitions for both a unique name and taxon concept. For very valid reasons. But for our implementation we need to select one definition that can be used to set up the names and concept index.
As said before, as a starter we will probably try to use the set of protonyms to build the taxon concept index. We are not trying to perfectly model the world of taxonomy and publications. We need something workable in a reasonable amount of time.
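To make the starting point concrete, here is a rough sketch of a concept index keyed on the set of protonyms. It assumes every usage already carries its resolved protonym ids; the dataset keys, usage ids and function name are invented for the example and say nothing about how the real index will be built.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple

# (dataset key, usage id, protonym ids in the synonymy of that usage)
Usage = Tuple[str, str, FrozenSet[str]]


def build_concept_index(
    usages: Iterable[Usage],
) -> Dict[FrozenSet[str], List[Tuple[str, str]]]:
    """Group usages from any dataset that share exactly the same protonym set."""
    index: Dict[FrozenSet[str], List[Tuple[str, str]]] = defaultdict(list)
    for dataset, usage_id, protonyms in usages:
        index[protonyms].append((dataset, usage_id))
    return dict(index)


usages = [
    ("dataset-A", "a1", frozenset({"p-bus", "p-xus"})),
    ("dataset-B", "b7", frozenset({"p-xus", "p-bus"})),  # same set, other accepted name
    ("dataset-B", "b8", frozenset({"p-bus"})),
]
index = build_concept_index(usages)
print(index[frozenset({"p-bus", "p-xus"})])  # [('dataset-A', 'a1'), ('dataset-B', 'b7')]
```

The lookup does not depend on which name each dataset treats as accepted, only on which protonyms it lumps together.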
As for the style of identifiers we want to use see https://github.com/CatalogueOfLife/backend/issues/491
every usage gets an identifier, so the question to ask is not "is this the same as that?" it's what usage (if any) do I want to refer to?
+1 (is it possible to add a "+5"?)
In that case @rdmpage should be happy about CoL and ALA issuing new identifiers all the time in every release. But most people including Rod seem to hate that.
Yes, we will have different ids for a name and a usage.
If by this you mean "different classes of identifiers", PLEASE consider this carefully. I went through a LOT of painful mistakes when I did the same thing back in the 1990s; and when I saw the light that Protonyms are a subset of TNUs, the implementation side got MUCH easier.
That is one thing I would like to rollback if I could start again. Separating names and usages seems more of an idealistic thing. So far I do not see any benefits over just having NameUsage instances that have joined properties. And the implementation got way more complex with having names and usages separated.
@mdoering
every usage gets an identifier, so the question to ask is not "is this the same as that?" it's what usage (if any) do I want to refer to? +1 (is it possible to add a "+5"?)
In that case @rdmpage should be happy about CoL and ALA issuing new identifiers all the time in every release. But most people including Rod seem to hate that.
Because I think the identifier most people will want is a set of usages, not any particular one. A bit like this thread, I can point to an individual comment https://github.com/CatalogueOfLife/general/issues/6#issuecomment-678767729 or the whole thread https://github.com/CatalogueOfLife/general/issues/6. My view is that in most cases, the whole thread ("taxon") is what people will refer to, they'll refer to a comment ("usage") if they feel the need for that level of specificity.
I think this is why people like to link to names, they have enough specificity (that name) and yet enough slop (all mentions of that name). I think ideally taxon ids would have similar attributes, perhaps with more resilience as they needn't change with changes in name. Otherwise there is limited incentive to link to them (a lot of the work I did in 2018 to link to ALA is now broken because ALA doesn't value identifier stability as much as I do).
@mdoering
Let me just remind everyone that this issue is about what defines a taxon concept in the CoL. The definition of a unique taxon concept in the CoL defines what algorithm we need to compute equality and thus stable ids.
OK, we've had our fun now. Apologies for hijacking this thread.
Regarding the specific issue you ask about, can I suggest framing it slightly differently? Presumably you have a classification already (CoL-now). Based on aggregating the data, you have a new classification (CoL-future) that you want to release. You want to assign identifiers to taxa in that new release.
For example, currently you have for Opisthotropis balteata (Cope, 1895) the id http://www.catalogueoflife.org/col/details/species/id/e5b7c4081a35d451a9c187e327793765 based on the Reptile database for 2015-12-15. When you ingest the latest Reptile checklist you'll find this is now in the genus Trimerodytes. I would retain the current identifier e5b7c4081a35d451a9c187e327793765 despite the name change - it's moved genera, but in some sense is still the same thing (for various definitions of "same", other definitions are available). Likewise, in most cases like this I would NOT change the id for the genus even if it gains or loses species; as far as the edit script is concerned those nodes don't change.
So, in practical terms, I would do a tree diff between the two classifications to find the minimum number of edits required to convert one tree into another (deletes, inserts, moves). Inserts are easy, that's a new taxon, that's a new id. Moves are typically species from one genus to another, I would retain the same id. Deletes are easy, they no longer exist (kidding). Deletes are likely to be newly synonymised names, but I think a way to handle that is to have the synonym as a child of the accepted name (I think you've done this before when I talked about tree edits a while back).
Now I know that most of this doesn't match the "taxon concept relationship" discussion about how much does something change before it's considered new, but I think most of that is intractable (hence this thread). But I think arriving at a release where the minimum possible number of identifiers change is going to be welcomed by those who link to CoL. The tree diff approach would also enable you to explicitly generate a list of changes (i.e., release notes). In a way by framing it as an information management question (what is the minimum number of operations to convert one tree into another) you can side-step the biological arguments - thus pissing off everyone equally ;)
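To show the shape of the output, here is a deliberately naive sketch. It assumes nodes in the two releases have already been matched by some stable key (for example the protonym), which is the hard part, and it simply compares parents; it is not a true minimum edit script algorithm. The tree encoding and function name are made up.

```python
from typing import Dict, List, Optional, Tuple

Tree = Dict[str, Optional[str]]  # node key -> parent key (None for the root)


def diff_trees(old: Tree, new: Tree) -> List[Tuple[str, str]]:
    """List inserts, deletes and moves between two releases (naive comparison)."""
    edits: List[Tuple[str, str]] = []
    for node in sorted(new.keys() - old.keys()):
        edits.append(("insert", node))  # new taxon: mint a new id
    for node in sorted(old.keys() - new.keys()):
        edits.append(("delete", node))  # gone, e.g. newly synonymised
    for node in sorted(old.keys() & new.keys()):
        if old[node] != new[node]:
            edits.append(("move", node))  # same id, different parent
    return edits


# Toy classification: species "bus" moves from genus Aus to genus Cus.
col_now = {"root": None, "Aus": "root", "Cus": "root", "bus": "Aus", "dus": "Aus"}
col_future = {"root": None, "Aus": "root", "Cus": "root", "bus": "Cus", "eus": "Cus"}
print(diff_trees(col_now, col_future))
# [('insert', 'eus'), ('delete', 'dus'), ('move', 'bus')]
```

Only "eus" needs a new id; the genera Aus and Cus keep theirs even though their contents changed, and the edit list doubles as release notes.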
Hope that is more on topic.
Thanks @rdmpage, that is indeed what I am looking for. A tree comparison is rather difficult on that scale, but let's try that out.
The requirements for a solution are:
Solutions that come to my mind:
1) name based ids - the baseline. This is what we will start out with this year
2) protonym based - stick the id to the protonym and use it for its currently accepted name. This seems to be the same as @rdmpage describes in the tree diff. It requires knowing the basionym, see below
3) protonym set based on analysing the entire synonymy - requires knowing the basionym
4) name with direct parent taxon or even the entire classification. This leads to less stable ids than the name alone. But maybe it is important for users to have a different id if the classification has changed?
As CoL traditionally has not asked for the basionym of a name, it will take a while until we get that information for the majority of names. It is unlikely we will know it ever for all names. But we can augment the GSD information with nomenclators or even other datasets? It is also often rather obvious from the authorship and can be (provisionally) inferred in a large number of cases.
@mdoering Makes sense to me. Getting basionyms will be a hurdle in some cases, but often guessable from the names (as you've been doing for the GBIF taxonomy), and some databases (IPNI and IndexFungorum) explicitly link to basionyms.
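As a rough illustration of "guessable from the names": in zoological-style authorship, parentheses around the author signal that the species was originally described in a different genus, so epithet + author + year gives a provisional key for the basionym. The sketch below assumes simple binomials of that form; the function and type names are hypothetical and botanical "(Author) Author" strings are not handled.

```python
import re
from typing import NamedTuple, Optional


class BasionymKey(NamedTuple):
    epithet: str
    author: str
    year: Optional[str]
    recombined: bool  # True if the authorship was in parentheses


def guess_basionym_key(name: str) -> Optional[BasionymKey]:
    """Provisional basionym key from a 'Genus epithet Authorship' string."""
    parts = name.strip().split(maxsplit=2)
    if len(parts) < 3:
        return None  # no authorship given, nothing to infer
    _genus, epithet, authorship = parts
    recombined = authorship.startswith("(") and authorship.endswith(")")
    if recombined:
        authorship = authorship[1:-1]
    match = re.search(r"(\d{4})$", authorship)
    year = match.group(1) if match else None
    author = authorship[: match.start()] if match else authorship
    return BasionymKey(epithet, author.strip(" ,"), year, recombined)


print(guess_basionym_key("Opisthotropis balteata (Cope, 1895)"))
# BasionymKey(epithet='balteata', author='Cope', year='1895', recombined=True)
```

Two accepted names that yield the same key (say, after a move between genera) are candidates for sharing a protonym, pending confirmation from a nomenclator.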
If I understand the tree diff approach correctly, then really the only new ids would come from adding nodes (taxa). Moving nodes doesn't change ids, only their relationships change. This makes life simple, but is unlikely to please those who regard taxa as defined, for example, by extension (set of descendants). Perhaps a solution is to store the edits made, so that you can retrieve each node affected by an edit (e.g., a species moving from one genus to another is a deletion from one genus and an addition to another). People could then subscribe to that series of edits and update their own definitions of taxa accordingly.
But back to the topic, regarding scalability, I've not investigated the performance of the code I wrote with Gabriel Valiente forest, but I presume it would be straightforward to partition the CoL classification by major taxonomic group in a divide and conquer approach. Of course, there may also be other/better algorithms and/or tools available.
I am often wrong and never entirely right but will use a made-up story to illustrate the key points in my understanding of what should and should not count as properties of a taxon concept when minting and changing identifiers for them for the COL. My story involves three of us, variously fictionalized. It assumes Rich maintains the COL fish GSD and Markus and I are fish biologists of dubious reputation.
I caught a fish. It's a specimen.
Rich and Markus and I all assess my specimen.
Rich looks at it and says "I don't know how you got this snorkeling in Woods Hole but this is a specimen of Chromis abyssus." When Rich does this, and to paraphrase a past Rich Pyle, he is saying "The specimen you caught is conspecific to the type specimen that I collected and is thus congruent with the concept I had when I caught my specimen and formally described it as a new taxon according to the rules of the code." So my specimen, according to Rich, is an instance of that concept. The concept itself didn't change.
Markus looks at the specimen and sees it a bit differently. He insists Rich has misidentified my specimen and that it is actually a different species, Chromis margaritifer. I don't know why Markus thinks this. But Rich's concept of abyssus still has not changed.
Remsen says "yeah, but look at the tail! Rich said nothing about the tail having a spot" and insists that it's a new species. Rich says "Pfft, not sure that's a spot. I have seen them before." and his next revision of his GSD makes no mention of me and my delusions. My concept doesn't count. He does make a notation of Markus' observation when he updates his GSD.
Chromis abyssus, Pyle 2001 (accepted name)
Chromis margaritifer Fowler, 1946, sec Döring 2020, (chresonym) misidentification.
In doing so, he is saying that the fish Markus identified as margaritifer is really just another abyssus. It's not a synonym because the specimen was not a type. So it's a misidentification, according to Rich, and the citation is a so-called chresonym.
But I'm not done. I do some research, some DNA barcoding, and make a bunch of fancy drawings. I write it all up. I put my specimen (holotype) in a jar and publish my paper in the journal, Calodema, carefully following the rules of the Code. According to those rules, Remsen's concept has now entered the realm of taxonomy and the taxon "Chromis hawkeswoodii" becomes a real (short-lived) species.
During his next revision, Pyle's annotated checklist, published through Aphia, begrudgingly contains some new entries.
Chromis abyssus, Pyle 2001
Chromis hawkeswoodii, Remsen 2020 (heterotypic synonym)
Chromis margaritifer Fowler, 1946, sec Döring 2020, (chresonym) misidentification. #this is removed by the COL because it is not a 'real' synonym and should not be used to improve recall in search.
Pyle's concept of abyssus hasn't really changed. Pyle would assert that the holotype of hawkeswoodii is conspecific to abyssus. As we saw, he did that originally when my fish was just a specimen, prior to me doing all this work and turning it into a holotype. But since I went to the effort to enter my concept and associated specimen into the pantheon of taxonomy by following the nomenclatural rules, his concept has changed. It includes the original 'protonym:' a nomenclatural term which is really a proxy for Rich's novel concept. It also now contains Remsen's concept of hawkeswoodii and its associated protonym and holotype. Protonym count went from one to two. Concept changed.
This is essentially how I interpreted the litany of taxonomic publications I reviewed when trying to develop an inclusive taxonomic model with computable concepts. I'm not saying it's right. But I will say it was useful for:
@rdmpage we will always keep the history, so you can use the taxon id and go back in time to see what it looked like in the CoL in a specific edition. So you can get the entire history for a concept as it appeared in the CoL. That allows people to link to just the id, which takes them to the most recent version of it. Or they link to a specific edition of the CoL, for which results will be immutable. I think that should give users enough freedom to select the kind of id they need for their purpose.
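In code terms the two ways of linking could look like the sketch below. It assumes edition labels that sort chronologically (e.g. "2020-08"); the class, method names, ids and records are illustrative only and not the actual CoL API.

```python
from typing import Dict, Optional


class TaxonHistory:
    """Stores an immutable snapshot of each taxon per edition (hypothetical)."""

    def __init__(self) -> None:
        # taxon id -> edition label -> record as released in that edition
        self._editions: Dict[str, Dict[str, dict]] = {}

    def publish(self, taxon_id: str, edition: str, record: dict) -> None:
        self._editions.setdefault(taxon_id, {})[edition] = dict(record)

    def resolve(self, taxon_id: str, edition: Optional[str] = None) -> Optional[dict]:
        versions = self._editions.get(taxon_id)
        if not versions:
            return None
        if edition is None:
            edition = max(versions)  # bare id: most recent edition
        return versions.get(edition)


history = TaxonHistory()
history.publish("t-1", "2020-08", {"name": "Aus bus"})
history.publish("t-1", "2020-09", {"name": "Cus bus"})
print(history.resolve("t-1"))             # {'name': 'Cus bus'}
print(history.resolve("t-1", "2020-08"))  # {'name': 'Aus bus'}
```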
@rdmpage :
By "hierarchical" identifiers I had in mind the notion of URLs as API, that is, how would someone query the data, and couldn't those queries be expressed as URLs that also serve as identifiers? This leads to a clean interface that gives people the answers they are looking for, and a way to automatically cite the identifier for that information.
Yes, I could definitely get on board with that. I guess whenever I see the word "identifier", I immediately jump to a notion that places most emphasis on "globally unique". Among the things I like about databases are precision and a lack of ambiguity. Part of my infatuation with UUIDs is that when I throw something like 8bdc0735-fea4-4298-83fa-d04f67c3fbec into a resolver engine (Google, ZooBank), there is no ambiguity on a global scale exactly what I'm interested in. Another part is opacity, along the lines of the point made earlier by @mjy.
However, more in line with your point, I agree with you that URLs as API also function as identifiers of sorts. For example, when I emulated your proposed identifier system in ZooBank:
http://zoobank.org/Search?search_term=abyssus+Pyle+Earle+Greene+2008
Sure enough, I got only one result. In fact, the same is true when I limited it to only the first author: http://zoobank.org/Search?search_term=abyssus+Pyle+2008 [Incidentally, I checked for uniqueness in GNUB using only the first author name, instead of all author names, and I ended up with a nearly identical result of 96.6% uniques; so first author is just as good for this purpose as all authors.]
With a little bit of alteration to the website code, I could make ZooBank follow the "I'm Feeling Lucky" principle and go directly to the record if there is only one result. I could also tweak the code to eliminate the explicit (and unnecessary) "Search?search_term=" bit, so the URL could just be zoobank.org/abyssus+Pyle+2008. [NOTE: I stripped the http prefix on non-functional URLs, so GitHub wouldn't create hyperlinks out of them.]
In that sense, the identifier "zoobank.org/abyssus+Pyle+2008" would indeed be functionally equivalent to http://zoobank.org/8bdc0735-fea4-4298-83fa-d04f67c3fbec. I don't think I would go so far as to index "[abyssus+Pyle+2008] sameAs [8bdc0735-fea4-4298-83fa-d04f67c3fbec]" in bioguid.org; but that doesn't mean your point about URL-APIs as human-friendly identifiers that work 96.6% of the time isn't useful. And sure, I could relax my own idea of the word "identifier" to even think of this as an identifier.
As for "hierarchical", I'm not entirely sure I understand what you mean in that sense, but perhaps what you mean is that instead of "abyssus+Pyle+2008", you could start with just "abyssus" (as in, "zoobank.org/abyssus"). In ZooBank, you'd get four results:
So then you'd need to go to the next level, with something like: zoobank.org/abyssus/Pyle That would get you down to one result, and a likely winner.
So, having no idea what you meant by "hierarchical", I'm imagining my own version of a "hierarchical" API/Identifier system that starts with the first tier of only the epithet, which by itself would (remarkably) get you only one result about 75% of the time. In the 25% of cases where it's ambiguous from the epithet only, going to the next tier and adding the first author name only will get you a single result about 93% of the time. And, as already mentioned, adding the year expands that to 96.6% singletons. Just out of curiosity, using the year as the second tier (instead of author) yields almost exactly the same result as only the author (93% singletons).
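To make the tiered idea concrete, here is a minimal sketch of such a lookup. It is not how ZooBank works: the `NameRecord` type, the `resolve` and `lucky` functions and the slash-separated path format are all assumptions for illustration, and only the first record (abyssus, Pyle, 2008) comes from this thread. The other three are made-up placeholders so that the epithet alone is ambiguous.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameRecord:
    uuid: str
    epithet: str
    first_author: str
    year: int


# Only the first record comes from this thread; the others are made-up
# placeholders so that the epithet alone returns several candidates.
RECORDS = [
    NameRecord("8bdc0735-fea4-4298-83fa-d04f67c3fbec", "abyssus", "Pyle", 2008),
    NameRecord("placeholder-2", "abyssus", "AuthorB", 1901),
    NameRecord("placeholder-3", "abyssus", "AuthorC", 1955),
    NameRecord("placeholder-4", "abyssus", "AuthorD", 1987),
]


def resolve(path, records=RECORDS):
    """Narrow the candidates tier by tier: epithet, then first author, then year."""
    tiers = [p for p in path.strip("/").split("/") if p]
    getters = [
        lambda r: r.epithet.lower(),
        lambda r: r.first_author.lower(),
        lambda r: str(r.year),
    ]
    matches = list(records)
    for value, get in zip(tiers, getters):
        matches = [r for r in matches if get(r) == value.lower()]
    return matches


def lucky(path, records=RECORDS):
    """Return the single matching UUID, or None if the path is still ambiguous."""
    matches = resolve(path, records)
    return matches[0].uuid if len(matches) == 1 else None
```

With this toy data, `lucky("abyssus")` is ambiguous and returns nothing, while `lucky("abyssus/Pyle")` already lands on the one record, which is the "I'm Feeling Lucky" behaviour described above.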
OK, I'm rambling now, and so far have only responded to the first point of the first response to my post, and I see there's a lot more yet to read. And it's not even within the scope of this particular thread, as noted by @mdoering
So I'll stop now, as I need to get ready to go out for a dive with my son; but when I come back I'll read through all of the new posts, and will strive to come up with a MUCH more concise and coherent reply.
OK, I lied. One more reply before I go diving.
@mdoering :
Surely there are many ids and even more usages out there. But that is not what the CoL is about. Our main problem is comparing previous editions of the CoL with the latest version about to be released, and assessing which id to release the taxon under.
This is why I've been pushing so hard for CoL to move to a TNU model, rather than some sort of fuzzy "name" model. Like all Meta-Authorities (including all the GSDs that provide content to CoL), it should not be in the business of making statements along the lines of "Aus bus is a valid species" and "Aus xus is a synonym of Aus bus". Instead, it should be making statements along the lines of "We follow Jones 2019 for Aus bus". Because Jones 2019 treated Aus xus as a junior synonym of Aus bus, the synonymy is automatically inherited from the statement.
On a more technical level, here's how it should work: CoL (via GSDs) should anchor all names of valid species to Protonyms. You already have the content to do this, even if you don't have the full literature citation details of the original description. GNUB can provide the UUIDs to every Protonym in CoL -- I can accomplish that in a weekend or two. As long as the GSDs have their own unique identifier, they don't need to incorporate the Protonym UUIDs because CoL (or better yet, BioGUID.org) can maintain the cross-link index. If GSDs don't have persistent unique identifiers... well, then perhaps it's time to retire those GSDs from CoL (or focus on upgrading those GSDs).
So, CoL then becomes an index of all the world's Protonyms that represent valid species. This Index then needs to have only one other piece of information attached to each Protonym record: The TNU for the treatment that "gets it right" for this taxon.
Yes, I know that GSDs don't provide this information, and it's impractical to get them to do so anytime soon. But my point is that the ProtonymID + AcceptedTNUID model should be the defined endpoint for where CoL should be heading. It will never get there if you don't start exploring the actual mechanism to do so. I agree: it's not at all feasible to apply this to all names across all taxa (and all GSDs). But there is a non-trivial amount of content that it could be applied to. All fishes, for example. At the very least, you could explore this as a "Proof of Concept" approach embedded within a more generalized approach, for the subset of records where ProtonymID+AcceptedTNUID are available; while still maintaining the less effective method for recognizing changes based on the combination of [Immediate Parent]+[Heterotypic synonymy]+[Rank]+[Orthography] approach.
Ultimately, CoL should not be in the business of minting its own identifiers. Instead, it should be a broker of TNU identifiers, putting a "gold star" on selected TNUs that serve as surrogates/proxies for the "box", in which all other TNUs sharing identical [Immediate Parent]+[Heterotypic synonymy]+[Rank]+[Orthography] patterns are placed.
I know that's a long way into the future, but if that is defined as the end point now, it will make the road to get there all the smoother.
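As a rough sketch of the "box" idea: TNUs that share the same [Immediate Parent]+[Heterotypic synonymy]+[Rank]+[Orthography] pattern can be grouped under one key. The `TNU` tuple, its field names and the sample values below are hypothetical, and how the "gold star" TNU within a box is chosen is deliberately left open.

```python
from collections import defaultdict
from typing import NamedTuple


class TNU(NamedTuple):
    label: str                       # e.g. "Aus bus Smith 1950 sensu Jones 2010"
    parent: str                      # immediate parent
    heterotypic_synonyms: frozenset  # protonyms treated as heterotypic synonyms
    rank: str
    orthography: str


def signature(tnu):
    """The four properties that decide which box a TNU falls into."""
    return (tnu.parent, tnu.heterotypic_synonyms, tnu.rank, tnu.orthography)


def group_into_boxes(tnus):
    """Map each signature to the labels of all TNUs sharing it."""
    boxes = defaultdict(list)
    for tnu in tnus:
        boxes[signature(tnu)].append(tnu.label)
    return dict(boxes)


# Hypothetical sample data: two treatments agree, a third adds a synonym.
TNUS = [
    TNU("Aus bus Smith 1950 sensu Smith 1950", "Aus", frozenset(), "species", "Aus bus"),
    TNU("Aus bus Smith 1950 sensu Jones 2010", "Aus", frozenset(), "species", "Aus bus"),
    TNU("Aus bus Smith 1950 sensu Pyle 2015", "Aus",
        frozenset({"Aus xus Jones 2010"}), "species", "Aus bus"),
]
BOXES = group_into_boxes(TNUS)
```

Here the Smith and Jones treatments land in one box and the Pyle treatment in another, because only the heterotypic synonymy differs.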
@deepreef TNUs != OTUs. The former are handled in TW by NOMEN+Citations IIRC. OTUs are what WikiData are doing, just an anonymous QID + data, some names, some not, I think. Requiring nomenclature to define biological concepts doesn't universally work (bacteria, genetic species concepts), so why not abandon this approach from the get go (don't answer here). In TW we embrace OTUs. Users define a list of OTUs to export to their GSD. We crawl the list of OTUs to find out what nomenclature should come with the list.
Good luck with the tree diff approach. Note that AFAIK CoL doesn't really manage a classification as I think @rdmpage envisions they do. Until very recently they didn't even return some of the commonly used ranks. The classification that does exist is human constructed based on the Editor appending sectors onto a tree.
I assume that a more complete classification for the purposes here will be built by algorithm. I assume it will have all the same issues GBIF's does. So take that into account when you assume stability of identifiers embedding information derived from algorithms. For example, one species of tenebrionid appearing in 4 kingdoms by the time it gets to GBIF collapses the consensus, to use another tree-based concept.
Oh, you'll also need to embed versioning into the whole system, as the algorithm will clearly evolve as you struggle to find any use for it. Each commit to the algorithm will render past identifiers for concepts meaningless, as it will no longer have the same rules, and trying to figure out what changed between versions with respect to species concepts will only be useful as a sadistic test for graduate students taking computer science prelims. ;)
@mjy I'm not quite so pessimistic, but don't have data to argue the point. The tree diffs needn't operate on CoL itself, they could be applied to the input classifications from the source databases (e.g., the reptile database mentioned above).
@rdmpage Right, point taken, larger classification not needed. I think given this fact there is nothing preventing the experiment to start right now:
I.e. there are no bottlenecks beyond time to this experiment.
Back from diving, and lots to think about/discuss. But quick for right now to @mjy "Off Topic", which I actually think is very much "On-topic", because IIUC (not sure if that's a thing, = "If I Understand Correctly"), @mdoering is trying to answer the broad question "When do I mint a new CoL Identifier, vs. when do I modify properties associated with an existing identifier?" (CMIIW, @mdoering ). The simple answer to that question is, "When the concept/circumscription is different!" But that's not a very useful answer, because we haven't yet answered the prerequisite questions, "What is a concept?", "Is it the same as a circumscription, or different?", and more to the point, "What are the core properties of a concept/circumscription such that a change in one of these properties results in an implied different concept/circumscription?"
So, in that context, the clarification that "TNU != OTU" is both very helpful and very relevant to these prerequisite questions.
To start, a bit of clarification of my own. Although the "N" part of TNU is often assumed to be a Linnean-style scientific name (and that's where most of our focus has been), that's not necessarily the only context in which the "N" part applies. There's been some discussion of this over at tdwg/tnc, but I would certainly include some classes of non-Linnean names (and some advocate for opening it to all text-string labels, including vernaculars/etc.) But the point is, Linnean-style nomenclature is absolutely not required for TNUs to work either. But I'm pretty sure that the "T" part of TNU is the same as an OTU (if not, then CMIIW).
So here are some questions about OTUs in this context (i.e., the WikiData notion of it, as adopted by TW):
I think these questions are on-topic for the issue raised by @mdoering, because if a CoL "thing" is the same as a WikiData/TW OTU "thing", then understanding the logic behind when a new QID is minted vs. when an existing OTU is amended might directly address the same question in the CoL context.
PS, Before I wrote the above, I didn't know that "CMIIW" was a thing, but evidently it is. I also just now learned that IIUC is a thing too.
Your simple answer is useful, Rich, because it's a good start. You mint a new identifier if, and only if, the concept changes. Anything else and your identifier must be referring to something else. A concept changes when something is added to it or removed from it. What is that something? It's clear that one answer, at least, is that a concept changes when you add other taxa to it or split taxa (new or previously included) from it.
@deepreef we all have ideas about how identifiers should be minted for OTUs; @dremsen's ideas are perfectly fine. We know that we need new IDs for new concepts.
Any aggregating system in place to track all concepts on Earth must be setup to handle the simplest of all cases- the curator of the GSD provides a unique identifier that they assert will only change if their curated concept changes. If you don't have a data model for this basic use case in place, then you're not handling the best case scenario.
To me, the OTUid (QID for example here, but really could be a big UUID, whatever - just no meaning plz) coming from the curator of a GSD (these species concepts don't just come from nowhere, they come from blessed lists of various quality as curated by a human) is the single best way to track differences. If the curator changes the id, they understand that they are asserting a new taxon concept. The way we teach them to think about this is that if you had concept A, and you did science 1, and then concept B, and science 1, you hypothesize that you might get a different answer. We force curators to think of list of OTUs, not list of names because the CoL is a list of OTUs, and the names we can use to get near to them.
I wish I had my philosophy of science notes from undergrad back in front of me. The course so elegantly pointed out all the problems trying to uniquely identify things. Definitions based on sets, expanding and contracting definitions, all chairs and not chairs, etc. etc. All of them failed in some cases. This is extremely well understood philosophically. The exercise here would fit right into one of those bodies of thought. What to do then? At the end of the day, what you need are meaningful units. What is a meaningful unit in our case? The thing you can do science with. What thing? A species concept, something "real". That unit, gets a single, anonymous ID, Q, or other meaningless URI, etc.
To your questions:
Do they always have some sort of text-string label associated with them? I'm assuming the QID at least, but is that the only way to cite them?
What properties of the "Data" part help you determine whether you're dealing with a new instance of an existing QID-branded OTU, vs. an OTU that requires the minting of a new QID?
TLDR - I don't believe we can do better without a different data model at the core ("anonymous" nomenclature free concepts), and better tools and processes for GSD curators.
@dremsen 👍 I think that's exactly right! But when you say "add other taxa", at the species level what that means is that you are adding another heterotypic synonym, which means you're adding a new type specimen to the concept. However, it's not that simple. First, there are all the OTUs that don't have Linnean-style names. I fully agree with @mjy that requiring [Linnean-style] nomenclature to define biological concepts doesn't universally work. So the "type specimens as boundary markers for concept circumscriptions" can only go so far (i.e., can only really work in the context of taxa signified with Linnean-style names anchored to name-bearing types).
Second, consider this scenario:
Aus bus Smith 1950 sensu Smith 1950
Aus xus Jones 2010 sensu Jones 2010
Aus bus Smith 1950 sensu Jones 2010
Aus xus Jones 2010 sensu Pyle 2015 (as a heterotypic synonym)
Aus bus Smith 1950 sensu Pyle 2015
We've got five TNUs here, four of which represent taxa asserted to be valid. The fifth TNU is Pyle's assertion that the type specimen of Aus xus is conspecific with the type specimen of Aus bus, and because Aus bus has priority, his (Pyle's) concept is labelled as "Aus bus", but it includes both Jones' concept of Aus bus and Jones' concept of Aus xus (not always the case, but for sake of simplicity, let's say it's true in this case).
So, suppose the 2009 CoL has ID1234 associated with Aus bus, which we'll infer to be Aus bus Smith 1950 sensu Smith 1950.
Now Jones comes along in 2010 and names Aus xus, so CoL mints a new ID9876 for Aus xus Jones 2010 sensu Jones 2010 to include in its 2011 Catalogue.
Here's the kicker: Does CoL issue a new ID for Aus bus? If so, why? How would CoL ever know whether this is a case of Aus bus being "split" into two species by Jones, or it's just a new discovery of a new sister-species (Aus xus) to the already established Aus bus?
The problem is that Smith 1950 didn't examine any specimens from Palau, so we have no idea whether he would have included specimens from Palau within his circumscription of A. bus, or if he would have agreed with Jones that the Palauan species is different. So at this stage, CoL can't decide, based on the information it has, whether its representation of Aus bus needs a new ID, or can keep using the same ID.
However, suppose that CoL has a TNU-based model, and for its 2009 catalogue it anchored the record for Aus bus to the treatment of Remsen 2005 (TNU: Aus bus Smith 1950 sensu Remsen 2005). With a little bit of @nfranz - style sleuthing, we discover that Remsen examined specimens from Palau and declared them to be Aus bus. Now we have a good idea that CoL had defined its record for Aus bus s.l., so by recognizing a portion of this circumscription in the form of Aus xus Jones 2010, we know that a new s.s. circumscription is needed for the CoL record of Aus bus, and thus a new ID is created for Aus bus s.s. to distinguish it from the earlier CoL record with ID1234.
Of course, it's rarely the case that there are only two alternatives, so "s.l." vs. "s.s." is kind of useless. A MUCH better approach is for CoL to explicitly use "sensu Remsen 2005" and "sensu Jones 2010" instead of "sensu lato" and "sensu stricto" (respectively).
The problem, though, is that it takes a bit of @nfranz - style sleuthing to make this determination, and CoL can't incorporate that information into its records. However, it can make something of Aus bus Smith 1950 sensu Pyle 2015, because this TNU also reveals the second type-specimen-by-proxy of the protonym link embedded within Aus xus Jones 2010 sensu Pyle 2015 pointing to Aus xus Jones 2010 sensu Jones 2010.
If anyone actually followed that, I'm deeply impressed (I had to re-read it several times myself, and I still probably screwed something up). But here's the short summary point: With a TNU model, you can do a pretty powerful job reasoning/computing backwards through time (e.g., comparing Aus bus Smith 1950 sensu Pyle 2015 to Aus bus Smith 1950 sensu Jones 2010), but it's much harder to reason forward in time (e.g., comparing Aus bus Smith 1950 sensu Smith 1950 to Aus bus Smith 1950 sensu Jones 2010).
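A small sketch of why backwards is easier than forwards, under the assumption that each TNU is reduced to the set of protonyms it explicitly includes (the dictionary and the `compare` function are illustrative only, not part of GNUB or CoL):

```python
# Each TNU is reduced to the set of protonyms it explicitly includes.
# This reduction is an assumption for illustration; it ignores specimens.
INCLUDED_PROTONYMS = {
    "Aus bus Smith 1950 sensu Smith 1950": frozenset({"Aus bus Smith 1950"}),
    "Aus bus Smith 1950 sensu Jones 2010": frozenset({"Aus bus Smith 1950"}),
    "Aus xus Jones 2010 sensu Jones 2010": frozenset({"Aus xus Jones 2010"}),
    "Aus bus Smith 1950 sensu Pyle 2015": frozenset(
        {"Aus bus Smith 1950", "Aus xus Jones 2010"}
    ),
}


def compare(tnu_a, tnu_b):
    """Describe how the protonym set of tnu_a relates to that of tnu_b."""
    a, b = INCLUDED_PROTONYMS[tnu_a], INCLUDED_PROTONYMS[tnu_b]
    if a == b:
        return "same protonyms (circumscription may still differ)"
    if a > b:
        return "first includes second"
    if a < b:
        return "first is included in second"
    if a & b:
        return "overlapping"
    return "disjoint"


BACKWARD = compare(
    "Aus bus Smith 1950 sensu Pyle 2015", "Aus bus Smith 1950 sensu Jones 2010"
)
FORWARD = compare(
    "Aus bus Smith 1950 sensu Smith 1950", "Aus bus Smith 1950 sensu Jones 2010"
)
```

Looking backwards from Pyle 2015, the protonym sets show that his Aus bus includes Jones' Aus bus. Looking forwards from Smith 1950, the two sets are identical, so the computation cannot say whether Jones split the species or merely discovered a new one.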
OK, more to come, but I'm approaching this one point at a time.
And it seemed we were getting so close to a resolution of the issue...
My sense from this discussion is that there are (at least) two different approaches to the topic.
1. ids should reflect our knowledge of taxa, and two taxa have the same id only if they are the same. If taxa change, they get a new id. I note that agreement on the meaning of same and change seems, um, elusive (cue numerous "A. us, A. bus" discussions), but I digress. Hence with each iteration you want ids that faithfully reflect current taxonomic understanding, and hence reflect changes in taxa (however defined). One consequence is that downstream users of these ids (e.g. people linking to them in their own databases) will be faced with regular changes to some (most?) ids.
2. ids should be as stable as possible so that they provide a reliable basis for external linking (e.g., by downstream users, Wikidata, etc.). Hence with each iteration, the goal is to minimise changes in ids. Downstream users will be able to link with confidence that the id is likely to be stable, with the proviso that what the id represents may itself have changed in ways that some users would consider meaningful (e.g., a genus has acquired additional species from another genus).
I am not sure we can do both, so I think the real question is which outcome best reflects CoL's goals? I'm guessing it's no surprise that I value identifier stability (2) more than fidelity to particular taxonomy (1), so I would vote for 2. This also means that I regard the "A. us, A. bus" discussions as essentially beside the point. The likelihood of me using CoL identifiers is mostly a function of their stability and how interconnected they are with other identifiers.
Obviously, if faithful representation of what the ids point to matters more (e.g., you can't accept that a genus with the same name but different component species can have the same id), then you will favour 1, and then the crucial issue is defining a set of criteria for determining identity of taxa (@mdoering original question before the rowdy neighbours turned up with alcohol and music).
In a sense these aren't completely different positions (obviously 2 still depends on some notion of same, in this case similarity of edges in the graph i.e., parent → child pairs having the same labels) but it seems to me that 1 is effectively blocked in the absence of agreement on the operational meaning of same. Likewise as @mjy has pointed out, advocates of 2 (e.g., me) have argued for a simple tree diff approach without demonstrating a working system.
So, in summary, it seems to me that there are two separate goals here: fidelity to changing concepts vs stability of identifiers in the face of change.
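For what it's worth, the edge-based notion of same mentioned above can be sketched in a few lines. This is only a toy, not the working system that has yet to be demonstrated: each edition is assumed to be a child → parent mapping of labels, and the taxon names (including the family "Aidae") are hypothetical.

```python
def edges(tree):
    """Turn a child -> parent mapping into a set of (parent, child) label pairs."""
    return {(parent, child) for child, parent in tree.items()}


def diff_editions(old, new):
    """Classify edges as kept, removed or added between two editions."""
    old_edges, new_edges = edges(old), edges(new)
    return {
        "kept": old_edges & new_edges,
        "removed": old_edges - new_edges,
        "added": new_edges - old_edges,
    }


# Hypothetical editions: the newer one gains a species in the same genus.
EDITION_OLD = {"Aus bus": "Aus", "Aus": "Aidae"}
EDITION_NEW = {"Aus bus": "Aus", "Aus xus": "Aus", "Aus": "Aidae"}
DIFF = diff_editions(EDITION_OLD, EDITION_NEW)
```

Under approach 2, ids attached to the kept edges would carry over unchanged and only the added edge would need a new id, even though the genus now contains an extra species.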
@mjy
Any aggregating system in place to track all concepts on Earth must be setup to handle the simplest of all cases- the curator of the GSD provides a unique identifier that they assert will only change if their curated concept changes. If you don't have a data model for this basic use case in place, then you're not handling the best case scenario.
The CoL model obviously does handle that. But CoL deals with very heterogeneous data from a wide range of sources (we prefer to avoid the term GSD as the sources are often not "global" and also not limited to "species"). Some do have ids, some have none at all. And we hardly ever know what they represent. It might be database records that change their id by some evil algorithm. It might be name identifiers, it might be "OTU" identifiers. We do not know. But even if we did have ids for OTUs from each and every source, they would never apply the same methods or rules for defining a concept. It differs between larger taxonomic groups; it might be more molecular driven, it could be more or less phylogeny driven, it could follow more of a splitter or lumper philosophy. It surely is never consistent. You could argue we do refer to the original source and can just forward the responsibility for the idea of a concept to them. But for an end user the CoL becomes even more heterogeneous, and they would have a hard time understanding what that id means and whether they can trust it for their purpose. My main reasons for having a genuine, consistent CoL identifier based on some agreed method are:
It would be an option to use a hybrid approach and treat sources differently. We could mark manually selected sources as having properly curated taxon identifiers and blindly follow their changes, while others fall back to the default CoL provided ids.
Reptile DB btw does not even expose their identifiers but instead recommends to link to their species pages by their search by name URL: http://reptile-database.reptarium.cz/species?genus=Trimerodytes&species=yunnanensis
All: This is probably the most useful discussion I've had in months (if not years), because it actually feels like we're getting somewhere on a topic where wheels have been spinning and spinning. So fair warning and apology, much more to come.
Here I want to "see" the hypothetical from @dremsen and "raise" it into an actual, real-world example from my dive today. But first, a nit-pick:
to paraphrase a past Rich Pyle, he is saying "The specimen you caught is conspecific to the type specimen that I collected and is thus congruent with the concept I had when I caught my specimen and formally described it as a new taxon according to the rules of the code."
The first part of that is right, but the "congruent" part is a bit off. I would actually phrase it as:
"In my taxonomic opinion, the specimen you caught is conspecific to the type specimen that I collected, and thus falls within the species-level concept I had when I caught my specimen and formally described it as a new taxon and established my specimen as the name-bearing type according to the rules of the code. Because no other earlier-established name-bearing type falls within my concept, then the correct name for my concept, and thus your specimen, is Chromis abyssus."
Just because I include your specimen within the same circumscription that I have in my head for C. abyssus doesn't mean that my concept is necessarily congruent with any other concept.
Anyway, getting back to the real-world example. This is a cropped frame grab from a video I took today:
It's in the same genus as the one in the @dremsen hypothetical (Chromis), but this one lives shallow and is probably the most common species of its genus in many places where it lives.
As an Ichthyologist born and raised on Hawaiian reefs, I have no trouble identifying this as Chromis agilis, described by Smith, 1960 (see Protonym in ZooBank). Don't take my word for it, check it out yourself.
CoL cites FishBase as the source database (GSD), where the online resource is. Going to that link reveals a distribution map showing broad distribution across the Indo-Pacific, and cites Allen 1991 as the "Main Reference". The record in WoRMS is derived from the same source.
Here is the record in ITIS. And here it is in Catalog of Fishes.
This is about as stable as taxonomy gets. At least it was... until last week, when this was published.
You can read the PDF if you want, but the short story is that Allen & Erdmann came to the conclusion that the Pacific populations represent a different species from those in the Indian Ocean. The type specimen of C. agilis is from the Seychelles, and it turns out that the taxonomy has been so stable since 1960, that no synonyms have ever been described from anywhere else (including the Pacific). So Allen & Erdmann decided to describe the new species [Chromis pacifica], based on a type specimen collected in the Coral Sea.
So... we have 33 TNUs in GNUB hooked into the Protonym for C. agilis:
Chromis agilis Smith, 1960 sensu Smith 1960
Chromis agilis Smith, 1960 sensu Randall & Swerdloff 1973
Chromis agilis Smith, 1960 sensu Okamoto & Kanenaka 1984
Chromis agilis Smith, 1960 sensu Randall, Lobel & Chave 1985
Chromis agilis Smith, 1960 sensu Allen 1986
Chromis agilis Smith, 1960 sensu Randall, Allen & Steene 1990
Chromis agilis Smith, 1960 sensu Allen 1991
Chromis agilis Smith, 1960 sensu Severns & Fiene-Severns 1993
Chromis agilis Smith, 1960 sensu Irving, Jamieson & Randall 1995
Chromis agilis Smith, 1960 sensu Senou & Morita 1995
Chromis agilis Smith, 1960 sensu Randall, Allen & Steene 1997
Chromis agilis Smith, 1960 sensu Randall, Ida, Kato, Pyle & Earle 1997
Chromis agilis Smith, 1960 sensu Myers 1999
Chromis agilis Smith, 1960 sensu Fricke 1999
Chromis agilis Smith, 1960 sensu Randall 1999
Chromis agilis Smith, 1960 sensu Nakabo 2000
Chromis agilis Smith, 1960 sensu Laboute & Grandperrin 2000
Chromis agilis Smith, 1960 sensu Allen 2001
Chromis agilis Smith, 1960 sensu Coles, DeFelice & Minton 2001
Chromis agilis Smith, 1960 sensu Nakabo 2002
Chromis agilis Smith, 1960 sensu Myers & Donaldson 2003
Chromis agilis Smith, 1960 sensu Randall, Williams, Smith, Kulbicki, Mou Tham, Labrosse & Kronen 2004
Chromis agilis Smith, 1960 sensu Lobel & Lobel 2004
Chromis agilis Smith, 1960 sensu Eschmeyer 2004
Chromis agilis Smith, 1960 sensu Mundy 2005
Chromis agilis Smith, 1960 sensu Allen 2005
Chromis agilis Smith, 1960 sensu Randall 2005
Chromis agilis Smith, 1960 sensu Allen, Cross & Allen 2006
Chromis agilis Smith, 1960 sensu Randall 2007
Chromis agilis Smith, 1960 sensu Fricke, Mulochau, Durville, Chabanet, Tessier & Letourneur 2009
Chromis agilis Smith, 1960 sensu Quéro, Spitz & Vayne 2010
Chromis agilis Smith, 1960 sensu Fricke, Durville, Bernardi, Borsa, Mou-Tham & Chabanet 2013
Chromis agilis Smith, 1960 sensu Allen & Erdmann 2020
Here's the challenge: How many OTUs are there? Is this the same as the number of CoLID values there should be? What additional information would you need to determine how many OTUs?
In my proposed pathway to salvation, I would have CoL harvest one more piece of information from the GSD source record for C. agilis: the TNU for FishBase's "Main Reference". It's in the list above as Chromis agilis Smith, 1960 sensu Allen 1991. It would take me about one weekend to hook all the existing CoLID values derived from FishBase into the corresponding GNUB Protonyms and FishBase "accepted" TNU values.
In the next cut of FishBase that is imported into CoL, you would note two things:
1) The addition of a new Protonym for Chromis pacifica Allen & Erdmann, 2020 sensu Allen & Erdmann 2020
2) A new "Main Reference" from FishBase in their record for C. agilis, pointing to Chromis agilis Smith, 1960 sensu Allen & Erdmann 2020
Thus, CoL would mint a new ID for C. pacifica (because it's a new name not previously imported into CoL), and would mint a new ID for C. agilis (because the "accepted" TNU from the source GSD changed).
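The rule in that last paragraph is simple enough to sketch. This assumes each import is reduced to a protonym → accepted TNU mapping; the `ids_to_mint` function and that data layout are my own illustration, not an existing CoL or FishBase interface.

```python
def ids_to_mint(previous, incoming):
    """Return protonym -> reason for every record that needs a new CoL id."""
    reasons = {}
    for protonym, accepted_tnu in incoming.items():
        if protonym not in previous:
            reasons[protonym] = "new protonym"
        elif previous[protonym] != accepted_tnu:
            reasons[protonym] = "accepted TNU changed"
    return reasons


# Previous import: FishBase's "Main Reference" for C. agilis was Allen 1991.
PREVIOUS = {
    "Chromis agilis Smith, 1960": "Chromis agilis Smith, 1960 sensu Allen 1991",
}
# Next import, after the Allen & Erdmann 2020 paper.
INCOMING = {
    "Chromis agilis Smith, 1960":
        "Chromis agilis Smith, 1960 sensu Allen & Erdmann 2020",
    "Chromis pacifica Allen & Erdmann, 2020":
        "Chromis pacifica Allen & Erdmann, 2020 sensu Allen & Erdmann 2020",
}
TO_MINT = ids_to_mint(PREVIOUS, INCOMING)
```

A protonym whose accepted TNU is unchanged between imports would not appear in the result, so its CoL id stays as it is.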
In the long run, CoL would stop minting IDs altogether, and simply make statements along the lines of:
"With regard to Protonym Chromis agilis Smith, 1960 sensu Smith 1960, we defer to FishBase, who follows Chromis agilis Smith, 1960 sensu Allen & Erdmann 2020."
You could cache a bunch of other metadata, of course, but the core service provided by CoL would be an endorsement of Chromis agilis Smith, 1960 sensu Allen & Erdmann 2020, determined via FishBase.
OK, more on the rest of @dremsen 's hypothetical post in a moment.
@mdoering
Reptile DB btw does not even expose their identifiers but instead recommends to link to their species pages by their search by name URL: http://reptile-database.reptarium.cz/species?genus=Trimerodytes&species=yunnanensis
Yes, which strikes me as bad design, made worse by the result if you use the URL that would have applied to this species before: http://reptile-database.reptarium.cz/species?genus=Sinonatrix&species=yunnanensis :
Species Sinonatrix yunnanensis was not found! You can try find it as synonym, or use advanced search for searching it other way.
Given that Reptile DB knows that Sinonatrix yunnanensis is a synonym of Trimerodytes yunnanensis I don't understand why they can't just take you to Trimerodytes yunnanensis !?
Anyway, the integer ids I use are in the database dumps, and on first glance seem stable across releases of the Reptile DB checklist.
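The redirect I'd expect from a name-based URL is not much code. A minimal sketch, assuming the database holds a synonym → accepted name table (the table and function names here are hypothetical and have nothing to do with how Reptile DB is actually implemented):

```python
ACCEPTED = {"Trimerodytes yunnanensis"}
# Synonym table as described above: the older combination points to the new one.
SYNONYM_OF = {"Sinonatrix yunnanensis": "Trimerodytes yunnanensis"}


def resolve_name(genus, species):
    """Return the accepted name for a genus/species query, following synonyms."""
    name = f"{genus} {species}"
    if name in ACCEPTED:
        return name
    # Fall back to the synonym table instead of reporting "not found".
    return SYNONYM_OF.get(name)
```

So a query for genus=Sinonatrix, species=yunnanensis would land on Trimerodytes yunnanensis, and only a name absent from both tables would produce a "not found" page.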
@dremsen :
Chromis abyssus, Pyle 2001
Chromis hawkeswoodii, Remsen 2020 (heterotypic synonym)
Chromis margaritifer Fowler, 1946, sec Döring 2020, (chresonym) misidentification. #this is removed by the COL because it is not a 'real' synonym and should not be used to improve recall in search.
The only way that third one has any place in this discussion about circumscriptions/concepts is if you're going for the extrinsic approach of defining concepts/circumscriptions by enumerating lots and lots of individual organisms. We must have different interpretations of the meaning of "chresonym" (a term I've never liked, or used); because I do not see that third one as a chresonym. I don't even see it as playing a role in taxonomy. It's a dispute about the identification of a particular organism, which is a whole different thing from reasoning across taxon concepts.
Pyle's concept of abyssus hasn't really changed. Pyle would assert that the holotype of hawkeswoodii is conspecific to abyssus. As we saw, he did that originally when my fish was just a specimen, prior to me doing all this work and turning it into a holotype. But since I went to the effort to enter my concept and associated specimen into the pantheon of taxonomy by following the nomenclatural rules, his concept has changed. It includes the original 'protonym:' a nomenclatural term which is really a proxy for Rich's novel concept. It also now contains Remsen's concept of hawkeswoodii and its associated protonym and holotype. Protonym count went from one to two. Concept changed.
Exactly! This is what I was trying to get at. We don't know the relationship between Chromis abyssus, Pyle 2001 sec Pyle 2001 and Chromis abyssus, Pyle 2001 sec Pyle 2020. But we do know the relationship between Chromis abyssus, Pyle 2001 sec Pyle 2020 and Chromis abyssus, Pyle 2001 sec Remsen 2020 (assuming Remsen regarded C. abyssus as a valid and distinct species). That's because both Protonyms are referenced in both publications, so there is computable logic here. We don't know how Chromis abyssus, Pyle 2001 sec Pyle 2001 relates to the others unless we do some @nfranz -level sleuthing.
OK, I'll stop replying until I'm caught up reading.
Dear all, What an interesting discussion! But difficult to follow, getting into it today after 60 emails at my counter… A few quick thoughts, even if I'm not sure they are relevant to this discussion...
Taxonomy knowledge versus taxonomy usage. I think we need to separate taxonomy knowledge from focal usages/practices of taxonomy (meeting specific needs). In the digital sphere the first needs a complete formalisation of what a taxon is; the latter tolerates some ambiguity because it serves local/focal purposes. If we could achieve the first, we could easily take from it what is needed to operate the second, but if we start by focusing on the second, we'll have to reinvent the wheel each time from the local/focal objectives it is meant to serve. And thus we get the current landscape, with lots of ways (tools, identifiers, practices …) to address taxonomy according to specific interests. I already mentioned this at CoL's 2017 Woods Hole meeting and discussed it with Rich, and it is the spirit of my paper (still not published) about taxon formalisation that several of you have already read (or reviewed): the goal should first be how to transfer/translate taxonomy knowledge into the digital sphere, even if trying to meet the needs is of course necessary. This whole discussion shows well that a complete formalisation of what a taxon is, and how to represent it digitally, is still a pending issue.
Approximation in terms. If we agree that a concept has (at least) 3 major properties (Name (N), Taxon defined by circumscription (Tc) and Taxon defined in intension (Ti)), then the taxon concept we use in the new CoL represents only Tc, not the complete taxon (T). This is a semantic shortcut we need to be aware of when looking for taxonomic identifiers beside CoL. The best the new CoL can address is identifiers for N, Tc, N+Tc, but not for T, which CoL does not address completely! There is no taxon concept in CoL. These 3 properties are, I think, clear to everyone, but there is probably a fourth one: its dynamic component (biological nature, conceptual perception).
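To make "identifiers for N, Tc, N+Tc" concrete, here is a minimal sketch that assumes identifiers are derived by hashing content. Content hashing is not proposed anywhere in this thread, and the function names, the "Aus"/"Aus bus" placeholder names, and the 12-character truncation are all hypothetical choices made for illustration. Ti and the complete taxon T are deliberately left out, since the point above is that they are not captured.

```python
import hashlib


def _digest(kind: str, *parts: str) -> str:
    """Short, deterministic hash of the given parts (illustrative only)."""
    payload = "|".join((kind, *parts)).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()[:12]


def name_id(name: str) -> str:
    """Identifier for N: depends on the name only."""
    return _digest("N", name)


def circumscription_id(members: set) -> str:
    """Identifier for Tc: depends on the included members only."""
    return _digest("Tc", *sorted(members))


def name_circumscription_id(name: str, members: set) -> str:
    """Identifier for N+Tc: changes if either the name or the members change."""
    return _digest("N+Tc", name, *sorted(members))


# Placeholder data: a genus "Aus" before and after one child is added.
before = {"Aus bus", "Aus cus"}
after = {"Aus bus", "Aus cus", "Aus dus"}

print(name_id("Aus") == name_id("Aus"))                      # True
print(circumscription_id(before) == circumscription_id(after))  # False
print(name_circumscription_id("Aus", before)
      == name_circumscription_id("Aus", after))              # False
```

In this reading the N identifier is stable across releases, while the Tc and N+Tc identifiers change whenever the set of children changes, which is exactly the trade-off between stability and precision discussed in this issue.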
Identifying what? Taxon identifiers are needed for the practice of taxonomy itself and for external usage of it. However, having them fixes the taxon as a static entity, while it is an evolving concept, both in its biological nature and in its conceptual perception. I know almost nothing about identifiers, but at least such an identifier should be able to address this paradox. Addressing the issue by any subset of N, Tc and Ti would fail to fully identify a taxon (but some subsets might be enough to answer specific needs). Using names only has been shown to be inappropriate. Using circumscription (Tc) only (or with names) remains incomplete and addresses the concept of the taxon (not the taxon!), and only part of the concept. It takes into account its usage (children taxa) and is approached by capturing the taxonomic literature. However, circumscription is not only about children taxa: each time a new reference is added, the concept of the taxon addressed also changes, because it encompasses all the biological attributes (i.e. taxon properties) associated with the specimens it groups: what Drosophila Fallén encompassed in 1830 is totally different from today in terms of children taxa of course, but also in terms of its distribution, ecology, … We are referring to the same biological entity (the taxon) but no longer to the same concept. Tracking the taxon name usage is not sufficient to completely formalize the taxon as a biological entity. Similarly, each time a taxon is moved in the classification, its definition (intension) changes (the topic of my paper): we have the same biological entity (the taxon) but not the same concept. - [and by the way: tracking all changes by circumscription (= tracking all occurrences) is no less an enormous task than tracking all changes by intension (= tracking all classification changes): if we agreed to do the first (= GBIF) we could also do the second] -.
In other terms, for the usage of taxonomy we want/need a taxon identifier for the taxon as a biological entity, which is neither its name nor its concept in any of its definitions. But all of them are useful to represent part of the taxonomic knowledge in the digital sphere. As put in my paper, the Berendsohn notation "Aus bus Author, Date, sec. Author Date" (is Rich's "sensu" a similar one?) remains for me the best way to clearly identify a taxon, even if "sec Author Date" itself represents a concept (a concept of classification) that is itself not static, is also hierarchic (sec Author Date in a higher system of classification sec Author2, Date), and evolves with the progress of taxonomic/phylogenetic knowledge. A taxon identifier focussing on such a statement is probably the best solution we could have for the moment.
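As a rough illustration of an identifier built on such a statement, the sketch below splits a usage string into its protonym part and its "sec" part. It assumes the comma placement used in the examples earlier in this thread ("Name, Author Year sec Source"), which differs slightly from the placeholder notation quoted above; the `TaxonUsage` class, the `parse_usage` function and the regular expression are hypothetical and would not cope with real-world author strings.

```python
import re
from dataclasses import dataclass

# Assumed form: "<name>, <author> <year> sec[.] <source>"
_USAGE = re.compile(
    r"(?P<name>[^,]+),\s*(?P<author>.+?),?\s+(?P<year>\d{4}),?"
    r"\s+sec\.?\s+(?P<sec>.+)"
)


@dataclass(frozen=True)
class TaxonUsage:
    name: str
    author: str
    year: int
    sec: str  # the source "according to" which the name is used

    @property
    def protonym(self) -> str:
        """Name plus original authorship, independent of the sec source."""
        return f"{self.name}, {self.author} {self.year}"


def parse_usage(text: str) -> TaxonUsage:
    """Parse one usage string; raise ValueError if it does not fit the form."""
    m = _USAGE.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"not in 'Name, Author Year sec Source' form: {text!r}")
    return TaxonUsage(m["name"].strip(), m["author"], int(m["year"]),
                      m["sec"].strip())


a = parse_usage("Chromis abyssus, Pyle 2001 sec Pyle 2020")
b = parse_usage("Chromis abyssus, Pyle 2001 sec Remsen 2020")
print(a.protonym == b.protonym)  # True
print(a == b)                    # False
```

The two usages share a protonym but are distinct as (protonym, sec) pairs, which is the distinction an identifier based on this notation would have to carry.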
BW, Th.
On 24 August 2020 at 10:56, Roderic Page notifications@github.com wrote:
@mdoering Reptile DB btw does not even expose their identifiers but instead recommends to link to their species pages by their search by name URL: http://reptile-database.reptarium.cz/species?genus=Trimerodytes&species=yunnanensis Yes, which strikes me as bad design, made worse by the result if you use the URL that would have applied to this species before: http://reptile-database.reptarium.cz/species?genus=Sinonatrix&species=yunnanensis :
Species Sinonatrix yunnanensis was not found! You can try find it as synonym, or use advanced search for searching it other way.
Given that Reptile DB knows that Sinonatrix yunnanensis is a synonym of Trimerodytes yunnanensis I don't understand why they can't just take you to Trimerodytes yunnanensis !?
Anyway, the integer ids I use are in the database dumps, and on first glance seem stable across releases of the Reptile DB checklist.
Define rules for a stable taxonID. Understanding when a taxon changes sufficiently to warrant an identifier change