ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum

botanical infraspecific taxon names #7962

Open dustymc opened 1 month ago

dustymc commented 1 month ago

From the Arctos logs, I ended up on https://arctos.database.museum/name/Carex%20scita%20subsp.%20scabrinervia and noticed that several sources are doing something that seems a lot more "connectable" than what Arctos is doing.

The Arctos name is

Carex scita subsp. scabrinervia

and one of the classification terms is

Carex scita subsp. scabrinervia (subspecies)

EVERY other source uses something like

Carex scita scabrinervia (variety)

There are some variations in the rank and such, but NOBODY else interrupts the name with those unpredictable (and I believe not terribly consistent over time) ranks, only Arctos.

At some point I'd understood that the ranks were an unavoidable part of the name, and there's no possibility of "correctly" doing anything other than what we're doing. That does not seem to be true anymore, and I suspect we're cutting ourselves off from effectively communicating with some part of the Extended Specimen Network.

There are 218787 botanical-name-like Names in Arctos at the moment. (Those are essentially "has dot, not meteorite" - I didn't dig very deep, just enough to hopefully understand the scope.)

There are 2086 "clean" corresponding names eg

There are another 21025 almost-duplicates with various forms of the 'rank-having name' eg

I'm impressed at how clean that is, but still cutting 10% of a collection off from the world isn't great (and those are just the super-obvious matches, I didn't look at any depth).

Is there some possibility of Arctos doing something more predictable with taxon names?

https://github.com/ArctosDB/arctos/issues/7941 is potentially related - I believe it's a request to somehow handle something similar to this, "clean" names and "metadata-bearing" identifications, but I'm awaiting clarification and might be lost.

@camwebb

Involved ranks:


 stripped_string | count  
-----------------+--------
 subhybr.        |      3
 modif.          |      4
 mut.            |     10
 agamosp.        |     13
 prol.           |     28
 nothof.         |     29
 monstr.         |     32
 nm.             |     84
 lus.            |    122
 subf.           |    255
 f.              |  14271
 subsp.          |  46758
 var.            | 157178
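
A rough sketch of how a tally like the one above might be reproduced outside the database. This is not the query that produced the table: the function names and the sample names are assumptions for illustration, and the rank list is simply copied from the table.

```python
from collections import Counter

# Rank markers copied from the table above. Treating exactly this set as
# "infraspecific rank tokens" is an assumption of this sketch.
RANK_TOKENS = {
    "subhybr.", "modif.", "mut.", "agamosp.", "prol.", "nothof.",
    "monstr.", "nm.", "lus.", "subf.", "f.", "subsp.", "var.",
}


def rank_tokens(name: str) -> list[str]:
    """Return the rank markers that appear as whole words in a name."""
    return [tok for tok in name.split() if tok in RANK_TOKENS]


def count_ranks(names: list[str]) -> Counter:
    """Tally rank markers across a list of name strings."""
    counts: Counter = Counter()
    for name in names:
        counts.update(rank_tokens(name))
    return counts


# Hypothetical sample; the real input would be the exported name list.
sample = [
    "Carex scita subsp. scabrinervia",
    "Vauquelinia corymbosa subsp. angustifolia",
    "Fucus distichus subsp. edentatus",
    "Carex scita var. scabrinervia",
    "Fucus distichus edentatus",
]

# Print in ascending order of count, like the table above.
for token, n in sorted(count_ranks(sample).items(), key=lambda kv: kv[1]):
    print(f"{token:<10} {n}")
```

Matching whole whitespace-separated tokens (rather than "has a dot anywhere") keeps author abbreviations that are not in the rank list out of the tally.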

Data:

temp_name_with_dot.csv.zip

camwebb commented 1 month ago

EVERY other source uses something like

Carex scita scabrinervia (variety)

There are some variations in the rank and such, but NOBODY else interrupts the name with those unpredictable (and I believe not terribly consistent over time) ranks, only Arctos.

I don't quite understand the concern. For "Carex scita scabrinervia (variety)" in WFO the canonical name is still "Carex scita var. scabrinervia" and it's the canonical names that form the links across DBs. Carex scita var. scabrinervia and Carex scita subsp. scabrinervia are two different names and whether we model those as i) name='Carex scita scabrinervia'; rank=var. and name='Carex scita scabrinervia'; rank=subsp. or ii) name='Carex scita var. scabrinervia' and name='Carex scita subsp. scabrinervia' (as we currently do) seems immaterial.
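
To make the two models concrete, here is a minimal sketch showing that (i) and (ii) can be converted into each other without losing information. The class and function names are hypothetical, the rank set is a small illustrative subset, and the sketch assumes a bare name with no author string and the rank marker sitting directly before the final epithet.

```python
from dataclasses import dataclass
from typing import Optional

# Small illustrative subset of rank markers (assumption of this sketch).
RANKS = {"subsp.", "var.", "f."}


@dataclass(frozen=True)
class RankedName:
    """Model (i): rankless name string plus a separate rank field."""
    name: str            # e.g. "Carex scita scabrinervia"
    rank: Optional[str]  # e.g. "var."; None when the name carries no rank


def split_rank(full_name: str) -> RankedName:
    """Model (ii) -> model (i): pull the rank marker out of the name string."""
    tokens = full_name.split()
    ranks = [t for t in tokens if t in RANKS]
    if len(ranks) > 1:
        raise ValueError(f"more than one rank marker in {full_name!r}")
    rest = [t for t in tokens if t not in RANKS]
    return RankedName(" ".join(rest), ranks[0] if ranks else None)


def join_rank(rn: RankedName) -> str:
    """Model (i) -> model (ii): put the rank back before the final epithet."""
    if rn.rank is None:
        return rn.name
    head, _, epithet = rn.name.rpartition(" ")
    return f"{head} {rn.rank} {epithet}"


var_name = split_rank("Carex scita var. scabrinervia")
subsp_name = split_rank("Carex scita subsp. scabrinervia")

# Same rankless string, but still two distinct names because rank differs.
print(var_name.name == subsp_name.name, var_name == subsp_name)
print(join_rank(var_name))
```

Under these assumptions the choice between the two models changes where the rank lives, not whether the two names can be told apart.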

I suspect we're cutting ourselves off from effectively communicating with some part of the Extended Specimen Network.

Please elaborate.

dustymc commented 1 month ago

whether we model those as i) name='Carex scita scabrinervia'; rank=var. ... seems immaterial

That's awesome to hear. It's not immaterial to me: this format would let Arctos detect and prevent some things that are almost assuredly hiding data, and would make everything more accessible. If it's immaterial to you then maybe this has an easy solution, which is always nice.

That is, the data in Arctos currently assert that Vauquelinia corymbosa subsp. angustifolia and Vauquelinia corymbosa angustifolia are different THINGS, different circumscriptions, somehow not the same thing, would never be used interchangeably for plants of the same lineage, and both need to exist in Arctos (along with the Baccharis alpina {various things} alpina stuff and etc.) I don't think that's the case - I'm kinda-sure that all of those variations are aimed at the same THING, and that the Code doesn't allow multiple names which vary only by rank anyway. If that's more or less true and the name format doesn't matter to you, I'll make a proposal.

Please elaborate.

Someone (globalnames? The sources feeding GN? Both?!) seems to find enough value in the 'deranked' strings to make some things more consistent. If I'm reading that right then perhaps we should follow suit.

I'll also elaborate on https://github.com/ArctosDB/arctos/issues/7941. I think botanists (and clearly some entomologists and maybe just everybody) like having more than the 'bare names' close to the records, and I suspect that's the origin of the "botanical format" we're currently allowing. I'm not sure what form 7941 might take (I can't tell if it's a display thing or an assertion thing at the moment - I think the latter), but either way when you say Vauquelinia corymbosa angustifolia I suspect you're really wanting me to magick up something like Vauquelinia corymbosa subsp. angustifolia (Rydb.) W.J.Hess & Henrickson (some fancy thing, not just a name) and slap it on the current Arctos UI (if it's a display request) or store that with the record so everyone can get it from anywhere (if it's more than a UI request, which would be more powerful but also perhaps require a bit more attention in a couple places). Either way, more predictable names are just going to make that easier, and maybe I can solicit input from a larger community (which kinda always makes things better) in the process.

If any of that's starting to sound like something that might be actionable, maybe we could schedule a zoom to make sure we're all on the same page before I try to turn anything into a proposal? And please let me know if I'm lost, confused, or wasting my time, if this isn't going to result in better and more accessible data (and generally happy botanists!) then we can kill it now.

camwebb commented 1 month ago

So I checked the ICN code and "A b var. c Foo" and "A b subsp. c Bar", which are names attached to two different circumscriptions, are acceptable (and are common in practice) as long as they both have the same type; if they don't have the same type they are homonyms. As a user I need to be able to specify that an identification was to one or the other of those two, and "A b c" won't do, hence the method we currently have (putting infraspecific ranks in the Arctos name).

This is a related issue to that discussed in #6500. As users we would like to be able to specify that an identification was to "A b Foo" not to "A b Bar", but this is not possible in the current Arctos model. I am going to add alternate authors to the classification in UAM Plants (still pending, but see example). In the same way, I could generalize this solution to infraspecific rank and add 1_full_name: A b var. c Foo and 2_full_name: A b subsp. c Bar to the classification, with an identification only to "A b c". But there is lost information here: we know from the label that the identification was to taxon "A b var. c Foo" and yet we lose that specificity if we just record "A b c".

One solution to both of these problems (lack of specificity in author string and infraspecific rank) is to use the existing taxon concept infrastructure we built :smile: The identification could specify a taxon concept of "A b var. c Foo sec. unknown" with the scientific_name reduced to "A b c". I've demo'd this with name and concepts and an identification.

Now how to do this for tens of thousands of existing records...? Phew

dustymc commented 1 month ago

One solution

Dang, I was getting all wound up to point that out!

as long as they both have the same type

You mean like type like Type like holotype? They are defined (by such type) as EXACTLY the same circumscription, and we are making it impossible to find them all with one query by putting them in two buckets?! Don't seem ideal....

If I'm reading that correctly at all, I'd suggest that a user confronted with the reality of actually needing to convey 'this was called A and this was called B but those are, at best, minor variations in interpretation of the same idea' will find a way to do so, and the ~20K ways we have of NOT allowing a user to find all of that material are a much greater/realized/actual concern.

It's easy to find a real example of this: https://arctos.database.museum/search.cfm?taxon_name_id=10953260 will (if you have the rights) find one UAM:Alg Fucus distichus subsp. edentatus and https://arctos.database.museum/search.cfm?taxon_name_id=12128592 (still assuming rights) will find one OGL:Genomic Fucus distichus edentatus. I think any researcher would probably want both of those, and would not expect to find them separated into two 'piles.' Search for either one of those things and you'll get half the available records; there are no clear/unavoidable paths to finding both, nor, having found one, to realizing that the other might exist. I think nearly everyone will find some of what they want and leave, never realizing they've missed stuff (half in my 'first thing I found' example, potentially much worse ratios and many more records).

I think the only defensible explanation of that (assuming I'm adequately understanding the situation) is "Arctos is facilitating denormalization." I'd think everyone would agree that's a 'must fix,' whatever mechanism we might use.
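
A minimal sketch of the kind of lookup that would return both 'piles' from either search string. The two taxon_name_id values and names are the ones linked above; everything else (function names, the rank subset, the idea of indexing on a rank-stripped key) is an assumption for illustration, not how Arctos search works today.

```python
from collections import defaultdict

# Small illustrative subset of rank markers (assumption of this sketch).
RANKS = {"subsp.", "var.", "f."}


def derank(name: str) -> str:
    """Drop rank markers so ranked and rankless variants share one key."""
    return " ".join(t for t in name.split() if t not in RANKS)


# taxon_name_id -> name string, taken from the two links above.
names = {
    10953260: "Fucus distichus subsp. edentatus",
    12128592: "Fucus distichus edentatus",
}


def build_index(names_by_id: dict[int, str]) -> dict[str, set[int]]:
    """Group name IDs under their rank-stripped key."""
    index: dict[str, set[int]] = defaultdict(set)
    for name_id, name in names_by_id.items():
        index[derank(name)].add(name_id)
    return index


def search(index: dict[str, set[int]], query: str) -> list[int]:
    """Return every name ID whose rank-stripped form matches the query's."""
    return sorted(index.get(derank(query), set()))


index = build_index(names)
print(search(index, "Fucus distichus edentatus"))
print(search(index, "Fucus distichus subsp. edentatus"))
```

Both calls return the same two IDs, which is the behavior a researcher would presumably expect regardless of which variant they typed.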

specify that an identification was to one or the other of those two,

My super-vague idea is that 7941 might result in some sort of a 'magic upgrade plain IDs when possible' collection setting (or SOMETHING - I'd clearly need a lot more information to do anything). If that's the case, the default might be to create a 'vanilla' ID of Vauquelinia corymbosa angustifolia and let Arctos (somehow....) magic that into Vauquelinia corymbosa subsp. angustifolia (Rydb.) W.J.Hess & Henrickson. (Or just display the latter in certain UI, or WHATEVER.) Nothing about any form or variation of that process would prevent you from explicitly feeding it precisely whatever ID string you want - Vauquelinia corymbosa supersubunderoverorder. angustifolia (me) I just made it up! is and will always remain a possibility. A user searching for Vauquelinia corymbosa angustifolia will reliably get both of those identification variants. (And a user wanting the super-specific variation can search identification - not taxonomy - to get that as well.) Another user cataloging similar material isn't likely to use anything other than Vauquelinia corymbosa angustifolia in that process. Removing the ranks from the names would prevent denormalization, and I think still allow you to adequately deal with any fringe use cases. (And if not we can look at some other mechanism to prevent denormalization, or just acknowledge that such things aren't possible under the ICBN and 'good luck, fair user!' or WHATEVER - I'm still just trying to understand if this is a viable plan or not, what details might need addressed if it is, what alternatives might exist if it isn't, etc.)

taxon concept

I suspect I won't be able to magic anything to that level of specificity (we'd not have that level of specificity if I could!), but yes, taxon concepts are a more-unambiguous way of clearly stating what the metadata-bearing identifications hint at. I suppose ideally that's all anyone would ever use for anything, but here we are....

(And what is https://arctos.database.museum/publication/10012230 trying to accomplish?! That CANNOT result in more than a name pretending to be a concept, can it??)

add alternate authors to the classification

I'm not sure I understand the goals, but that would possibly complicate any attempts to "auto-upgrade" identifications - at least in my super-vague thinking, that would require something like a single 'display_name' and I think maybe that couldn't be an appropriate concept in a - uhh, composite classification?? (And maybe anything you'd do that for needs a human being specific involved anyway?)

add 1_full_name: A b var. c Foo and 2_full_name: A b subsp. c Bar to the classification

You can, but I'm not sure why you'd prefer that over identifications? In any case I think this is another indication that those are orthographic variations of the same THING, which I'll take as evidence that this issue is on the right track even if there are details to understand.

know from the label

That's why https://arctos.database.museum/info/ctDocumentation.cfm?table=cttaxa_formula#a__string_ and identifications themselves exist. You can just transcribe EXACTLY whatever the label-maker wrote as an identification (and then you can add any number of alternate IDs as "translations" if necessary). This does not require (what I think is) denormalization (saying one thing in multiple ways) in taxonomy/names.

how to do this for tens of thousands of existing records

For this issue, I'm pretty sure I could come up with a lossless migration path (and tying that into something like 7941 might make any affected records better blend in with the big picture).

In general, if you need some different UI, or think 7941 might get at what you'd like to have, or WHATEVER - let me know, we can probably figure something out.

camwebb commented 1 month ago

I agree with most of this but think there's an important difference in our underlying assumptions:

They are defined (by such type) as EXACTLY the same circumscription

No, AFAIK. The type must be within both circumscriptions, but those circumscriptions can be very different. Names:

can all have the same type but they are truly different names based on (specified or unspecified) different circumscriptions. And taxonomic concepts:

will also have the same type specimen. The variations in infraspecific rank and author string are meaningful and both the creator of an Arctos record and the searcher for a particular name variation need to be able to be specific.

Using taxonomic concepts as I demo'd would work, but it's a kludge, because all we know is the name and have no specification of the 'according to' (I'll delete that dummy pub when this thread closes - the dummy pub is required to create a taxon concept). The other options are verbatim identification and A {string} (as you state). I have demo'd both of these here. I had not thought to use A {string} (I don't use these often), but that's a great way to solve this. I searched for name '=Carex scita subsp. scabrinervia (Franch.) Koyama' and jumped to the specimen. However using name 'Carex scita scabrinervia' does not get you there - you have to use the slow taxonomy/classification search.

Converting all taxon names with infraspecific ranks into A {strings} would be quite easy I guess. And then converting/collapsing all names with ranks to the rankless trinomial would be easy too.
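
A minimal sketch of what that two-step conversion might look like for a single identification, assuming a simplified record with hypothetical field names (`taxon_name`, `formula`, `verbatim`). The real Arctos tables are certainly more involved; this only illustrates that the original ranked string can be kept in the A {string} part while the linked name collapses to the rankless trinomial.

```python
from dataclasses import dataclass
from typing import Optional

# Small illustrative subset of rank markers (assumption of this sketch).
RANKS = {"subsp.", "var.", "f."}


def derank(name: str) -> str:
    """Collapse a ranked name to its rankless form."""
    return " ".join(t for t in name.split() if t not in RANKS)


@dataclass(frozen=True)
class Identification:
    """Hypothetical, simplified identification record."""
    taxon_name: str                 # the name the identification links to
    formula: str = "A"              # "A" or "A {string}"
    verbatim: Optional[str] = None  # the {string} part, when present


def migrate(ident: Identification) -> Identification:
    """Move the ranked string into {string} and link to the rankless name."""
    bare = derank(ident.taxon_name)
    # Leave records alone if they already use a formula other than plain "A"
    # or if the name carries no rank marker.
    if ident.formula != "A" or bare == ident.taxon_name:
        return ident
    return Identification(bare, "A {string}", ident.taxon_name)


old = Identification("Carex scita subsp. scabrinervia")
new = migrate(old)
print(new)

# Lossless in this sketch: the original string survives in `verbatim`.
assert new.verbatim == old.taxon_name
```

Because the original string is carried along unchanged, a search on the identification text can still be specific while a search on the name finds every variant.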

Please note that this would be a big change and all botanical users would need to be invited to discuss.

camwebb commented 1 month ago

@ArctosDB/taxonomy can we please discuss this at the next Zoom meeting?

dustymc commented 1 month ago

big change and all botanical users would need to be invited

Noted, no disagreement, I'd at least like to get a solid enough foundation to present something coherent (or kill this, whatever) from here, and thank you very much for helping. (I think this is sort of a meeting - maybe collision! - of specialities, and sometimes it's hard to find language which doesn't bear a bunch of unintended meaning there.)

have the same type specimen

Mind. Blown. So what the heck is the purpose of the type if it doesn't define the circumscription?! Anyone can ignore whatever's inconvenient while redescribing The Holy Artifact until it fits into their worldview, and good luck everyone else?! That can't possibly be right....

all we know is the name

How can you know it's a "valid" (I'm almost sure that's the wrong word, hopefully it carries the idea anyway) name if you don't have a publication?! Surely that's required by the ICBN?? (Or not, I'm sorta constantly impressed with what I don't seem to know about the Code I like to pretend I'm more familiar with, and it seems to be a lot less weird than the other thing...)

slow

https://github.com/ArctosDB/internal/issues/330 - I feel like that's the most trivial possible concern (certainly there are things much larger than Arctos which don't ever inspire feelings of 'slow'!), but also the least in my control (all it needs is $$, of which I control none), and perhaps the most unavoidable from a user's perspective (things work or don't).

Converting ... quite easy

I'm still not seeing any technical barriers, but there's clearly some usability (or something like it) expectation (or something, still) which would be addressed.

discuss this at the next Zoom meeting

Sounds good to me.

Jegelewicz commented 1 month ago

Added to the next meeting agenda

camwebb commented 1 month ago

all we know is the name

How can you know it's a "valid" (I'm almost sure that's the wrong word, hopefully it carries the idea anyway) name if you don't have a publication?

You are correct, and one could make a valid taxon concept of the original concept of a name by seeking the protolog (the original publication) and combining that with the name. However, in most name usage circumstances we don't know if the user (determiner) is applying the original taxon concept, or a concept (=~circumscription) by a subsequent reviser of the taxon (who correctly used the same name). So there is no valid way to convert a name into a concept without a specific publication. Hence the dummy publication (now deleted).

camwebb commented 1 month ago

For future readers. I deleted the dummy taxon concept from the demo because this is not the right approach. But here's a screenshot for posterity; see Identification 3:

[Screenshot: idents]