Closed dustymc closed 6 years ago
thx/done
broken
https://www.idigbio.org/portal/records/18805a8a-84d0-4ecd-8e56-013ce3a0b21f
"dwc:phylum": "Chordata",
"dwc:class": "Aves",
"dwc:scientificName": "Aves"
( "dwc:preparations": "long bone",
)
?
iDigBio receives many identifications with incomplete or non-existent classification data. We are as guilty as anyone of sending them such data (for an example, see http://arctos.database.museum/name/Childonias%20niger ). In an attempt to clarify and make these identifications searchable via means of higher taxa, iDigBio has created scripts to fill in or create the missing pieces. As with any script (we should know) there are things that will not turn out as expected. One of the scripts says that an identification that is monomial will be assumed to be a genus UNLESS Taxon_Rank says otherwise. In the case of the Aves, we are not giving iDigBio a Taxon_Rank, so the script first assumes we have left out part of the classification (order and family) and because there is no genus Aves, it then assumes we have misspelled Avus and assigns the classification you see as "broken" above.
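The monomial rule described above can be sketched roughly as follows. This only illustrates the behavior as described in this comment. It is not iDigBio's actual script, and the function name `guess_rank` and its arguments are hypothetical.

```python
def guess_rank(scientific_name, taxon_rank=None):
    """Illustrative only: mimic the rule 'a monomial is assumed to be a
    genus unless Taxon_Rank says otherwise'. Not iDigBio's real code."""
    if taxon_rank:
        # An explicitly supplied rank always wins over the heuristic.
        return taxon_rank.strip().lower()
    if len(scientific_name.split()) == 1:
        # No rank supplied and only one word: assume genus.
        return "genus"
    # Multi-word names are outside the scope of this sketch.
    return None
```

Under this sketch, `guess_rank("Aves")` falls into the genus branch, while `guess_rank("Aves", "Class")` does not, which is why sending Taxon_Rank would sidestep the problem.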
While we can debate the merits of iDigBio logic all day, we could just fix this issue by providing Taxon_Rank. It isn't really all that difficult for us to do and it will make our data look good.
Can we just do this please?
All of these records are being altered in some way when they reach iDigBio. This is only from collections to which I have access.
I need details of WHAT and HOW, and I think we need input from at least @atrox10 if the "how" involves manually setting a value for each ID.
You are proposing only this:
Correct?
add identification.taxon_rank to the model and UI
Yes
push any values that appear there to DWC
Yes to dwc:taxonrank
details of WHAT and HOW
be a bulkloader field
That's possible, but slightly less than trivial.
term associated with taxon name
Here's where I get lost. If there's a taxon name (and classification, via collection preferences) used in a straightforward manner and that classification contains some ranks, I can get taxon rank from it. If there's a term used in a less-straightforward manner, I can get the rank of the term, but it's not necessarily strictly the same "level" as the ID (a "Sorex cinereus ?" ID would return "species" - the assertion is more like "probably species." Better than "IDK, maybe a bug?"!) I'm not sure what you're hoping to accomplish with this addition - it might be useful in fringe cases (I'm not likely to find a defensible rank for "Mus musculus domesticus and Siphonaptera"), but this seems like a lot of work for that, and I don't think you can explicitly assert a defensible taxon rank to that either.
required when creating a new name
That would be a major change to the model and definitely require discussion.
solve the homonym issue
I'm still not real clear on what this is either.
For everything already in Arctos, we could populate with "species" if a bi or trinomial and use the lowest classification possible for monomials with an attached classification.
I can do that on the fly with no data additions.
NULL should be a valid entry, at least for now since that is what we are sending for everything at this point.
My scripts would return NULL for "Mus musculus domesticus and Siphonaptera" - and I think that's the only defensible value for something like that.
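A minimal sketch of the on-the-fly rule discussed in this exchange, taking the proposal literally ("species" for a bi- or trinomial, lowest available classification rank for a monomial, NULL for anything compound). This is not the production script; the function name, the `classification` dict shape, and the connector list are assumptions made for illustration.

```python
# Ranks from highest to lowest; a hypothetical subset for this sketch.
RANK_ORDER = ["kingdom", "phylum", "class", "order", "family", "genus"]
# Words that signal an identification built from more than one taxon.
CONNECTORS = {"and", "or", "x"}

def derive_taxon_rank(identification, classification=None):
    """Return a rank string, or None when no defensible rank exists."""
    # Ignore an uncertainty marker such as the "?" in "Sorex cinereus ?".
    tokens = [t for t in identification.split() if t != "?"]
    if not tokens or any(t.lower() in CONNECTORS for t in tokens):
        return None  # compound or empty ID: NULL is the only safe answer
    if not all(t.isalpha() for t in tokens):
        return None  # free-text or otherwise unparseable ID
    if len(tokens) in (2, 3):
        return "species"  # literal reading of the bi/trinomial proposal
    if len(tokens) == 1:
        # Monomial: use the lowest rank present in the classification.
        ranked = [r for r in RANK_ORDER if (classification or {}).get(r)]
        return ranked[-1] if ranked else None
    return None
```

With a classification of kingdom Animalia, phylum Chordata, class Aves, the ID "Aves" comes back as "class"; "Mus musculus domesticus and Siphonaptera" comes back as `None`.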
The rest will need to be figured out by humans
See above - I doubt that's always going to be possible.
any of those we auto-populate may end up doing dumb stuff
Aside from the whole '"Sorex ?" is taxonRank genus' thing, the only way my scripts are going to find dumb stuff is if there's dumb stuff in the classification data. (In which case GIGO applies.)
Taxonomy Committee
Strictly speaking, this is an identification problem. (Or it's a bit of both, maybe.)
I'm still unclear regarding what an explicit assertion can do that we can't pull from the data.
I am fine with just pulling from the data. Anything else can be part of a larger taxonomy discussion.
A first pass is complete in production. Please let me know if you see anything the scripts should have figured out and I can update them. I'll start pushing these data to FLAT, where they'll be available to DWC - hopefully that'll be done and maintaining itself by early next week.
Attached are IDs for which I could not determine taxon rank. Most are A {string} (and not having a taxon is the point of that formula) but a fair number are just bad data in Arctos.
http://arctos.database.museum/name/Trivia%20nivea has no classification data at all
http://arctos.database.museum/name/Urosaurus%20bicarinatus%20anonymorphus has minimal data
http://arctos.database.museum/name/Myriapoda has a term ranked two ways
http://arctos.database.museum/name/Aves#Arctos (what started this?!) claims to be a genus....
Ditto http://arctos.database.museum/name/Phocidae
And there's a lot of standardization to be had in A {string} names - looks like there are about a dozen ways of saying "flake." Time to figure out formal non-Linnean taxonomies? @sjshirar
http://arctos.database.museum/name/Aves#Arctos (what started this?!) claims to be a genus.... Ditto http://arctos.database.museum/name/Phocidae
This is what I complained about before. I am not able to delete that genus assertion no matter how hard I try. When I delete it it looks like it is gone, but when I leave and come back, it is there again.
http://arctos.database.museum/name/Trivia%20nivea has no classification data at all
Also has no specimens - delete?
http://arctos.database.museum/name/Urosaurus%20bicarinatus%20anonymorphus has minimal data
Filled in basic higher taxa.
http://arctos.database.museum/name/Myriapoda has a term ranked two ways
Removed the class term because Myriapoda is a subphylum. https://en.wikipedia.org/wiki/Myriapoda
But check it, because I again get the flaky genus auto-suggest, which I cannot seem to delete.
delete that genus assertion
It's a suggestion, not an assertion (until someone clicks save anyway!) - https://github.com/ArctosDB/arctos/issues/1188#issuecomment-314566131
no specimens - delete
http://arctos.database.museum/SpecimenResults.cfm?anyTaxId=10942232 is a specimen using it (if your login allows that). The ONLY reason to delete a taxon is you're sure it's not validly published (and maybe not then if it's useful - http://handbook.arctosdb.org/documentation/taxonomy.html)
I have been confused by that as well. Maybe we shouldn't make inaccurate suggestions that require an action on the part of the user to get rid of them, e.g. "delete this row"?
If there is no genus, could we have a pop-up suggestion box that says something like "You are submitting a classification without a genus name. Click here to add a genus or click here to accept the current classification and taxon ranks."
https://github.com/ArctosDB/arctos/issues/1188#issuecomment-314566131.
The current behavior seems generally less-evil than any alternatives I've found, and it arose from data problems. My increasingly-strong preference remains to deprecate the single-record edit option altogether and pipe everything through the bulkloader (which hopefully would mostly be an extension of the hierarchical editor, but that approach does not constrain us to local hierarchical data).
@ejbrock has been making lots of bird taxonomy edits via the hierarchical tool, and it's likely that this has been making things harder to find than they need to be. It's absolutely not possible to be consistent when editing 2.5 million records one by one, and it's absolutely not possible to be inconsistent within a hierarchical structure. The single-record edits which introduce inconsistency fracture hierarchies, which both hides specimens from users and makes future large-scale edits difficult. (What should be a single hierarchy becomes one - often a very short one - for each inconsistency.)
It's a suggestion, not an assertion (until someone clicks save anyway!) - #1188 (comment)
So, when someone goes to add the author name, this "suggestion" will get saved if they don't really know to delete it every time. That seems a little evil to me.
I am still really unclear on how this bulkloader works, especially if I just need to add one taxon. This is something that anyone with taxonomy access needs to be trained on.
a little evil
Absolutely, but less evil than the alternative. (Most monomials are genera - I'm just playing the odds, and always up for better ideas.)
bulkloader ... add one taxon
For that use case, probably sort of a pain. I think we get to pick our poison - is requiring a lot of work for a "simple" task less-evil than providing a path which consistently introduces inconsistent data? I'm leaning that way; the tiny bits of garbage find ways to propagate out into huge messes.
Would it be possible to add TAXON_RANK in the Non-Classification table that you could then use to modify your suggestions? If I had "Class" in the TAXON_RANK field, then you would know not to suggest anything below Class. Suggested controlled vocabulary for this field:
Kingdom Phylum Class Order Family Genus Species
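Like this: [image: capture] https://user-images.githubusercontent.com/5725767/45121465-77180c00-b11e-11e8-9140-f1495b198f08.JPG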
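One way the proposed TAXON_RANK value could limit what gets suggested, as a rough sketch. The rank list is the controlled vocabulary suggested above; the function name and the idea of returning a list of allowed ranks are hypothetical, not anything Arctos currently does.

```python
# Suggested controlled vocabulary, ordered from highest to lowest rank.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def suggestible_ranks(taxon_rank=None):
    """Return the ranks a UI could still suggest terms for.

    With no TAXON_RANK, nothing is ruled out. With one, nothing below
    the stated rank should be suggested.
    """
    if taxon_rank is None:
        return list(RANKS)
    rank = taxon_rank.strip().lower()
    if rank not in RANKS:
        raise ValueError(f"not in controlled vocabulary: {taxon_rank!r}")
    # Keep the stated rank and everything above it.
    return RANKS[: RANKS.index(rank) + 1]
```

For a name flagged "Class", the result stops at class, so a genus suggestion would never be offered.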
I agree with Teresa. Most monomials in parasite collections include genera, phyla, classes, and families. There is no safe assumption there.
That's possible, but...
1) I'm not sure I see the point - would it ever be anything other than the lowest term in the hierarchy?
2) That isn't really useful for the complex/problem IDs, and IDs are what we share via DWC.
assumptions
There are 159,746 monomials in Arctos. 150,646 of them are ranked 'genus' in a local classification. ~94% of the time, the "monomials are genera" assumption is correct (I assume that's the reference??).
This is now implemented in production, and the data are available as taxonRank in the IPT view.
This also turned out to be the straw that broke the camel's DBMS_COMPARISON's back; we had to significantly change the way in which public data are processed, and that should now happen more or less in real-time.
Here are the data from FLAT:
UAM@ARCTOS> select taxon_rank || ' @ ' || count(*) from flat group by taxon_rank order by count(*);
TAXON_RANK||'@'||COUNT(*)
------------------------------------------------------------------------------------------------------------------------
unranked clade @ 1
subdivision @ 1
infraorder @ 3
epifamily @ 4
infraclass @ 4
subpspecies @ 4
forma @ 13
subphylum @ 22
superorder @ 93
subclass @ 552
hyporder @ 574
tribe @ 633
superclass @ 1936
superfamily @ 2665
suborder @ 3301
variety @ 5729
subfamily @ 7321
kingdom @ 10831
phylum @ 16500
class @ 23381
order @ 94951
genus @ 128756
family @ 164410
@ 561031
subspecies @ 564074
species @ 1931137
26 rows selected.
And a bit of funky data that made it in.
select taxon_rank,scientific_name,guid from flat where taxon_rank not in (select TAXON_TERM from CTTAXON_TERM where IS_CLASSIFICATION=1) order by taxon_rank,scientific_name;
TAXON_RANK      SCIENTIFIC_NAME                    GUID
--------------  ---------------------------------  --------------
subpspecies     Eclogavena quadrimaculata thielei  DMNS:Inv:17947
subpspecies     Eclogavena quadrimaculata thielei  DMNS:Inv:26718
subpspecies     Eclogavena quadrimaculata thielei  DMNS:Inv:26806
subpspecies     Ficus papyratia lindae             DMNS:Inv:16814
unranked clade  Merriamosauria                     UAM:ES:2437
@sharpphyl @KatherineLAnderson
subpspecies Eclogavena quadrimaculata thielei DMNS:Inv:17947
subpspecies Eclogavena quadrimaculata thielei DMNS:Inv:26718
subpspecies Eclogavena quadrimaculata thielei DMNS:Inv:26806
subpspecies Ficus papyratia lindae DMNS:Inv:16814
It looks like switching from Arctos to WoRMS (via Arctos) has solved the above issue as both taxa have complete classifications now from WoRMS. Let me know if I'm missing anything.
unranked clade Merriamosauria UAM:ES:2437
@sharpphyl @KatherineLAnderson
Merriamosauria is a valid but unranked clade. Its classification in Arctos taxonomy is correct.
Without taxonRank, iDigBio "fixes" various taxonomy terms to random values which are sometimes completely unrelated to the original ID.
taxonRank is not a required field in DWC.
Arctos does not require taxa to be ranked (which is an accurate representation of taxonomy itself). Some identifications do not use taxa at all and others use multiple taxa, any of which may be ranked or not.
iDigBio's suggestion is to "add[] taxonRank as a required field in Arctos", which isn't possible or practical for many reasons.
When ranked taxonomy is available we fill out the appropriate "columns" in DWC - "Family," "Order" etc. From that we could find the most specific term which is ranked, but not "The taxonomic rank of the most specific name in the scientificName" (as specified in the Standard). The lowest ranked term also does not necessarily appear in the scientificName at all.
I can't quite see what we could do before exporting the DWC that wouldn't just be wrong in some instances. As always, I'm open to suggestions.
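To make the mismatch concrete, here is a small sketch of "find the most specific ranked term" and a check of whether that term actually appears in scientificName. The record layout, the function names, and the sample classification values are assumptions for illustration only, not Arctos or DWC export code.

```python
# DWC classification columns, ordered from highest to lowest rank.
DWC_RANK_COLUMNS = ["kingdom", "phylum", "class", "order", "family", "genus"]

def most_specific_ranked_term(record):
    """Return (rank, term) for the lowest filled-in rank column."""
    for rank in reversed(DWC_RANK_COLUMNS):
        term = record.get(rank)
        if term:
            return rank, term
    return None, None

def rank_describes_scientific_name(record):
    """True only if the most specific ranked term is part of the name."""
    _, term = most_specific_ranked_term(record)
    name_words = record.get("scientificName", "").split()
    return term is not None and term in name_words

# Hypothetical values: an unranked clade with ranked terms above it.
clade = {"class": "Reptilia", "order": "Ichthyosauria",
         "scientificName": "Merriamosauria"}
# Here the lowest ranked term and the name happen to coincide.
aves = {"kingdom": "Animalia", "phylum": "Chordata", "class": "Aves",
        "scientificName": "Aves"}
```

For the `aves` record the derived rank does describe the name; for the `clade` record the lowest ranked term is an order that never appears in scientificName, so exporting "order" as taxonRank would be wrong.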