ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum

taxonRank and aggregators #1338

Closed dustymc closed 6 years ago

dustymc commented 6 years ago

Without taxonRank, iDigBio "fixes" various taxonomy terms to random values which are sometimes completely unrelated to the original ID.

taxonRank is not a required field in DWC.

Arctos does not require taxa to be ranked (which is an accurate representation of taxonomy itself). Some identifications do not use taxa at all, others use multiple taxa, all of it may be ranked or not.

iDigBio's suggestion is to "add[] taxonRank as a required field in Arctos" which isn't possible or practical for many reasons.

When ranked taxonomy is available we fill out the appropriate "columns" in DWC - "Family," "Order" etc. From that we could find the most specific term which is ranked, but not "The taxonomic rank of the most specific name in the scientificName" (as specified in the Standard). The lowest ranked term also does not necessarily appear in the scientificName at all.

I can't quite see what we could do before exporting the DWC that wouldn't just be wrong in some instances. As always, I'm open to suggestions.

dustymc commented 6 years ago

thx/done

dustymc commented 6 years ago

broken

https://www.idigbio.org/portal/records/18805a8a-84d0-4ecd-8e56-013ce3a0b21f

"dwc:phylum": "Chordata",
"dwc:class": "Aves",
"dwc:scientificName": "Aves"

( "dwc:preparations": "long bone",)

screen shot 2018-08-15 at 5 21 45 pm

?

Jegelewicz commented 6 years ago

iDigBio receives many identifications with incomplete or non-existent classification data. We are as guilty as anyone of sending them such data (for an example, see http://arctos.database.museum/name/Childonias%20niger ). In an attempt to clarify and make these identifications searchable via means of higher taxa, iDigBio has created scripts to fill in or create the missing pieces. As with any script (we should know) there are things that will not turn out as expected. One of the scripts says that an identification that is monomial will be assumed to be a genus UNLESS Taxon_Rank says otherwise. In the case of the Aves, we are not giving iDigBio a Taxon_Rank, so the script first assumes we have left out part of the classification (order and family) and because there is no genus Aves, it then assumes we have misspelled Avus and assigns the classification you see as "broken" above.
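
The monomial rule described above can be sketched roughly as follows. This is not iDigBio's code: the function name `assumed_rank` and its refusal to guess for multi-word names are assumptions made only to illustrate why a missing rank turns "Aves" into a supposed genus.

```python
from typing import Optional


def assumed_rank(scientific_name: str, taxon_rank: Optional[str] = None) -> Optional[str]:
    """Hypothetical illustration of the rule described in this comment."""
    # A supplied rank always wins over any guess.
    if taxon_rank:
        return taxon_rank.strip().lower()
    # Without a rank, a one-word name is assumed to be a genus.
    if len(scientific_name.split()) == 1:
        return "genus"
    # The comment says nothing about longer names, so make no guess here.
    return None


print(assumed_rank("Aves"))           # genus (the guess behind the "broken" record)
print(assumed_rank("Aves", "class"))  # class
```

With the rank supplied, the guess is never made, which is the whole argument for sending Taxon_Rank.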

While we can debate the merits of iDigBio logic all day, we could just fix this issue by providing Taxon_Rank. It isn't really all that difficult for us to do and it will make our data look good.

Can we just do this please?

Jegelewicz commented 6 years ago

image

All of these records are being altered in some way when they reach iDigBio. This is only from collections to which I have access....

dustymc commented 6 years ago

I need details of WHAT and HOW, and I think we need input from at least @atrox10 if the "how" involves manually setting a value for each ID.

You are proposing only this:

Correct?

Jegelewicz commented 6 years ago

add identification.taxon_rank to the model and UI

Yes

push any values that appear there to DWC

Yes to dwc:taxonrank

details of WHAT and HOW

dustymc commented 6 years ago

be a bulkloader field

That's possible, but slightly less than trivial.

term associated with taxon name

Here's where I get lost. If there's a taxon name (and classification, via collection preferences) used in a straightforward manner and that classification contains some ranks, I can get taxon rank from it. If there's a term used in a less-straightforward manner, I can get the rank of the term, but it's not necessarily strictly the same "level" as the ID (a "Sorex cinereus ?" ID would return "species" - the assertion is more like "probably species." Better than "IDK, maybe a bug?"!) I'm not sure what you're hoping to accomplish with this addition - it might be useful in fringe cases (I'm not likely to find a defensible rank for "Mus musculus domesticus and Siphonaptera"), but this seems like a lot of work for that, and I don't think you can explicitly assert a defensible taxon rank to that either.

required when creating a new name

That would be a major change to the model and definitely require discussion.

solve the homonym issue

I'm still not real clear on what this is either.

For everything already in Arctos, we could populate with "species" if a bi or trinomial and use the lowest classification possible for monomials with an attached classification.

I can do that on the fly with no data additions.

NULL should be a valid entry, at least for now since that is what we are sending for everything at this point.

My scripts would return NULL for "Mus musculus domesticus and Siphonaptera" - and I think that's the only defensible value for something like that.
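
For discussion, the on-the-fly derivation might look roughly like the sketch below. It is not the production script: `infer_taxon_rank`, the simplified `RANK_ORDER` list, and the input shapes (the taxon names used in the ID, plus a rank-to-term classification for a single name) are all assumptions for illustration.

```python
from typing import Dict, List, Optional

# Assumed, simplified rank order from least to most specific.
RANK_ORDER = ["kingdom", "phylum", "class", "order", "family", "genus"]


def infer_taxon_rank(taxon_names: List[str], classification: Dict[str, str]) -> Optional[str]:
    # No taxon (an "A {string}" ID) or several taxa: no defensible single rank.
    if len(taxon_names) != 1:
        return None
    # Binomials and trinomials are treated as "species", as proposed above.
    if len(taxon_names[0].split()) >= 2:
        return "species"
    # Monomial: use the most specific rank present in the classification.
    present = [rank for rank in RANK_ORDER if classification.get(rank)]
    return present[-1] if present else None


aves = {"kingdom": "Animalia", "phylum": "Chordata", "class": "Aves"}
print(infer_taxon_rank(["Aves"], aves))                                     # class
print(infer_taxon_rank(["Sorex cinereus"], {}))                             # species
print(infer_taxon_rank(["Mus musculus domesticus", "Siphonaptera"], {}))    # None
```

An ID such as "Sorex cinereus ?" still uses the single taxon name Sorex cinereus, so a sketch like this returns "species" for it, with the caveat already noted above that the real assertion is closer to "probably species."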

The rest will need to be figured out by humans

See above - I doubt that's always going to be possible.

any of those we auto-populate may end up doing dumb stuff

Aside from the whole '"Sorex ?" is taxonRank genus' thing, the only way my scripts are going to find dumb stuff is if there's dumb stuff in the classification data. (In which case GIGO applies.)

Taxonomy Committee

Strictly speaking, this is an identification problem. (Or it's a bit of both, maybe.)

I'm still unclear regarding what an explicit assertion can do that we can't pull from the data.

Jegelewicz commented 6 years ago

I am fine with just pulling from the data. Anything else can be part of a larger taxonomy discussion.

dustymc commented 6 years ago

A first-pass is completed in production. Please let me know if you see anything the scripts should have figured out and I can update them. I'll start pushing these data to FLAT, where they'll be available to DWC - hopefully that'll be done and maintaining itself by early next week.

Attached are IDs for which I could not determine taxon rank. Most are A {string} (and not having a taxon is the point of that formula) but a fair number are just bad data in Arctos.

http://arctos.database.museum/name/Trivia%20nivea has no classification data at all

http://arctos.database.museum/name/Urosaurus%20bicarinatus%20anonymorphus has minimal data

http://arctos.database.museum/name/Myriapoda has a term ranked two ways

http://arctos.database.museum/name/Aves#Arctos (what started this?!) claims to be a genus....

Ditto http://arctos.database.museum/name/Phocidae

And there's a lot of standardization to be had in A {string} names - looks like there are about a dozen ways of saying "flake." Time to figure out formal non-Linnean taxonomies? @sjshirar

temp_notaxrank.csv.zip

Jegelewicz commented 6 years ago

http://arctos.database.museum/name/Aves#Arctos (what started this?!) claims to be a genus.... Ditto http://arctos.database.museum/name/Phocidae

This is what I complained about before. I am not able to delete that genus assertion no matter how hard I try. When I delete it it looks like it is gone, but when I leave and come back, it is there again.

Jegelewicz commented 6 years ago

http://arctos.database.museum/name/Trivia%20nivea has no classification data at all

Also has no specimens - delete?

Jegelewicz commented 6 years ago

http://arctos.database.museum/name/Urosaurus%20bicarinatus%20anonymorphus has minimal data

Filled in basic higher taxa.

Jegelewicz commented 6 years ago

http://arctos.database.museum/name/Myriapoda has a term ranked two ways

Removed the class term because Myriapoda is a subphylum. https://en.wikipedia.org/wiki/Myriapoda

But check it, because I get the flaky genus auto-suggest that, again, I cannot seem to delete....

dustymc commented 6 years ago

delete that genus assertion

It's a suggestion, not an assertion (until someone clicks save anyway!) - https://github.com/ArctosDB/arctos/issues/1188#issuecomment-314566131

screen shot 2018-09-03 at 9 23 09 am screen shot 2018-09-03 at 9 23 19 am screen shot 2018-09-03 at 9 23 30 am screen shot 2018-09-03 at 9 23 50 am

no specimens - delete

http://arctos.database.museum/SpecimenResults.cfm?anyTaxId=10942232 is a specimen using it (if your login allows that). The ONLY reason to delete a taxon is you're sure it's not validly published (and maybe not then if it's useful - http://handbook.arctosdb.org/documentation/taxonomy.html)

campmlc commented 6 years ago

I have been confused by that as well. Maybe we shouldn't make inaccurate suggestions that require an action on the part of the user to get rid of them, e.g. "delete this row?"

If there is no genus, could we have a pop-up suggestion box that says something like "You are submitting a classification without a genus name. Click here to add a genus or click here to accept the current classification and taxon ranks."

dustymc commented 6 years ago

https://github.com/ArctosDB/arctos/issues/1188#issuecomment-314566131.

The current behavior seems generally less-evil than any alternatives I've found, and it arose from data problems. My increasingly-strong preference remains to deprecate the single-record edit option altogether and pipe everything through the bulkloader (which hopefully would mostly be an extension of the hierarchical editor, but that approach does not constrain us to local hierarchical data).

@ejbrock has been making lots of bird taxonomy edits via the hierarchical tool, and it's likely that this has been making things harder to find than they need to be. It's absolutely not possible to be consistent when editing 2.5 million records one by one, and it's absolutely not possible to be inconsistent within a hierarchical structure. The single-record edits which introduce inconsistency fracture hierarchies, which both hides specimens from users and makes future large-scale edits difficult. (What should be a single hierarchy becomes one - often a very short one - for each inconsistency.)
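
A toy illustration of the fracturing, using made-up records (the specimen ids and the misspelled order are hypothetical, not Arctos data): one inconsistent single-record edit is enough to split what should be one branch into two.

```python
from collections import defaultdict

# Hypothetical records: (specimen id, classification path from class to genus).
records = [
    ("bird-1", ("Aves", "Charadriiformes", "Laridae", "Chlidonias")),
    ("bird-2", ("Aves", "Charadriiformes", "Laridae", "Chlidonias")),
    # One single-record edit that spells the order differently.
    ("bird-3", ("Aves", "Charadriformes", "Laridae", "Chlidonias")),
]


def group_by_prefix(rows, depth):
    """Group specimen ids by the first `depth` terms of their path."""
    branches = defaultdict(list)
    for specimen_id, path in rows:
        branches[path[:depth]].append(specimen_id)
    return dict(branches)


by_order = group_by_prefix(records, depth=2)
print(len(by_order))  # 2
```

A search under the correctly spelled order finds only two of the three specimens, and any later bulk edit has to be repeated for each stray branch.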

Jegelewicz commented 6 years ago

It's a suggestion, not an assertion (until someone clicks save anyway!) - #1188 (comment)

So, when someone goes to add the author name, this "suggestion" will get saved if they don't really know to delete it every time. That seems a little evil to me.

Jegelewicz commented 6 years ago

I am still really unclear on how this bulkloader works, especially if I just need to add one taxon. This is something that anyone with taxonomy access needs to be trained on.

dustymc commented 6 years ago

a little evil

Absolutely, but less evil than the alternative. (Most monomials are genera - I'm just playing the odds, and always up for better ideas.)

bulkloader ... add one taxon

For that use case, probably sort of a pain. I think we get to pick our poison - is requiring a lot of work for a "simple" task less-evil than providing a path which consistently introduces inconsistent data? I'm leaning that way; tiny bits of garbage find ways to propagate out into huge messes.

Jegelewicz commented 6 years ago

Would it be possible to add TAXON_RANK in the Non-Classification table that you could then use to modify your suggestions? If I had "Class" in the TAXON_RANK field, then you would know not to suggest anything below Class. Suggested controlled vocabulary for this field:

Kingdom Phylum Class Order Family Genus Species

Like this capture
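
To make the idea concrete, here is a minimal sketch of how a TAXON_RANK value could cap the suggestions. The function name `suggestible_ranks` and the fallback for a missing value are assumptions; nothing like this exists in Arctos as far as this thread shows.

```python
# Vocabulary as suggested in this comment, from least to most specific.
TAXON_RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]


def suggestible_ranks(taxon_rank):
    """Return the ranks a suggestion may use for a name of the given rank."""
    if taxon_rank not in TAXON_RANKS:
        # No (or unknown) TAXON_RANK: nothing to constrain the suggestion with.
        return list(TAXON_RANKS)
    # Never suggest anything below the name's own rank.
    return TAXON_RANKS[: TAXON_RANKS.index(taxon_rank) + 1]


print(suggestible_ranks("Class"))  # ['Kingdom', 'Phylum', 'Class']
```

With "Class" recorded for Aves, a genus row would never be offered.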

campmlc commented 6 years ago

I agree with Teresa. Monomials in parasite collections include genera, phyla, classes, and families. There is no safe assumption there.

dustymc commented 6 years ago

That's possible, but...

1) I'm not sure I see the point - would it ever be anything other than the lowest term in the hierarchy?
2) That isn't really useful for the complex/problem IDs, and IDs are what we share via DWC.

assumptions

There are 159,746 monomials in Arctos. 150,646 of them are ranked 'genus' in a local classification. ~94% of the time, the "monomials are genera" assumption is correct (I assume that's the reference??).
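
The arithmetic behind the ~94% figure, for anyone who wants to check it (only the two counts above are used; the variable names are mine):

```python
monomials = 159_746     # all monomials in Arctos, from the count above
ranked_genus = 150_646  # of those, ranked 'genus' in a local classification

share = ranked_genus / monomials
print(f"{share:.1%}")            # 94.3%
print(monomials - ranked_genus)  # 9100 monomials for which the guess is wrong
```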

dustymc commented 6 years ago

This is now implemented in production, and the data are available as taxonRank in the IPT view.

This also turned out to be the straw that broke the camel's DBMS_COMPARISON's back; we had to significantly change the way in which public data are processed, and that should now happen more or less in real-time.

Here are the data from FLAT:

UAM@ARCTOS> select taxon_rank || ' @ ' || count(*) from flat group by taxon_rank order by count(*);

TAXON_RANK||'@'||COUNT(*)
------------------------------------------------------------------------------------------------------------------------
unranked clade @ 1
subdivision @ 1
infraorder @ 3
epifamily @ 4
infraclass @ 4
subpspecies @ 4
forma @ 13
subphylum @ 22
superorder @ 93
subclass @ 552
hyporder @ 574
tribe @ 633
superclass @ 1936
superfamily @ 2665
suborder @ 3301
variety @ 5729
subfamily @ 7321
kingdom @ 10831
phylum @ 16500
class @ 23381
order @ 94951
genus @ 128756
family @ 164410
 @ 561031
subspecies @ 564074
species @ 1931137

26 rows selected.

And a bit of funky data that made it in.

select taxon_rank,scientific_name,guid from flat where taxon_rank not in (select TAXON_TERM from CTTAXON_TERM where IS_CLASSIFICATION=1) order by taxon_rank,scientific_name;

TAXON_RANK
------------------------------------------------------------------------------------------------------------------------
SCIENTIFIC_NAME
------------------------------------------------------------------------------------------------------------------------
GUID
------------------------------------------------------------------------------------------------------------------------
subpspecies
Eclogavena quadrimaculata thielei
DMNS:Inv:17947

subpspecies
Eclogavena quadrimaculata thielei
DMNS:Inv:26718

subpspecies
Eclogavena quadrimaculata thielei
DMNS:Inv:26806

subpspecies
Ficus papyratia lindae
DMNS:Inv:16814

unranked clade
Merriamosauria
UAM:ES:2437
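
The same check can be sketched outside the database. The vocabulary set below is a hypothetical stand-in for the classification terms in CTTAXON_TERM, and the last two rows are invented for contrast; only the first two rows come from the output above.

```python
# Hypothetical stand-in for CTTAXON_TERM where IS_CLASSIFICATION=1.
classification_terms = {
    "kingdom", "phylum", "class", "order", "family", "genus", "species", "subspecies",
}

# (taxon_rank, scientific_name, guid)
rows = [
    ("unranked clade", "Merriamosauria", "UAM:ES:2437"),
    ("subpspecies", "Eclogavena quadrimaculata thielei", "DMNS:Inv:17947"),
    ("species", "Sorex cinereus", "hypothetical:1"),     # invented row
    (None, "hypothetical unranked ID", "hypothetical:2"),  # invented row
]


def unexpected_ranks(rows, vocabulary):
    # As with SQL NOT IN, rows with no rank (NULL) are not reported.
    return sorted(r for r in rows if r[0] is not None and r[0] not in vocabulary)


for rank, name, guid in unexpected_ranks(rows, classification_terms):
    print(rank, "|", name, "|", guid)
```

This prints the "subpspecies" row followed by the "unranked clade" row, mirroring the ordering of the query.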

@sharpphyl @KatherineLAnderson

sharpphyl commented 5 years ago

subpspecies Eclogavena quadrimaculata thielei DMNS:Inv:17947

subpspecies Eclogavena quadrimaculata thielei DMNS:Inv:26718

subpspecies Eclogavena quadrimaculata thielei DMNS:Inv:26806

subpspecies Ficus papyratia lindae DMNS:Inv:16814

It looks like switching from Arctos to WoRMS (via Arctos) has solved the above issue as both taxa have complete classifications now from WoRMS. Let me know if I'm missing anything.

KatherineLAnderson commented 5 years ago

unranked clade Merriamosauria UAM:ES:2437

Merriamosauria is a valid but unranked clade. Its classification in Arctos taxonomy is correct.