ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum

taxon validator #1529

Closed dustymc closed 6 years ago

dustymc commented 6 years ago

There's a new tool in data services which checks names against various taxonomy "authorities."

It needs to be plugged in to all taxon name creation.

Please comment if you know of any useful taxonomy validation services which could be added. Currently using:

  1. wikidata
  2. GNI
  3. WoRMS

dustymc commented 6 years ago

Added GBIF and EOL services

I pulled the ~5K newest names out of Arctos and ran them through the validator. 277 of them weren't in any service. I quickly scanned through Google (there's a link on the validator, but Google won't let me use their API) for those, and found 63 weird enough to file an Annotation (https://arctos.database.museum/info/reviewAnnotation.cfm?action=show&atype=taxon&guid_prefix=&reviewer_comment=&submitter=DLM&reviewer=). Many of them are data we knew would be potentially problematic when we created them, and I'm sure more are "real" but just obscure. Yay us! This should help make those data even better, and GREATLY simplify cleanup for new collections (where this all started).
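As a rough illustration of the "not in any service" check described above, here is a minimal sketch. It is not the Arctos validator's code: the function name, the `services` mapping, and the stub lookups are all hypothetical, and in-memory sets stand in for the remote services, which the real tool would query over the network.

```python
def names_missing_everywhere(names, services):
    """Return the names that no service recognizes.

    `services` maps a label to a callable taking a name and returning
    True when that service knows the name. Both are hypothetical
    stand-ins for real remote lookups.
    """
    missing = []
    for name in names:
        # Collect the labels of every service that reports a hit.
        hits = [label for label, lookup in services.items() if lookup(name)]
        if not hits:
            missing.append(name)
    return missing


# Made-up stub data: in-memory sets instead of real taxonomy services.
stub_service_a = {"Canis lupus", "Homo sapiens"}
stub_service_b = {"Homo sapiens"}
stub_services = {
    "service_a": lambda name: name in stub_service_a,
    "service_b": lambda name: name in stub_service_b,
}

candidates = ["Homo sapiens", "Canis lupus", "Examplus nonexistens"]
unmatched = names_missing_everywhere(candidates, stub_services)
```

A name found by even one service is treated as known here, so the leftover list only says "nobody else has this", not "this is wrong".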

Jegelewicz commented 6 years ago

Just for grins, I followed the link and picked this: Amphidonte texana

It doesn't appear that there are any records in Arctos using this ID and perhaps that is true for other stuff on the list. Could we just do away with those things that don't appear to be valid and aren't in use?

dustymc commented 6 years ago

Welcome to the eternal debate....

These are the latest entered, so they're probably there because someone wants to use them (or was going to use them and then discovered they're garbage...).

That one's used for http://arctos.database.museum/guid/UNM:ES:15147 - you can probably see it logged out.

Entering taxa is sort of a pain, so I've made a habit of snarfing up anything that looks like it might become useful. That makes bringing in new things easy(ier), and it also inevitably introduces garbage. EVERY source of taxonomy (except maybe viruses and bacteria) of which I'm aware contains garbage, either because nobody actually knows how to spell something, or they're trying to capture things that are flat wrong but still in popular usage (like Arctos), or because they have no idea where taxonomy ends and identification begins, or because they have cruddy data controls, or SOMETHING.

I'm completely open to discussion, but most days I tend to think the "grab everything" approach saves more work than it creates. (And for better or worse, it drives a ton of traffic to Arctos.)

And of course this service talks to some of those sources of junk data, some of which draw from Arctos. "Validator" is a really strong word to describe the situation. "Occasionally-successful junk-finder"?

Jegelewicz commented 6 years ago

Doh! Forgot about the being logged in part....

"Occasionally-successful junk-finder"

Perfect. I know for certain there are plant names that are referring to the same thing but are spelled two different ways (ae vs. i somewhere in the name). When we are at the point of having a taxonomy clean-up subgroup, they can help us make those decisions I guess...or not. In the meantime, those annotations might help, or they might just be at the very bottom of someone's very long to-do list. Either way, it's a start!

dustymc commented 6 years ago

https://github.com/ArctosDB/arctos/issues/757

select a.scientific_name, b.scientific_name
from taxon_name a, taxon_name b
where regexp_count(a.scientific_name,' ')=1
  and regexp_count(b.scientific_name,' ')=1
  and a.scientific_name!=b.scientific_name
  and a.scientific_name like '%us'
  and b.scientific_name like '%a'
  and substr(a.scientific_name,0,length(a.scientific_name)-2)=substr(b.scientific_name,0,length(b.scientific_name)-1)
;
...

Hoplistopsis geminatus Hoplistopsis geminata

4913 rows selected.
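The same -us/-a comparison can be sketched outside the database. This is only an illustration of the query's logic under the same assumptions (binomials with exactly one space, a final "us" versus a final "a" on an otherwise identical string). The function name is hypothetical, and grouping by stem replaces the self-join.

```python
from collections import defaultdict


def gender_variant_pairs(names):
    """Return sorted (us_form, a_form) pairs of binomials sharing a stem."""
    by_stem = defaultdict(lambda: {"us": set(), "a": set()})
    for name in names:
        # Binomials only, mirroring regexp_count(name, ' ') = 1.
        if name.count(" ") != 1:
            continue
        if name.endswith("us"):
            by_stem[name[:-2]]["us"].add(name)
        elif name.endswith("a"):
            by_stem[name[:-1]]["a"].add(name)
    pairs = []
    for forms in by_stem.values():
        # Every -us form pairs with every -a form on the same stem.
        for us_name in forms["us"]:
            for a_name in forms["a"]:
                pairs.append((us_name, a_name))
    return sorted(pairs)


sample = [
    "Hoplistopsis geminatus",
    "Hoplistopsis geminata",
    "Homo sapiens",
    "Canis lupus familiaris",  # two spaces, so skipped
]
variant_pairs = gender_variant_pairs(sample)
```

Like the SQL, this only says two strings differ by their ending; it says nothing about whether they refer to the same taxon.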

That's certainly not the only spelling variation.

And for that pair, globalnames doesn't know they're the same. They may not be: maybe one's only been used for flies and the other is a fossil tree or something, for all I know. There are at least tens of thousands of hemihomonyms in Arctos, and there's probably something double-weird about a few of them. We may use both of them in IDs, users probably look for one or the other, and NOBODY ends up where they want to be.

Yay taxonomy....

dustymc commented 6 years ago

This is now as integrated as I know how to make it.

[Two screenshots attached: 2018-05-22 at 3:26:14 pm and 2018-05-22 at 3:26:42 pm]

Closing.