taxonomy: dynamic classification sources

dustymc commented 5 years ago

Managing all animals in the "Arctos" classification is often problematic (Ex: http://arctos.database.museum/name/Cepolidae), and paleo collections have reintroduced a bunch of plants-and-stuff that will surely find a way to clash sooner or later.

Managing classifications in much smaller chunks avoids taxonomy-at-scale weirdness, but

1) Most collections have cataloged a few outliers and need taxonomy for them
2) I think everyone wants to pull expertise, which involves sharing a Source with the experts, which involves huge cumbersome groups of classifications

https://github.com/ArctosDB/arctos/issues/1852 would fix this: it doesn't matter which source a classification is in if you can select it individually, but I don't think we're realistically going to compile the data or use a taxon concepts system.

From https://github.com/ArctosDB/arctos/issues/1852#issuecomment-484545346

Potential not-concepts solution [to the perceived homonym problem]: create "dynamic" sources which are based on collection-defined criteria and auto-refresh themselves periodically. Selection could cross sources, include things like taxon_status or various ranks, etc. Data would be managed in the shared (eg, "Arctos") Source(s) and the dynamic source would be refreshed from updates.

Dynamic sources would address the idea that the scale at which taxonomy is best managed and the scale at which taxonomy is used are not necessarily the same.

Simplest case, a teaching collection might pull relevant animals from "Arctos" and relevant plants from "Arctos Plants."

DMNS:Inv could

pull everything, or perhaps only everything they need, from "WoRMS (via Arctos)"
pull from "Arctos" for any IDs that still don't have a classification
pull from NCBI for any IDs that still don't have a classification
pull from ... for any IDs that still don't have a classification
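The pull order above could be sketched roughly like this. It is only an illustration of the idea, not Arctos code: `build_dynamic_source`, the dict layout, and the sample classifications are all assumptions made up for the example.

```python
def build_dynamic_source(needed_names, sources_in_order):
    """Merge classifications for needed_names, taking each name from the
    first source (in priority order) that has one.

    sources_in_order: list of (source_name, {taxon_name: classification}).
    Returns (merged, missing), where merged maps
    taxon_name -> (source_name, classification).
    """
    merged = {}
    remaining = set(needed_names)
    for source_name, classifications in sources_in_order:
        # Only consider names that still don't have a classification.
        for name in sorted(remaining):
            if name in classifications:
                merged[name] = (source_name, classifications[name])
        remaining.difference_update(merged)
    return merged, remaining


# Hypothetical sample data; real sources hold many more names and terms.
sources = [
    ("WoRMS (via Arctos)", {"Mytilus edulis": {"family": "Mytilidae"}}),
    ("Arctos", {
        "Mytilus edulis": {"family": "Mytilidae"},
        "Achatinella bryonii": {"family": "Achatinellidae"},
    }),
]
merged, missing = build_dynamic_source(
    ["Mytilus edulis", "Achatinella bryonii", "Unknown name"], sources
)
```

Names left in `missing` are the ones no listed source could classify; a periodic refresh would simply rerun the same merge.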

Someone or some coalition could manage any group (species+subspecies, family, phylum, 'stuff we need that isn't in some other source' eg land snails, etc.) in the system of their choosing (including Arctos, the Arctos Hierarchical Editor, a desktop app, a remote system like WoRMS, etc.), then anyone else could pull those data or parts of them into their "preferred" classification.

There are no real barriers to this; it will fit in the current structure. We just need some (complicated and expensive, probably) SQL-or-something to build and maintain the merged classifications.

Nobody would be forced into this; the capability would not necessarily change anything for any existing collection, it would just add the possibility of combining existing data.

There are potentially consistency problems - maybe the Murids source classification will include superfamily and the Cricetids source classification will not, resulting in inconsistent rodents - but I suspect that would still be more overall consistent than the current data (in which individual names are often outliers).
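One way to surface that kind of inconsistency would be to compare which ranks each contributing source supplies. This is a sketch under assumptions: the function names, the `{taxon_name: (source_name, {rank: term})}` layout, and the sample rodents are illustrative only.

```python
from collections import defaultdict


def ranks_by_source(merged):
    """Collect the set of ranks that each contributing source supplies.

    merged: {taxon_name: (source_name, {rank: term})}
    """
    ranks = defaultdict(set)
    for source_name, classification in merged.values():
        ranks[source_name].update(classification)
    return dict(ranks)


def inconsistent_ranks(merged):
    """Return ranks supplied by some, but not all, contributing sources."""
    per_source = ranks_by_source(merged)
    if not per_source:
        return set()
    all_ranks = set().union(*per_source.values())
    shared = set.intersection(*per_source.values())
    return all_ranks - shared


# Hypothetical merged classification built from two smaller sources.
merged_rodents = {
    "Mus musculus": (
        "Murids",
        {"order": "Rodentia", "superfamily": "Muroidea", "family": "Muridae"},
    ),
    "Peromyscus maniculatus": (
        "Cricetids",
        {"order": "Rodentia", "family": "Cricetidae"},
    ),
}
flagged = inconsistent_ranks(merged_rodents)
```

Here `flagged` contains only `superfamily`, the rank one source uses and the other does not.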

It is worth comparing the scale of taxonomy in Arctos with the scale of taxonomy used by collections here; dynamic classifications could result in much smaller datasets, which might support more discovery methods.

UAM@ARCTOS> select count(*) numberOfNamesInArctos from taxon_name;

NUMBEROFNAMESINARCTOS
---------------------
          3,408,094

UAM@ARCTOS> select count(distinct(classification_id)) numberOfClassifnsInArctos from taxon_term;

NUMBEROFCLASSIFNSINARCTOS
-------------------------
         15,842,709

UAM@ARCTOS>  select count(distinct(taxon_name_id)) numberManagedTaxa from taxon_term where source='Arctos';

NUMBERMANAGEDTAXA
-----------------
      1,707,718

UAM@ARCTOS> select count(distinct(taxon_name_id)) numberUsedTaxa from identification_taxonomy;

NUMBERUSEDTAXA
--------------
    109,575
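For a sense of proportion, the counts above work out to roughly 3% of all names, and roughly 6% of the names classified in the "Arctos" source, actually being used in identifications. The arithmetic, with the numbers copied from the query output (the variable names are just labels for this example):

```python
# Counts copied from the query output above.
names_in_arctos = 3_408_094
managed_taxa = 1_707_718  # names with terms in the "Arctos" source
used_taxa = 109_575       # names used in identifications

used_of_all = used_taxa / names_in_arctos
used_of_managed = used_taxa / managed_taxa

print(f"used / all names:     {used_of_all:.1%}")
print(f"used / managed names: {used_of_managed:.1%}")
```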

select
  guid_prefix,
  to_char(count(distinct(taxon_name_id)),'999,999,999,999')  numberUsedNames
from
  collection,
  cataloged_item,
  identification,
  identification_taxonomy
where
  collection.collection_id=cataloged_item.collection_id and
  cataloged_item.collection_object_id=identification.collection_object_id and
  identification.identification_id=identification_taxonomy.identification_id
group by
  guid_prefix
order by
  count(distinct(taxon_name_id))
;

GUID_PREFIX      NUMBERUSEDNAMES
-------------------- ------------------------------------------------
BYU:Herp                1
NMU:Para                1
MSBObs:Mamm             1
UAM:Art                 1
UAM:Env                 1
UCSC:Herp               1
UTEPObs:Ento            1
CHAS:Herp               2
UWYMV:Egg               2
DGR:Ento                3
NBSB:Bird               4
OWU:Para                5
KNWRObs:Fish            9
UAMObs:Fish            10
COA:Ento               13
MLZ:Herb               14
MVZObs:Mamm            16
MVZObs:Herp            19
COA:Herp               20
OWU:Fish               20
KNWR:Inv               22
MLZ:Egg                23
OWU:Bird               25
UTEP:Zoo               25
CHAS:Fish              26
UAMObs:Mamm            27
UAM:EH                 28
DMNS:Herp              36
ASNHC:Mamm             39
UTEP:Arc               40
UWYMV:Herp             45
UAM:Herp               49
UTEP:Fish              51
WNMU:Fish              60
UCSC:Mamm              61
DMNS:Para              61
COA:Mamm               61
UWYMV:Fish             62
COA:Egg                79
OWU:Mamm               79
OWU:Rept               89
DGR:Mamm               92
UNR:Bird               99
UCM:Obs               115
UTEPObs:Herp          122
CHAS:EH               125
UCSC:Bird             128
UAM:Arc               141
UTEP:Teach            145
NMU:Bird              147
CHAS:Herb             148
NMU:Mamm              149
UNR:Herp              151
UAMObs:Bird           163
MLZ:Mamm              194
UNR:Mamm              211
UNR:Fish              232
COA:Bird              281
UWYMV:Mamm            287
MVZ:Hild              310
UTEP:Mamm             318
APSU:Herp             319
UTEP:HerpOS           330
USNPC:Para            343
OWU:ES                353
WNMU:Bird             422
MSB:Herp              453
UNM:ES                454
DGR:Bird              491
MSB:Fish              506
WNMU:Mamm             518
MVZObs:Bird           560
UWBM:Herp             574
ALMNH:ES              589
UCM:Egg               591
UTEP:Bird             595
UWYMV:Bird            599
UAM:Fish              627
UMZM:Mamm             639
UTEP:ES               665
CHAS:Mamm             675
UAM:Mamm              698
UMNH:Mamm             720
UAM:Alg               723
MSB:Para              730
DMNS:Egg              746
MSB:Host              802
UCM:Fish              804
UMNH:Herp             819
KWP:Ento              880
KNWR:Herb             895
CHAS:Teach            957
UTEP:Ento             963
UMZM:Bird             967
UWBM:Mamm             973
UAM:ES                986
DMNS:Mamm             991
CHAS:Egg              991
UMNH:Bird           1,074
UTEP:Herp           1,077
UTEP:Inv            1,191
KNWR:Ento           1,409
UCM:Bird            1,437
UCM:Mamm            1,448
CHAS:Bird           2,335
MLZ:Bird            2,411
DMNS:Bird           2,473
UAM:Bird            2,774
MVZ:Egg             2,965
UCM:Herp            3,160
MSB:Bird            3,183
UAM:Inv             3,199
MSB:Mamm            3,513
UAMb:Herb           4,064
HWML:Para           4,766
MVZ:Herp            5,284
CHAS:Inv            5,472
MVZ:Mamm            5,547
UAM:Ento            5,989
CHAS:Ento           6,774
UAM:Herb            8,240
UAMObs:Ento         9,739
DMNS:Inv           10,171
MVZ:Bird           11,137
UTEP:Herb          21,842

--- same data, different sort

select
  guid_prefix,
  to_char(count(distinct(taxon_name_id)),'999,999,999,999')  numberUsedNames
from
  collection,
  cataloged_item,
  identification,
  identification_taxonomy
where
  collection.collection_id=cataloged_item.collection_id and
  cataloged_item.collection_object_id=identification.collection_object_id and
  identification.identification_id=identification_taxonomy.identification_id
group by
  guid_prefix
order by
  guid_prefix
;

GUID_PREFIX      NUMBERUSEDNAMES
-------------------- ------------------------------------------------
ALMNH:ES              589
APSU:Herp             319
ASNHC:Mamm             39
BYU:Herp                1
CHAS:Bird           2,335
CHAS:EH               125
CHAS:Egg              991
CHAS:Ento           6,774
CHAS:Fish              26
CHAS:Herb             148
CHAS:Herp               2
CHAS:Inv            5,472
CHAS:Mamm             675
CHAS:Teach            957
COA:Bird              281
COA:Egg                79
COA:Ento               13
COA:Herp               20
COA:Mamm               61
DGR:Bird              491
DGR:Ento                3
DGR:Mamm               92
DMNS:Bird           2,473
DMNS:Egg              746
DMNS:Herp              36
DMNS:Inv           10,171
DMNS:Mamm             991
DMNS:Para              61
HWML:Para           4,766
KNWR:Ento           1,409
KNWR:Herb             895
KNWR:Inv               22
KNWRObs:Fish            9
KWP:Ento              880
MLZ:Bird            2,411
MLZ:Egg                23
MLZ:Herb               14
MLZ:Mamm              194
MSB:Bird            3,183
MSB:Fish              506
MSB:Herp              453
MSB:Host              802
MSB:Mamm            3,513
MSB:Para              730
MSBObs:Mamm             1
MVZ:Bird           11,137
MVZ:Egg             2,965
MVZ:Herp            5,284
MVZ:Hild              310
MVZ:Mamm            5,547
MVZObs:Bird           560
MVZObs:Herp            19
MVZObs:Mamm            16
NBSB:Bird               4
NMU:Bird              147
NMU:Mamm              149
NMU:Para                1
OWU:Bird               25
OWU:ES                353
OWU:Fish               20
OWU:Mamm               79
OWU:Para                5
OWU:Rept               89
UAM:Alg               723
UAM:Arc               141
UAM:Art                 1
UAM:Bird            2,774
UAM:EH                 28
UAM:ES                986
UAM:Ento            5,989
UAM:Env                 1
UAM:Fish              627
UAM:Herb            8,240
UAM:Herp               49
UAM:Inv             3,199
UAM:Mamm              698
UAMObs:Bird           163
UAMObs:Ento         9,739
UAMObs:Fish            10
UAMObs:Mamm            27
UAMb:Herb           4,064
UCM:Bird            1,437
UCM:Egg               591
UCM:Fish              804
UCM:Herp            3,160
UCM:Mamm            1,448
UCM:Obs               115
UCSC:Bird             128
UCSC:Herp               1
UCSC:Mamm              61
UMNH:Bird           1,074
UMNH:Herp             819
UMNH:Mamm             720
UMZM:Bird             967
UMZM:Mamm             639
UNM:ES                454
UNR:Bird               99
UNR:Fish              232
UNR:Herp              151
UNR:Mamm              211
USNPC:Para            343
UTEP:Arc               40
UTEP:Bird             595
UTEP:ES               665
UTEP:Ento             963
UTEP:Fish              51
UTEP:Herb          21,842
UTEP:Herp           1,077
UTEP:HerpOS           330
UTEP:Inv            1,191
UTEP:Mamm             318
UTEP:Teach            145
UTEP:Zoo               25
UTEPObs:Ento            1
UTEPObs:Herp          122
UWBM:Herp             574
UWBM:Mamm             973
UWYMV:Bird            599
UWYMV:Egg               2
UWYMV:Fish             62
UWYMV:Herp             45
UWYMV:Mamm            287
WNMU:Bird             422
WNMU:Fish              60
WNMU:Mamm             518

Jegelewicz commented 4 years ago

Let's talk about how GloBI does taxonomy. See Enhydra lutris, which links to all of the various taxonomic sources. This is done through a resolver (Zenodo).

Could we free ourselves from managing taxonomy in Arctos by using a tool like this?

dustymc commented 4 years ago

Alternate approach which might be mostly functionally identical but require less development, processors, and sorta everything else:

collection.preferred_taxonomy_source's datatype is currently FKEY-->classification_source, which is interpreted as "use classification data from SOURCE, else fail with no cached classification data."

Converting to an ordered array (supported by PostgreSQL) would be interpreted as "use SourceA if exists, else use SourceB if exists, else use SourceC if exists, else fail with no cached classification data."

So for example a collection could...

manage their own shrew data in their own classification, and prefer that for shrew IDs
manage their own North American bat data in their own classification, and prefer that for NA bats
use some shared classification for everything else
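The "use SourceA if exists, else SourceB" reading of an ordered array can be sketched as below. This is a minimal illustration, not the Arctos implementation; the function name, the nested-dict layout, and the source names are assumptions for the example.

```python
def preferred_classification(taxon_name, preferred_sources, classifications):
    """Return (source, classification) from the first preferred source that
    has a classification for taxon_name, or None if none of them do.

    classifications: {source_name: {taxon_name: classification}}
    """
    for source in preferred_sources:
        classification = classifications.get(source, {}).get(taxon_name)
        if classification is not None:
            return source, classification
    return None  # "fail with no cached classification data"


# Hypothetical sources for the shrew/bat example.
preferred = ["Our shrews", "Our NA bats", "Shared"]
classifications = {
    "Our shrews": {"Sorex cinereus": {"family": "Soricidae"}},
    "Our NA bats": {"Myotis lucifugus": {"family": "Vespertilionidae"}},
    "Shared": {
        "Sorex cinereus": {"family": "Soricidae"},
        "Myotis lucifugus": {"family": "Vespertilionidae"},
        "Connochaetes taurinus": {"family": "Bovidae"},
    },
}

shrew = preferred_classification("Sorex cinereus", preferred, classifications)
wildebeest = preferred_classification(
    "Connochaetes taurinus", preferred, classifications
)
nothing = preferred_classification("Not a name", preferred, classifications)
```

The shrew resolves to the collection's own source even though the shared source also has it, because order in the list is the only thing that decides precedence.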

dustymc commented 4 years ago

Un-wishlisting this; this approach should be comparatively trivial to implement and would have significant impacts.

DMNS:Inv could just use (and perhaps help improve) the Arctos classification for things not in WoRMS.

Animal-centric paleo collections could fall back to Arctos Plants for plant material, which would stop the continual reintroduction of plants to the "Arctos" classification. Problems caused by homonyms in the same classification - and there are many thousands of them - are what caused us to split classifications in the first place.

Suggest prioritization; the single classification per collection is actively introducing potentially-problematic data.

campmlc commented 4 years ago

I support this, with high priority.

sharpphyl commented 4 years ago

DMNS:Inv could just use (and perhaps help improve) the Arctos classification for things not in WoRMS.

Totally agree this would be better than mucking up WoRMS (via Arctos) with names they don't have.

dustymc commented 4 years ago

This is mostly functional and should be out tonight or possibly tomorrow. It will need documentation. I can demonstrate whatever you'd like to see in test, but https://github.com/ArctosDB/internal/issues/65 makes it difficult to see for yourself. There are two changes:

Collections can now specify any number of Sources, in order, rather than one, and
The tool that builds the cache (FLAT) uses that information to locate classifications

Manage collection now shows the ordered list of Sources, which is interpreted as "if all taxa used in an identification have at least one Arctos classification then use that, if not check Arctos Plants, if not there then check WoRMS, if not there then we're at the end of the list so do nothing."
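Read literally, that rule is applied per identification, not per name: a source is used only if it covers every taxon in the identification. A rough sketch of that reading, with a hypothetical function name and made-up sample data (the real cache builder may differ in details):

```python
def source_for_identification(taxa, preferred_sources, names_by_source):
    """Return the first source that classifies every taxon used in the
    identification, or None when the end of the list is reached."""
    for source in preferred_sources:
        available = names_by_source.get(source, set())
        if all(taxon in available for taxon in taxa):
            return source
    return None  # end of the list: do nothing


# Hypothetical contents of each source.
prefs = ["Arctos", "Arctos Plants", "WoRMS (via Arctos)"]
names_by_source = {
    "Arctos": {"Canis lupus", "Canis latrans"},
    "Arctos Plants": {"Picea glauca"},
    "WoRMS (via Arctos)": {"Mytilus edulis"},
}

hybrid = source_for_identification(
    ["Canis lupus", "Canis latrans"], prefs, names_by_source
)
plant = source_for_identification(["Picea glauca"], prefs, names_by_source)
mixed = source_for_identification(
    ["Canis lupus", "Picea glauca"], prefs, names_by_source
)
```

Under this reading, an identification whose taxa are split across sources (`mixed` above) gets no cached classification, because no single source covers all of them.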

I hope this will lead to more, smaller, and cleaner classifications. CollectionA has a shrew taxonomist, so they start a "Soricidae according to us" classification and do cool things with a manageable number of taxa; CollectionB has a bat taxonomist, so they do the same; all mammal-having collections then use

Soricidae according to us
Bats according to other-us
Some giant klunky thing that's difficult to manage, but at least it says SOMETHING about wildebeest

If CollectionC doesn't like what CollectionB has done with some bats, they can just create a "Phyllostomidae" classification and use...

Phyllostomidae
Soricidae according to us
Bats according to other-us (which will control all non-Phyllostomid bat taxa for this collection)
Some giant klunky thing that's difficult to manage, but at least it says SOMETHING about wildebeest

This is a different viewpoint than originally laid out, but I believe it leads to about the same place - collections can "prefer" bits and pieces of multiple classifications, managers can deal with the 50 rabbits they really care about without being force-fed a million insects which are in the same classification for some reason, and then collections can use those well-curated rabbit data without also needing to somehow munge aardvarks in with it.

That also means the rabbit-manager cannot possibly "oops" those million insects, which are in a different compartment, so this could open up the possibility of a hierarchical (or otherwise simplified) editor which writes directly to the core tables.

This makes documenting sources - https://github.com/ArctosDB/arctos/issues/3019 - even more important.

Yay everybody?

sharpphyl commented 4 years ago

If I understand correctly, for DMNS:Inv, we would first choose Source WoRMS (via Arctos) then Source Arctos.

Next, I would copy any classifications that I've created without an aphiaid in WoRMS (via Arctos) to Arctos and delete them from WoRMS (via Arctos). I would still be the person listed as "managed by." There would be no classifications in WoRMS (via Arctos) without an aphiaid.

If - and it has happened to about 500 names since we started using WoRMS (via Arctos) - WoRMS adds a new name, my identification would automatically switch to WoRMS (via Arctos) and show the new classification. Once a year or so, you could probably give me a list of names that list me as the "managed by" that now have a WoRMS aphiaid, so I could remove my name.
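The yearly list described above amounts to a simple cross-check between the two sources. A sketch, assuming hypothetical in-memory structures; the function name, the placeholder taxon names, and the aphiaid values are invented for illustration and are not real data:

```python
def names_now_in_worms(fallback_managed_by, worms_aphiaids, manager):
    """List names that `manager` manages in the fallback source and that
    now carry an aphiaid in the WoRMS-derived source.

    fallback_managed_by: {taxon_name: managed_by}
    worms_aphiaids:      {taxon_name: aphiaid}
    """
    return sorted(
        name
        for name, managed_by in fallback_managed_by.items()
        if managed_by == manager and worms_aphiaids.get(name) is not None
    )


# Placeholder names and made-up ids, for illustration only.
fallback_managed_by = {
    "Achatinella bryonii": "sharpphyl",
    "Placeholder name A": "sharpphyl",
    "Placeholder name B": "someone else",
}
worms_aphiaids = {"Placeholder name A": 111111, "Placeholder name B": 222222}

to_review = names_now_in_worms(fallback_managed_by, worms_aphiaids, "sharpphyl")
```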

Sounds perfectly awesome and I'm on board. Will need to do a lot of documentation updating, probably at the same time as all the changes we're consolidating per your request #2695.

Yes, YAY!

dustymc commented 4 years ago

Yes, essentially.

Falling back to "Arctos" isn't necessary - you can do that, or create something new, or whatever, but not being limited to one classification source is the big picture.

I'm advocating getting rid of "managed by" as a term altogether now that there's less reason to have giant all-encompassing classifications but whatever, it's not hurting anything, if it makes you happy then rock on!

I might eventually get around to advocating for the WoRMS classification to be purely service-managed, but we can talk about that when/if we get there.

identification would automatically switch

Yup.

sharpphyl commented 4 years ago

Help. I tried to change our source selection by making WoRMS (via Arctos) 1 and adding Arctos as 2.

When I save it, it reverses the order.

dustymc commented 4 years ago

Neato, thanks!

I applied duct tape, should be doing what you want but I'll think about that form some more.

sharpphyl commented 4 years ago

Thanks. I'll test out a few records and see if anything else needs taping.

Jegelewicz commented 4 years ago

I might eventually get around to advocating for the WoRMS classification to be purely service-managed, but we can talk about that when/if we get there.

Once this is working - I vote we do as Dusty suggests

sharpphyl commented 4 years ago

As a test, this morning I took Achatinella bryonii, which isn't in WoRMS, so there hasn't been an aphiaid for it. I copied the entire classification that I had created in WoRMS (via Arctos) into Arctos and deleted the WoRMS (via Arctos) classification. It appears that the catalog record is able to find the correct classification, but it doesn't show yet in the taxonomy page that the Source for DMNS:Inv for this particular name has changed to Arctos. Should that happen or will it take a while for it to change?

dustymc commented 4 years ago

I'll update that. It's just a view of collection settings, nothing's broken....

Jegelewicz commented 4 years ago

@Nicole-Ridgwell-NMMNHS with this in place - I think we should set up a separate taxonomy source for geology stuff - I'll propose in a new issue once we have our data ready.

dustymc commented 4 years ago

FWIW I sort of expect any diverse+active paleo collection is going to end up with about 20 taxonomy sources, assuming this is FINALLY the thing that gets people managing taxonomy in Arctos.....

I don't see any problems with geology collections or mineral taxonomy or etc., but I suspect we're missing some tools - would be good to get that fleshed out ASAP, and of course real data always forges better tools.

Jegelewicz commented 4 years ago

I don't see any problems with geology collections or mineral taxonomy or etc., but I suspect we're missing some tools - would be good to get that fleshed out ASAP, and of course real data always forges better tools.

We have a working set of data and a plan that we are putting before a few geologists before we put it up for more community discussion. Should be a new issue soon...

Nicole-Ridgwell-NMMNHS commented 4 years ago

Yay! I am excited about this. This will be great for our minerals and I'm looking forward to eventually building up a phylocode classification!

dustymc commented 4 years ago

building up a phylocode classification!

If that means what I think it does, it's going to make us think about tools. A few examples of all the complexity that might be needed by any record would give me something to think about, should you happen to have some data hanging around....

Jegelewicz commented 4 years ago

@dustymc this is what we tentatively have for minerals, rocks and chemical elements. Have a blast. Geology Taxonomy.zip ...

dustymc commented 4 years ago

Excellent, thanks!

Nicole-Ridgwell-NMMNHS commented 4 years ago

Here is a download of data for Ornithischia, excluding genus/species, from the Paleobiology Database; it is a mix of ranked and unranked terms: PBDB Ornithischia.zip (https://github.com/ArctosDB/arctos/files/5097457/PBDB.Ornithischia.zip). I think having something like the hierarchical taxonomy editor that would work for unranked terms would be essential.

campmlc commented 4 years ago

Well, this is well timed! I'll let you all mention as needed in today's discussion.

dustymc commented 4 years ago

That seems to be purely hierarchical - there's a term with zero or one parents (and some metadata, sometimes). Those data could be managed in some hierarchical tool, and as long as we don't have a need to flatten them then writing back to Arctos should be fairly straightforward. I think it could even take the shape of a built-in editor, as long as there's some mechanism to prevent adding inconsistent data (eg by disallowing access to the single-record editor - https://github.com/ArctosDB/arctos/issues/1698).
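A purely hierarchical source, where each term has zero or one parent, can be read back as a root-first lineage by walking the parent links. This is a sketch only: the `lineage` function and the three-term sample are assumptions for illustration and are not taken from the PBDB download.

```python
def lineage(term, parent_of):
    """Walk parent links from `term` up to the root and return the path
    root-first. Raises ValueError if the links form a cycle, which a
    purely hierarchical source should never contain."""
    path = []
    seen = set()
    current = term
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at {current!r}")
        seen.add(current)
        path.append(current)
        # A term missing from parent_of has no parent, so it is a root.
        current = parent_of.get(current)
    return list(reversed(path))


# Hypothetical sample: each term maps to its single parent.
parent_of = {
    "Genasauria": "Ornithischia",
    "Thyreophora": "Genasauria",
}
path = lineage("Thyreophora", parent_of)
```

Nothing here depends on ranks, so unranked terms are handled the same way as ranked ones.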

Jegelewicz commented 4 years ago

OK, I have found a flaw in the system (maybe).

Check out https://arctos.database.museum/name/Aphlebia

The insect usage of the name has been declared a synonym, so is not "valid", but the plant usage is valid. I have cloned in both classifications from GBIF (insect to the Arctos source and plant to Arctos Plants) and created the synonym relationship. Here's the rub. ALMNH:ES uses Arctos as the preferred source, with WoRMS (via Arctos) and Arctos Plants in succession. This means that they are going to wind up with the Arctos classification (insect) even if they really mean the plant, and in this crazy scenario they could potentially have both in their collection. Also, the plant version is not a synonym with Phyllodromica, but it is going to look that way now.

Sigh.

dustymc commented 4 years ago

You found a flaw in taxonomy, not Arctos....

create a new taxonomy source called "this stupid thing is a snail"
add one classification, for Aphlebia-the-snail
prefer it before all others
your Aphlebia are snails
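The effect of preferring a one-classification source ahead of everything else can be sketched like this. The function name, the "ALMNH:ES overrides" source name, and the short descriptions are hypothetical; only the preference order for ALMNH:ES comes from the comment above.

```python
def first_source_with(name, preferred_sources, names_by_source):
    """Name of the first preferred source holding `name`, else None."""
    return next(
        (s for s in preferred_sources if name in names_by_source.get(s, {})),
        None,
    )


# Hypothetical contents; descriptions stand in for full classifications.
names_by_source = {
    "Arctos": {"Aphlebia": "insect usage"},
    "WoRMS (via Arctos)": {},
    "Arctos Plants": {"Aphlebia": "plant usage"},
    "ALMNH:ES overrides": {"Aphlebia": "plant usage"},
}

before = ["Arctos", "WoRMS (via Arctos)", "Arctos Plants"]
after = ["ALMNH:ES overrides"] + before

source_before = first_source_with("Aphlebia", before, names_by_source)
source_after = first_source_with("Aphlebia", after, names_by_source)
```

Because the override source holds a single classification, every other name still falls through to the original list unchanged. It does not help a collection that holds both usages of the name.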

potentially have both

There's not much of a taxonomy solution for that. Split the collection, use taxon concepts to clarify, ....

plant version is not a synonym with Phyllodromica, but it is going to look that way now.

Relationships help search. If you want to do more, then we need relationships between classifications (which means we need a completely different approach to how we treat classification data, which is hard to imagine happening without dedicated funding).

Jegelewicz commented 4 years ago

I figured we could create an ALMNH:ES source for stupid one-offs like this but only if their collection includes only insects OR plants...

There is a part of me that wants to say - taxonomists can't get their act together and I shouldn't have to fix that....

ArctosDB / arctos

taxonomy: dynamic classification sources #2231