ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum
60 stars 13 forks source link

report needed: taxa used in IDs by a collection which do not have a preferred classification #1894

Closed dustymc closed 3 years ago

dustymc commented 5 years ago

for #1419

sql:


select
 taxon_name.scientific_name
from
  identification_taxonomy,
  identification,
  cataloged_item,
  collection,
  taxon_name
where
  identification_taxonomy.identification_id=identification.identification_id and
  identification.collection_object_id=cataloged_item.collection_object_id and 
  cataloged_item.collection_id=collection.collection_id and
  identification_taxonomy.taxon_name_id=taxon_name.taxon_name_id and
  collection.guid_prefix='{guid_prefix}' and
  taxon_name.taxon_name_id not in 
  (select taxon_name_id from taxon_term where 
    taxon_term.taxon_name_id = taxon_name.taxon_name_id and
    taxon_term.source=collection.PREFERRED_TAXONOMY_SOURCE
    )
;
sharpphyl commented 5 years ago

Can you give me another way to search for these four taxa without a preferred classification for DMNS: Inv?

screen shot 2019-02-10 at 9 11 49 am

The SQL doesn't bring up anything - at least the way I ran it.

screen shot 2019-02-10 at 9 11 21 am

dustymc commented 5 years ago

It should work there without the trailing semicolon, and you'll need to put in your actual GUID_Prefix.

select
 taxon_name.scientific_name
from
  identification_taxonomy,
  identification,
  cataloged_item,
  collection,
  taxon_name
where
  identification_taxonomy.identification_id=identification.identification_id and
  identification.collection_object_id=cataloged_item.collection_object_id and 
  cataloged_item.collection_id=collection.collection_id and
  identification_taxonomy.taxon_name_id=taxon_name.taxon_name_id and
  collection.guid_prefix='DMNS:Inv' and
  taxon_name.taxon_name_id not in 
  (select taxon_name_id from taxon_term where 
    taxon_term.taxon_name_id = taxon_name.taxon_name_id and
    taxon_term.source=collection.PREFERRED_TAXONOMY_SOURCE
    )
;

SCIENTIFIC_NAME
------------------------------------------------------------------------------------------------------------------------
Nitidella gausapata
Cerithium septemstriatum
Gemmula gilchristie
Helicostyla stabilis
sharpphyl commented 5 years ago

Thanks. It worked.

alexkrohn commented 4 years ago

Still getting the same invalid character error when I try Dusty's code. It would be nice to click on the preferred taxonomy error and be sent directly to a table of the specimens with the errors. Any idea what I'm doing wrong?

Screen Shot 2020-01-23 at 11 02 35 AM

Jegelewicz commented 4 years ago

@dustymc is the above possible?

dustymc commented 4 years ago

The error is from the semicolon and the next line. Here's the code by itself

select
 taxon_name.scientific_name
from
  identification_taxonomy,
  identification,
  cataloged_item,
  collection,
  taxon_name
where
  identification_taxonomy.identification_id=identification.identification_id and
  identification.collection_object_id=cataloged_item.collection_object_id and 
  cataloged_item.collection_id=collection.collection_id and
  identification_taxonomy.taxon_name_id=taxon_name.taxon_name_id and
  collection.guid_prefix='UCSC:Herp' and
  taxon_name.taxon_name_id not in 
  (select taxon_name_id from taxon_term where 
    taxon_term.taxon_name_id = taxon_name.taxon_name_id and
    taxon_term.source=collection.PREFERRED_TAXONOMY_SOURCE
    )

It finds nothing as of now.

Yes the purpose of this issue is to define and prioritize a report.

Jegelewicz commented 4 years ago

Just turn that code into a tool in Low Quality Data?

sharpphyl commented 3 years ago

This report could use a modification or additional report.

It is showing as low quality data 779 taxa without a preferred classification. Actually, these are the 779 taxa that are not in WoRMS (via Arctos) so they default to our second source Arctos. Most of them are fossils or terrestrials that WoRMS does not include. This is helpful to know how many taxa are using Arctos classifications.

For the low quality data report, can we modify this to look for classifications from all listed sources before reporting that there is no preferred classification?

dustymc commented 3 years ago

From https://github.com/ArctosDB/arctos/issues/3541

Suggest we change this to "missing from FLAT" which

  1. Is easy, and
  2. Finds "doesn't have useful data" which seems much more informative.
sharpphyl commented 3 years ago

Sorry, but what does "missing from FLAT" mean?

Will the report show a list of taxa used in IDs that have no (preferred) classification?

dustymc commented 3 years ago

It would not be very informative - "has classification" and "has classification with expected data" are wildly different things

A classification might include a remark and nothing else; that would not be detected in a "taxon has classification" test.

Etc., etc., etc. - "something there" is what I could potentially detect, that's it, and we know that plenty of classifications do in fact have problems.

"Something in flat" is essentially "has a classification, and it does something useful."

select scientific_name from flat where guid_prefix='DMNS:Inv' and phylum is null group by scientific_name order by scientific_name;

finds nothing - all of your IDs found something to use for phylum, yay you!


arctosprod@arctos>> select scientific_name from flat where guid_prefix='DMNS:Inv' and family is null group by scientific_name order by scientific_name;
  scientific_name  
-------------------
 Ammonitida
 Bryozoa
 Caridea
 Decapoda
 Demospongiae
 Foraminifera
 Gordius
 Heterobranchia
 Patellogastropoda
 Porifera
(10 rows)

For those, the scripts have gone through your preferred sources and couldn't find a family value. If any of those are families or below, there's something wrong with all of your preferred sources. If they're not (I think not?) then your preferred classifications supply something useful for all of your records.

sharpphyl commented 3 years ago

That helps. I always just search the collections periodically for records where the Kingdom (or phylum etc.) is NULL since SQL isn't my first language. It sounds like that's the best way for me for continue to look for problem records.

Yes, the ones with a NULL family are known to be that way. We can't ID them to any lower level at this time.

Also, this appears in my dashboard as bare_names. Will your "missing from flat" approach remove those from my problem list?

DMNS:Inv | bare_names | https://github.com/ArctosDB/arctos/issues/1894 | 780 taxa are used by DMNS:Inv and do not have a preferred classification. | {ts '2021-04-07 00:00:00'}

dustymc commented 3 years ago

search the collections periodically for records where the Kingdom (or phylum etc.) is NULL

That's the same data as what I'm thinking, just from a more taxa-centric viewpoint.

Yes the Dashboard-->Bare Names thing needs rewritten; it's still working under the idea that a collection will prefer exactly one Source.

Some of those are probably a legacy of that, and some may be names used in nonaccepted IDs. The classifications attached to those don't really DO anything so I don't think that's a problem; if it is, maybe it's at least a rare (or strange) enough problem that it can be dealt with via SQL on a case-by-case basis.

sharpphyl commented 3 years ago

@dustymc I think we can close this assuming that you have a rewrite of the Dashboard "Bare Names" (and same for the cheat sheet) on your To Do list. I'll leave closing it up to you.

Jegelewicz commented 3 years ago

@dustymc the link for taxa without a classification in the something random thingee

image

goes to this issue, but I think it should now go here - https://github.com/ArctosDB/documentation-wiki/blob/gh-pages/_sql_cheats/taxa_without_classification.markdown

Jegelewicz commented 3 years ago

@dustymc the sql to find the taxa with "bare names" doesn't work as I planned. Let's look at ALMNH:EH image

When I use the code at the top of this issue, I get:

ERROR: column collection.preferred_taxonomy_source does not exist Position: 647

Which I figured was due to the fact there is no longer a single "preferred" classification. So I modified the SQL a bit and I thought it was working based on stuff I did last night, but when I run the revised SQL (which is here), I get no names image

Help make the SQL better?

dustymc commented 3 years ago

I can rewrite the query (I think!), but it would be very expensive and I don't think it can be made to run in the UI. Suggest checking computed terms in flat - https://github.com/ArctosDB/arctos/issues/1894#issuecomment-816080112

Jegelewicz commented 3 years ago

Well, how does Arctos know there are two names with no classification? And if it knows that, why can't it tell me what they are? If we can't make it easy for people to fix "low quality data" we might as well not tell them there is any.

dustymc commented 3 years ago

how does Arctos know

Slowly. FLAT exists so the expensive computations don't have to happen at runtime.

"Has a classification" and "has a useful classification" are wildly different things. Flat gets at "useful" (eg, has expected ranks), along with performing reasonably well. I don't understand the complaint - this is a more informative approach, yay everybody?! https://github.com/ArctosDB/arctos/issues/1894#issuecomment-816080112

Jegelewicz commented 3 years ago

But the query above only looks for "phylum"? Nobody has time to check all the potential levels of every possible classification?

There are plenty of classifications that are missing "phylum"...

Jegelewicz commented 3 years ago
how does Arctos know

Slowly.

Keep a list as it goes? Is that asking too much?

dustymc commented 3 years ago

check all the potential levels of every possible classification?

This approach just checks things that are used; I don't think there can be anything cleaner.

plenty of classifications that are missing "phylum"...

Some of them should be, some should not. Anyone running this probably recognizes the names that should have some term-at-rank but do not. That information isn't available in any other approach. I still obviously don't understand the problem.

Keep a list

Hu?

Jegelewicz commented 3 years ago

Then why mention family?

select scientific_name from flat where guid_prefix='DMNS:Inv' and family is null group by scientific_name order by scientific_name;
  scientific_name 
dustymc commented 3 years ago

Family is just one term/rank that some collections/disciplines seem to care about.

I'd just replace family with other flat-terms and run multiple queries, but I'm not sure I understand what you're asking for here so that may not be useful. What precisely are the goals?

This will get at (or very near) "no classification data" if that's the question.


select scientific_name from flat where guid_prefix='DMNS:Inv' and
phylclass is null and
kingdom is null and
phylum is null and
phylorder is null and
family is null and
genus is null and
species is null and
subspecies is null and
author_text is null and
nomenclatural_code is null and
infraspecific_rank is null and
subfamily is null and
tribe is null and
subtribe is null
Jegelewicz commented 3 years ago

I'd just replace family with other flat-terms and run multiple queries,

That is way too much work to find two names with missing classifications.

I'm not sure I understand what you're asking for here

The problem is that I am being told there are two names used as identifications in my collection (ALMNH:EH) but I have NO WAY to figure out what they are....

This will get at (or very near) "no classification data" if that's the question.

That works for this and I think would be good for most cases. I'll play with it an see if anything needs added. Thanks!

dustymc commented 3 years ago

being told

Aha! That's where we should be starting - that report is almost certainly not working correctly (https://github.com/ArctosDB/arctos/issues/1894#issuecomment-816236745) - so 2 questions

  1. What (if anything) should be be checking, and
  2. Then what?
Jegelewicz commented 3 years ago

Just what it says - "bare names" a name used as an identification that has no classification in any of the collection's preferred sources.

dustymc commented 3 years ago

There's a new report in next release

Screen Shot 2021-10-01 at 8 18 46 AM

and the dashboard-creator-thing is no longer looking for taxonomy problems because I can't find a way to do so which doesn't return lots of false positives. It should be easy enough to reintroduce that if someone eventually identifies a specific need.