Closed dustymc closed 3 years ago
Can you give me another way to search for these four taxa without a preferred classification for DMNS: Inv?
The SQL doesn't bring up anything - at least the way I ran it.
It should work there without the trailing semicolon, and you'll need to put in your actual GUID_Prefix.
select
taxon_name.scientific_name
from
identification_taxonomy,
identification,
cataloged_item,
collection,
taxon_name
where
identification_taxonomy.identification_id=identification.identification_id and
identification.collection_object_id=cataloged_item.collection_object_id and
cataloged_item.collection_id=collection.collection_id and
identification_taxonomy.taxon_name_id=taxon_name.taxon_name_id and
collection.guid_prefix='DMNS:Inv' and
taxon_name.taxon_name_id not in
(select taxon_name_id from taxon_term where
taxon_term.taxon_name_id = taxon_name.taxon_name_id and
taxon_term.source=collection.PREFERRED_TAXONOMY_SOURCE
)
;
SCIENTIFIC_NAME
------------------------------------------------------------------------------------------------------------------------
Nitidella gausapata
Cerithium septemstriatum
Gemmula gilchristie
Helicostyla stabilis
Thanks. It worked.
Still getting the same invalid character error when I try Dusty's code. It would be nice to click on the preferred taxonomy error and be sent directly to a table of the specimens with the errors. Any idea what I'm doing wrong?
@dustymc is the above possible?
The error is from the semicolon and the next line. Here's the code by itself
select
taxon_name.scientific_name
from
identification_taxonomy,
identification,
cataloged_item,
collection,
taxon_name
where
identification_taxonomy.identification_id=identification.identification_id and
identification.collection_object_id=cataloged_item.collection_object_id and
cataloged_item.collection_id=collection.collection_id and
identification_taxonomy.taxon_name_id=taxon_name.taxon_name_id and
collection.guid_prefix='UCSC:Herp' and
taxon_name.taxon_name_id not in
(select taxon_name_id from taxon_term where
taxon_term.taxon_name_id = taxon_name.taxon_name_id and
taxon_term.source=collection.PREFERRED_TAXONOMY_SOURCE
)
It finds nothing as of now.
Yes the purpose of this issue is to define and prioritize a report.
Just turn that code into a tool in Low Quality Data?
This report could use a modification or additional report.
It is showing as low quality data 779 taxa without a preferred classification. Actually, these are the 779 taxa that are not in WoRMS (via Arctos) so they default to our second source Arctos. Most of them are fossils or terrestrials that WoRMS does not include. This is helpful to know how many taxa are using Arctos classifications.
For the low quality data report, can we modify this to look for classifications from all listed sources before reporting that there is no preferred classification?
From https://github.com/ArctosDB/arctos/issues/3541
Suggest we change this to "missing from FLAT" which
Sorry, but what does "missing from FLAT" mean?
Will the report show a list of taxa used in IDs that have no (preferred) classification?
It would not be very informative - "has classification" and "has classification with expected data" are wildly different things
A classification might include a remark and nothing else; that would not be detected in a "taxon has classification" test.
Etc., etc., etc. - "something there" is what I could potentially detect, that's it, and we know that plenty of classifications do in fact have problems.
"Something in flat" is essentially "has a classification, and it does something useful."
select scientific_name from flat where guid_prefix='DMNS:Inv' and phylum is null group by scientific_name order by scientific_name;
finds nothing - all of your IDs found something to use for phylum, yay you!
arctosprod@arctos>> select scientific_name from flat where guid_prefix='DMNS:Inv' and family is null group by scientific_name order by scientific_name;
scientific_name
-------------------
Ammonitida
Bryozoa
Caridea
Decapoda
Demospongiae
Foraminifera
Gordius
Heterobranchia
Patellogastropoda
Porifera
(10 rows)
For those, the scripts have gone through your preferred sources and couldn't find a family value. If any of those are families or below, there's something wrong with all of your preferred sources. If they're not (I think not?) then your preferred classifications supply something useful for all of your records.
That helps. I always just search the collections periodically for records where the Kingdom (or phylum etc.) is NULL since SQL isn't my first language. It sounds like that's the best way for me for continue to look for problem records.
Yes, the ones with a NULL family are known to be that way. We can't ID them to any lower level at this time.
Also, this appears in my dashboard as bare_names. Will your "missing from flat" approach remove those from my problem list?
DMNS:Inv | bare_names | https://github.com/ArctosDB/arctos/issues/1894 | 780 taxa are used by DMNS:Inv and do not have a preferred classification. | {ts '2021-04-07 00:00:00'}
search the collections periodically for records where the Kingdom (or phylum etc.) is NULL
That's the same data as what I'm thinking, just from a more taxa-centric viewpoint.
Yes the Dashboard-->Bare Names thing needs rewritten; it's still working under the idea that a collection will prefer exactly one Source.
Some of those are probably a legacy of that, and some may be names used in nonaccepted IDs. The classifications attached to those don't really DO anything so I don't think that's a problem; if it is, maybe it's at least a rare (or strange) enough problem that it can be dealt with via SQL on a case-by-case basis.
@dustymc I think we can close this assuming that you have a rewrite of the Dashboard "Bare Names" (and same for the cheat sheet) on your To Do list. I'll leave closing it up to you.
@dustymc the link for taxa without a classification in the something random thingee
goes to this issue, but I think it should now go here - https://github.com/ArctosDB/documentation-wiki/blob/gh-pages/_sql_cheats/taxa_without_classification.markdown
@dustymc the sql to find the taxa with "bare names" doesn't work as I planned. Let's look at ALMNH:EH
When I use the code at the top of this issue, I get:
ERROR: column collection.preferred_taxonomy_source does not exist Position: 647
Which I figured was due to the fact there is no longer a single "preferred" classification. So I modified the SQL a bit and I thought it was working based on stuff I did last night, but when I run the revised SQL (which is here), I get no names
Help make the SQL better?
I can rewrite the query (I think!), but it would be very expensive and I don't think it can be made to run in the UI. Suggest checking computed terms in flat - https://github.com/ArctosDB/arctos/issues/1894#issuecomment-816080112
Well, how does Arctos know there are two names with no classification? And if it knows that, why can't it tell me what they are? If we can't make it easy for people to fix "low quality data" we might as well not tell them there is any.
how does Arctos know
Slowly. FLAT exists so the expensive computations don't have to happen at runtime.
"Has a classification" and "has a useful classification" are wildly different things. Flat gets at "useful" (eg, has expected ranks), along with performing reasonably well. I don't understand the complaint - this is a more informative approach, yay everybody?! https://github.com/ArctosDB/arctos/issues/1894#issuecomment-816080112
But the query above only looks for "phylum"? Nobody has time to check all the potential levels of every possible classification?
There are plenty of classifications that are missing "phylum"...
how does Arctos know
Slowly.
Keep a list as it goes? Is that asking too much?
check all the potential levels of every possible classification?
This approach just checks things that are used; I don't think there can be anything cleaner.
plenty of classifications that are missing "phylum"...
Some of them should be, some should not. Anyone running this probably recognizes the names that should have some term-at-rank but do not. That information isn't available in any other approach. I still obviously don't understand the problem.
Keep a list
Hu?
Then why mention family?
select scientific_name from flat where guid_prefix='DMNS:Inv' and family is null group by scientific_name order by scientific_name;
scientific_name
Family is just one term/rank that some collections/disciplines seem to care about.
I'd just replace family with other flat-terms and run multiple queries, but I'm not sure I understand what you're asking for here so that may not be useful. What precisely are the goals?
This will get at (or very near) "no classification data" if that's the question.
select scientific_name from flat where guid_prefix='DMNS:Inv' and
phylclass is null and
kingdom is null and
phylum is null and
phylorder is null and
family is null and
genus is null and
species is null and
subspecies is null and
author_text is null and
nomenclatural_code is null and
infraspecific_rank is null and
subfamily is null and
tribe is null and
subtribe is null
I'd just replace family with other flat-terms and run multiple queries,
That is way too much work to find two names with missing classifications.
I'm not sure I understand what you're asking for here
The problem is that I am being told there are two names used as identifications in my collection (ALMNH:EH) but I have NO WAY to figure out what they are....
This will get at (or very near) "no classification data" if that's the question.
That works for this and I think would be good for most cases. I'll play with it an see if anything needs added. Thanks!
being told
Aha! That's where we should be starting - that report is almost certainly not working correctly (https://github.com/ArctosDB/arctos/issues/1894#issuecomment-816236745) - so 2 questions
Just what it says - "bare names" a name used as an identification that has no classification in any of the collection's preferred sources.
There's a new report in next release
and the dashboard-creator-thing is no longer looking for taxonomy problems because I can't find a way to do so which doesn't return lots of false positives. It should be easy enough to reintroduce that if someone eventually identifies a specific need.
for #1419
sql: