ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum

taxonRank and aggregators #1338

Closed dustymc closed 6 years ago

dustymc commented 6 years ago

Without taxonRank, iDigBio "fixes" various taxonomy terms to random values which are sometimes completely unrelated to the original ID.

taxonRank is not a required field in DWC.

Arctos does not require taxa to be ranked (which is an accurate representation of taxonomy itself). Some identifications do not use taxa at all, others use multiple taxa, all of it may be ranked or not.

iDigBio's suggestion is to "add[] taxonRank as a required field in Arctos" which isn't possible or practical for many reasons.

When ranked taxonomy is available we fill out the appropriate "columns" in DWC - "Family," "Order" etc. From that we could find the most specific term which is ranked, but not "The taxonomic rank of the most specific name in the scientificName" (as specified in the Standard). The lowest ranked term also does not necessarily appear in the scientificName at all.
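A minimal Python sketch of the mismatch described above, assuming a classification is an ordered list of (rank, term) pairs in which unranked terms carry no rank. The rank list, the function name, and the sample record are hypothetical and only illustrate the reasoning; this is not the Arctos export code.

```python
# Assumed ordering, most general to most specific; the real vocabulary is larger.
RANK_ORDER = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def most_specific_ranked_term(classification):
    """Return (rank, term) for the most specific ranked entry, or None."""
    ranked = [(rank, term) for rank, term in classification if rank in RANK_ORDER]
    if not ranked:
        return None
    return max(ranked, key=lambda pair: RANK_ORDER.index(pair[0]))

# Made-up record: the identification uses a term stored without a rank.
classification = [
    ("kingdom", "Animalia"),
    ("phylum", "Chordata"),
    ("class", "Mammalia"),
    (None, "Theria"),  # unranked in this hypothetical record
]
scientific_name = "Theria"

rank, term = most_specific_ranked_term(classification)
print(rank, term)               # class Mammalia
print(term in scientific_name)  # False
```

The most specific ranked term here is "Mammalia", which is not the name in scientificName, so reporting "class" as taxonRank would not match the Standard's definition.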

I can't quite see what we could do before exporting the DWC that wouldn't just be wrong in some instances. As always, I'm open to suggestions.

tucotuco commented 6 years ago

I recommend that, if the identification is to a single name at a single rank (majority of cases), provide it, otherwise leave it blank.
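One possible reading of this rule, sketched in Python under stated assumptions: an identification is represented as a list of (name, ranks) pairs, one per taxon used, and a blank string means "leave taxonRank empty". The function name and data shape are hypothetical, not part of Arctos or Darwin Core.

```python
def taxon_rank_for_identification(taxa):
    """Return a single rank for dwc:taxonRank, or "" when none is defensible.

    taxa: list of (name, ranks) pairs, one per taxon used in the identification,
    where ranks is the list of ranks recorded for that name (possibly empty).
    """
    if len(taxa) != 1:
        return ""  # zero or several taxa: leave blank
    _name, ranks = taxa[0]
    distinct = set(ranks)
    if len(distinct) != 1:
        return ""  # unranked, or the name is recorded at more than one rank
    return distinct.pop()

print(repr(taxon_rank_for_identification([("Bufo americanus", ["species"])])))
print(repr(taxon_rank_for_identification([("Name A", ["subspecies"]), ("Name B", ["order"])])))
print(repr(taxon_rank_for_identification([("Name C", [])])))
```

Only the first call yields a rank; the multi-taxon and unranked cases fall through to blank.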

atrox10 commented 6 years ago

I would fill in something, even just Animalia or Plantae, if there's no genus; otherwise I think both GBIF and iDigBio will put something crazy in for identification.

dustymc commented 6 years ago

single name at a single rank

Seems possible, if perhaps somewhat expensive. I'll explore.

fill in something

I am REALLY hesitant to ignore published Standards, and even if we do our data will not always resolve to a singular "something." There may be a useful default, but I'm going to need explicit instructions for finding it.

iDigBio will put something crazy in for identification

This is obviously a bug in iDigBio.

atrox10 commented 6 years ago

I agree about not wanting to ignore published standards, but then why are iDigBio and GBIF filling something in for this? That's what Joanna told me: if there is nothing in that field, then iDigBio fills in something, and GBIF requires it too.

John or Dusty - can you find out from someone at GBIF if they are doing the same thing as iDigBio (requiring something in this field and if it's not there, filling in identifications with random stuff)?

dustymc commented 6 years ago

why are iDigBio and GBIF filling something in for this?

That is the question!

GBIF if they are doing the same thing as iDigBio

See https://www.idigbio.org/portal/records/24cc3e24-0cac-4a54-9877-5f458a191e18 (Decapoda: Animalia > Arthropoda > Insecta > Orthoptera > Tettigoniidae) vs https://www.gbif.org/occurrence/1145113729 (Decapoda: Animalia Arthropoda Malacostraca). GBIF is behaving predictably here. (But see https://github.com/ArctosDB/arctos/issues/1291#issuecomment-334505921. GBIF has no idea how to handle ISO8601 dates for some reason. Manipulation by "portals" seems to always cause problems which users likely interpret as "Arctos is broken.")

I wrote a simple script to return rank for "single name at a single rank." It's running against accepted IDs in Arctos now and should produce results in a few days. (I think it would be usably fast in prod; I've got it throttled heavily to prevent any unanticipated problems for now.) I'll post whatever falls out when it's done; perhaps there will be some solution to the more complicated situations evident from those data.

tucotuco commented 6 years ago

GBIF's suggestions (most of their "requirements" are not actually required in practice) can be found at http://www-old.gbif.org/publishing-data/quality. The taxonRank field is only "strongly recommended". iDigBio uses the GBIF taxonomic backbone as a data source against which to validate taxa, but they do not use the same process to determine the valid classification. You can see this in the GBIF record of the same specimen:

GBIF: https://www.gbif.org/occurrence/1145113729
iDigBio: https://www.idigbio.org/portal/records/24cc3e24-0cac-4a54-9877-5f458a191e18

dustymc commented 6 years ago

There's a first-pass attempt at getting taxonRank at https://github.com/ArctosDB/DDL/blob/master/functions/getTaxonRank.sql.

Here's the result:

UAM@ARCTOS> select taxon_rank || ' @ ' || count(*) from temp_test_taxon_rank group by taxon_rank order by taxon_rank;

TAXON_RANK||'@'||COUNT(*)
------------------------------------------------------------------------------------------------------------------------
author_text @ 1
canonical name @ 40
canonical_name @ 1
class @ 24240
error!: ORA-01403: no data found @ 645026
error!: ORA-01422: exact fetch returns more than requested number of rows @ 13692
family @ 156720
forma @ 11
genus @ 53231
hyporder @ 574
infraclass @ 4
infraorder @ 1
kingdom @ 3777
order @ 93419
phylum @ 17896
species @ 1911536
subclass @ 536
subdivision @ 1
subfamily @ 5862
suborder @ 3212
subphylum @ 20
subpspecies @ 55
subspecies @ 577788
superfamily @ 2576
superorder @ 82
tribe @ 552
variety @ 5133

27 rows selected.

Adjusting the script to use only classification terms from https://arctos.database.museum/info/ctDocumentation.cfm?table=CTTAXON_TERM or something would clean up a few things, but would also slow down the script. Perhaps we should clean up our taxonomy instead?
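To put the tallies above in proportion, here is a small Python sketch that sums them. The counts are copied from the query output; the grouping into "errors" and "major Linnean ranks" is an assumption made for illustration, using the seven ranks named in the GBIF guidance quoted in this comment.

```python
# Counts copied verbatim from the query output above.
counts = {
    "author_text": 1,
    "canonical name": 40,
    "canonical_name": 1,
    "class": 24240,
    "error!: ORA-01403: no data found": 645026,
    "error!: ORA-01422: exact fetch returns more than requested number of rows": 13692,
    "family": 156720,
    "forma": 11,
    "genus": 53231,
    "hyporder": 574,
    "infraclass": 4,
    "infraorder": 1,
    "kingdom": 3777,
    "order": 93419,
    "phylum": 17896,
    "species": 1911536,
    "subclass": 536,
    "subdivision": 1,
    "subfamily": 5862,
    "suborder": 3212,
    "subphylum": 20,
    "subpspecies": 55,
    "subspecies": 577788,
    "superfamily": 2576,
    "superorder": 82,
    "tribe": 552,
    "variety": 5133,
}

MAJOR_RANKS = {"kingdom", "phylum", "class", "order", "family", "genus", "species"}

total = sum(counts.values())
errors = sum(n for rank, n in counts.items() if rank.startswith("error!"))
major = sum(n for rank, n in counts.items() if rank in MAJOR_RANKS)

print(f"total rows: {total}")
print(f"no rank returned: {errors} ({errors / total:.1%})")
print(f"major Linnean ranks: {major} ({major / total:.1%})")
```

Everything that is neither an error nor a major rank (subspecies, variety, the stray "canonical name" values, and so on) is the part that a strict major-ranks-only rule would leave unaccounted for.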

I don't see a pathway to

the ranks used have to be (major) Linnean ranks: kingdom, phylum, class, order, family, genus, species.

@ http://www-old.gbif.org/publishing-data/quality#dcTaxonRank2 - will violating that break something else?

"Error" data attached. OK, they're not because it's too big and https://github.com/ArctosDB/arctos/issues/1345. I'll email it by request.

How should I proceed, if I should proceed?

tucotuco commented 6 years ago

"The taxonomic rank of the most specific name in the scientificName. Recommended best practice is to use a controlled vocabulary."

Anything else is a myth. :-)

The majority of records use ranks that iDigBio will understand. The likelihood of a mis-classification such as the one that started this will be reduced immensely. Given that it does not happen often anyway, we can probably call it vanishingly small (which in turn just means that it'll take an extra week to find one).

dustymc commented 6 years ago

an extra week to find one

That's part of my concern - we do something random, {whoever} turns it into some sort of "users think Arctos is broken" garbage, we don't notice because it's ONLY a few tens of thousands of records....

A way of knowing what's going on in portals would be really great. "We don't like something somewhere in this record for reasons we're not going to share" is difficult to work with.

dustymc commented 6 years ago

From AWG meeting:

?????????

@atrox10

ekrimmel commented 6 years ago

FWIW, links to TDWG's ideas (https://github.com/tdwg/bdq/blob/master/tg2/README.md) for standardizing data quality tests (https://docs.google.com/spreadsheets/d/1td7zJ9GH3WWhu0Pa1X-1fkaWk71U8qqr54-kkbfwbfE/edit#gid=339716286) across aggregators. I don't know how far along in practice any of this is.

tucotuco commented 6 years ago

There will be a meeting 16-19 January in Gainesville to finalize these tests and assertions, build out pseudo-code, and create test data sets for them. The activity is urgent, as ALA, iDigBio, VertNet, Kurator, and GBIF all seek to implement the same algorithms, providing the same results on given input data.

Jegelewicz commented 6 years ago

Re-invigorating this thread as I will be talking about it at SPNHC.

Would it not be possible to have our names table include a field that was "taxon rank"? So:

scientific name, taxon rank
Aves, class
Bufo americanus, species

and so on. This way whatever identification is with the specimen would also tell iDigBio, GBIF that the ID is referring to a specific rank, regardless of what the Arctos classification has in it.

See also #1607

dustymc commented 6 years ago

names table

Names are not consistently ranked. Diptera is a genus and order, for example. We already (optionally) have ranks in classifications.
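A short Python sketch of why a single rank column on the names table would not always have one answer. The rows are illustrative only and are not an extract of Arctos data; the lookup helper is hypothetical.

```python
from collections import defaultdict

# Illustrative rows only: (name, rank, higher group).
rows = [
    ("Aves", "class", "Chordata"),
    ("Diptera", "order", "Insecta"),
    ("Diptera", "genus", "Plantae"),
]

# Collect every rank under which each name appears.
ranks_by_name = defaultdict(set)
for name, rank, _group in rows:
    ranks_by_name[name].add(rank)

def rank_for_name(name):
    """Return the rank if the name has exactly one, else None."""
    ranks = ranks_by_name.get(name, set())
    return next(iter(ranks)) if len(ranks) == 1 else None

print(rank_for_name("Aves"))     # class
print(rank_for_name("Diptera"))  # None
```

A name-keyed rank works for "Aves" in this toy data but has no single value for "Diptera", which is the case being described.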

I'm not sure how #1607 is related.

Jegelewicz commented 6 years ago

This is what is causing the Aves/Avus problem. As we are not passing a rank, they are assuming that Aves is a genus and that we are misspelling Avus.

Names are not consistently ranked. Diptera is a genus and order, for example. We already (optionally) have ranks in classifications.

And that is a problem for anyone who uses "Diptera" by itself (even with sp.) as an identification, because that is showing up at iDigBio and they don't care what we put in the classification. If we don't tell them via taxonrank that we are talking about the order, they will just assume it is the genus. Maybe names should be a pair. Name + taxon rank.

So we would have:
Diptera, order
Diptera, genus

and that would allow each to have its own, proper classification.

dustymc commented 6 years ago

they are assuming

I can't really do anything about that.

anyone who uses "Diptera" by itself (even with sp.) as an identification,

That's why we deal in data objects rather than strings.

http://arctos.database.museum/name/Diptera#Arctos
http://arctos.database.museum/name/Diptera#ArctosPlants

talking about the order

We provide that information. It's not always simple enough to pick a rank.

UAM@ARCTOS> select distinct phylclass from flat where scientific_name='Aves';

PHYLCLASS
------------------------------------------------------------------------------------------------------------------------
Aves

1 row selected.

Maybe names should be a pair. Name + taxon rank.

"Echidna, genus" is an eel and a snake and a mammal and some other stuff - that does not clarify anything in a great number of cases. It would also require ranks, which is not a useful taxonomy model.

Jegelewicz commented 6 years ago

they are assuming

I can't really do anything about that.

According to them, we can. All we have to do is provide the taxon_rank. We should figure out a way to do so.

dustymc commented 6 years ago

I'm certainly up for ideas! Mine's above - I'm not sure what could be done with the ~half-million misses.

Jegelewicz commented 6 years ago

How about adding a field like ID_Name_Rank with a controlled vocabulary that includes the ranks that iDigBio is looking for? We could populate the field for stuff already in Arctos from the lowest rank on the taxonomic classification, if there is one, attached to the taxon name. This information wouldn't have to be presented to the public. Like encumbered information, perhaps it only needs to be visible to operators.

For non-biological collections, this might be useful in other ways, but for now, they could use a term such as "not applicable" or "not provided".

campmlc commented 6 years ago

I support Teresa's suggestion. Since iDigBio can't seem to fix it on their end without input from us, we need to fix it. I don't want our data to look bad from the aggregators' portals, since for so many people that is all they look at to search for records.

dustymc commented 6 years ago

ID_Name_Rank

I'm not following - where would you store this?

ranks that iDigBio is looking for

That's another problem.

Names should be scientific (latin) names at major Linnean ranks, like “Animalia” (kingdom) or “Rosaceae” (family). Not: common names (“animals”), abbreviations (“Rosac.”), intermediate rank levels (“Tetrapoda” (superclass)), or polyphyletic or non-taxonomic groupings (“algae”, “herbivora”).

Our "lowest ranks" do include things like superclass.

Jegelewicz commented 6 years ago

ID_Name_Rank would go with Identification fields:

TAXON_NAME
ID_NAME_RANK
ID_MADE_BY_AGENT
MADE_DATE
NATURE_OF_ID
IDENTIFICATION_REMARKS

Jegelewicz commented 6 years ago

ID_NAME_RANK the level of classification to which the TAXON_NAME belongs

Values would include, but not be limited to:

Kingdom, Phylum, Class, Order, Family, Genus, Species, Subspecies

dustymc commented 6 years ago

to which the TAXON_NAME

I think I understand, but that's not quite the right verbiage. (And I can get that from the classification data when things are that simple.) IDs are two levels away from taxon names - they implicitly include classification data, and may be comprised of zero or many taxa. I think I'd suggest being more straightforward, unless there's some use case I'm not seeing.

Term: taxonRank
Definition: value to provide for DWC:taxonRank

Subspecies

I'd guess that falls in "intermediate rank levels" which iDigBio seems to not recognize, but there's no documentation that I can find so I don't really know.

Jegelewicz commented 6 years ago

So does that seem like a workable solution?

dustymc commented 6 years ago

It's workable in that if someone provides something I can send whatever they shove in there along with DWC exports. I don't think it's very usable.

Most of the problems with the script above involve uncertainty. An "A sp." determination is explicitly not to species; it's to the (usually) genus. I propose to ignore it anyway, and rerun the script without all of the uncertainty markers. That's just wrong, but maybe it's close enough to make iDigBio happy anyway.

Jegelewicz commented 6 years ago

@dustymc I woke up thinking about this and wondering if, within Arctos, the solution proposed in the AWG meeting yesterday would be better than the above. That solution was that the taxon name be paired with the author of the name and that pair would then link to the classification.

"Echidna, genus" is an eel and a snake and a mammal and some other stuff - that does not clarify anything in a great number of cases.

The above solution would let us have: Echidna J. R. Forster, 1788, a genus of moray eels

Rank: Name
Kingdom: Animalia
Phylum: Chordata
Class: Actinopterygii
Order: Anguilliformes
Family: Muraenidae
Subfamily: Muraeninae
Genus: Echidna

Echidna Cuvier, 1797, a junior homonym referring to the mammals commonly known as echidnas

Rank: Name
Kingdom: Animalia
Phylum: Chordata
Class: Mammalia
Order: Monotremata
Suborder: Tachyglossa
Family: Tachyglossidae

Echidna Merrem, 1820, junior homonym for a genus of African snakes now treated as Bitis

Rank: Name
Phylum: Chordata
Class: Reptilia
Order: Squamata
Suborder: Serpentes
Family: Viperidae
Subfamily: Viperinae
Genus: Echidna

The next step would be to get iDigBio to use this concept (which I think would make sense to them, because ranks are subject to change, but authors not so much).

dustymc commented 6 years ago

1) That would require digging up author info on the ~2.5M names we have.
2) It's impossible to tell if that scales - maybe some other Merrem slapped "Echidna" on a grasshopper in 1820.
3) I'm fairly sure it would break non-Linnean taxonomies.
4) It would (I think) require things like getting the kiddos to properly format "display name" in the bulkloader.

I cannot grasp what problem you're trying to solve with this.

Jegelewicz commented 6 years ago

That would require digging up author info on the ~2.5M names we have

It is really only pertinent (for the purposes of iDigBio) to anything with a single name (not a binomial species name). While it would be nice to have everything, for now we can focus on the names that represent genus and higher classifications.

It's impossible to tell if that scales - maybe some other Merrem slapped "Echidna" on a grasshopper in 1820.

Yeah, maybe, but I suggest it's a very LOW probability and that when we run across one of these we deal with it then.

I'm fairly sure it would break non-Linnean taxonomies.

I am not sure I understand this, are you talking about non-biological collections? I need more info to be able to respond.

It would (I think) require things like getting the kiddos to properly format "display name" in the bulkloader.

Well, I submit that isn't all that different from getting everyone to spell taxa properly in the bulkloader. If it isn't in the table you either fix your error or add it if necessary.

I cannot grasp what problem you're trying to solve with this.

I would like iDigBio to know with certainty that when I say Aves, I mean a bird, and that the identification "Aves" is not a misspelling of the genus "Avus". I would also like us to be able to distinguish, within Arctos, between Heteroceras D'Orbigny, 1849 (the mollusk) and Heteroceras (the arthropod, for which I can find zero support anywhere, but that makes a classification appear like this:)

[image]

because right now the two classifications just get lumped together:

[image]

dustymc commented 6 years ago

single name (not a binomial species name). While it would be nice to have everything, for now we can focus on the names that represent genus and higher classifications.

I can do that for formula "A" IDs.

And we DO provide those data in DWC - class=Aves is provided. I don't know why iDigBio ignores that in favor of something that's not so trivial to produce.

very LOW probability

For something that's "data" I'd generally argue that that's not good enough. For something that exists only to try to placate some broken related resource, I'm all in. (And please see my comment above re: ignoring uncertainty markers - sorta similar outlook, I think.)

able to distinguish, within Arctos, between Heteroceras D'Orbigny, 1849 (the mollusk) and Heteroceras (the arthropod, for which I can find zero support anywhere, but that makes a classification appear like this:)

The data at http://arctos.database.museum/name/Heteroceras#Arctos should be read as "it's one of these things, IDK which one." There's only ambiguity because someone has explicitly created it.

(Wild guess, the insect was aiming for http://arctos.database.museum/name/Heterocerus and missed.)

That is perhaps the most significant known limitation in the current taxonomy model. If those things are both real and we have a compelling reason to have them both (eg, we have types) in a single collection or collections which share a classification, then the model cannot be explicit. The "proper" response would be to split the classification, or recatalog one of the specimens in a collection which uses a different classification, or some combination thereof.

campmlc commented 6 years ago

Dusty, This is the other talk Teresa is giving at SPNHC in two weeks. How close are we to having a workaround? Even just a mockup example? Something Teresa can show when she discusses this problem? Even better to show that Arctos can solve it/create a workable bandaid when other databases can't, since this is a global problem with iDigBio.

dustymc commented 6 years ago

I'm waiting on feedback from y'all:

Most of the problems with the script above involve uncertainty. An "A sp." determination is explicitly not to species; it's to the (usually) genus. I propose to ignore it anyway, and rerun the script without all of the uncertainty markers. That's just wrong, but maybe it's close enough to make iDigBio happy anyway.

Does that sound OK?

campmlc commented 6 years ago

Sounds OK to me. It can't be more broken than it is already, and worth a try?

dustymc commented 6 years ago

can't be more broken than it is already

Sure it can! We're currently saying nothing, what with us not having the data and the concept being optional and all. I'm proposing outright lies (which hopefully get the right idea across anyway).

I'll get the scripts running unless someone stops me in the very near future.

anna-chinn commented 6 years ago

Could we store both a taxon name and a specific classification as part of an identification (/in the identification_taxonomy table)? That way, where multiple classifications exist for a given taxon name, we could explicitly select the classification we want to follow for each specimen identification we create. taxonRank could be populated by the lowest ranked taxon in the preferred classification on the back end.

Doesn't help with the issues surrounding uncertain identifications, but maybe could disambiguate things in the future?

dustymc commented 6 years ago

Could we store both a taxon name and a specific classification as part of an identification (/in the identification_taxonomy table)?

That would require identifying a classification when bulkloading specimens, which I don't think is practical/usable. That's certainly a workable model from a data design perspective and I'm completely open for suggestions regarding how it might be implemented.

explicitly select the classification we want to follow

While Arctos allows multiple classifications at the intersection of a collection's preferred classification and a name, I don't think that's ever the "correct" approach (under the current model). If this is a problem, we should probably be talking about administrative organization (eg, splitting classifications or collections).

I don't see how this could help with taxon rank - eg "Mus musculus domesticus and Siphonaptera" would produce two taxon_ranks (there are two rows associated with that ID in identification_taxonomy, one for each involved taxon) and DWC demands one value.

Jegelewicz commented 6 years ago

Could we store both a taxon name and a specific classification as part of an identification (/in the identification_taxonomy table)? That way, where multiple classifications exist for a given taxon name, we could explicitly select the classification we want to follow for each specimen identification we create. taxonRank could be populated by the lowest ranked taxon in the preferred classification on the back end.

I go back to this:

ID_Name_Rank would go with Identification fields:

TAXON_NAME
ID_NAME_RANK
ID_MADE_BY_AGENT
MADE_DATE
NATURE_OF_ID
IDENTIFICATION_REMARKS

We can call it TaxonRank as per Dusty's suggestion.

campmlc commented 6 years ago

Could we do both, eg specify the taxon rank of the current name, and add a higher taxon eg phylum or class to match it to a classification? This would be like what we do with higher geography. We could disambiguate Aves (class in phylum Chordata) from Avus (genus in phylum Arthropoda or Mollusca etc). Surely there must be very few homonyms within a single phylum or class. For the longer term, I support being able to choose a classification, however that can be implemented.

dustymc commented 6 years ago

add a higher taxon eg phylum or class to match it to a classification

That is not an unambiguous pathway. If we go there, it would have to be by using some unambiguously unique term, or combination of terms, to pick the classification. We do that in geography - there's a unique key on higher_geog so it's a good proxy to the ID. I know of no necessarily-unique combinations of data in classifications other than the classification_id (which is an arbitrary string/not "data").
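A rough way to test whether a proposed combination of terms is "necessarily unique" is to count how many classifications share each combination. The sketch below assumes classifications can be flattened to dictionaries; the rows and field names are invented for illustration and are not Arctos data.

```python
from collections import Counter


def is_unique_key(classifications, terms):
    """True if no two classifications share the same values for `terms`."""
    counts = Counter(tuple(c.get(t) for t in terms) for c in classifications)
    return all(n == 1 for n in counts.values())


# Invented rows: two classifications of one name that agree on class.
rows = [
    {"classification_id": "c1", "name": "Echidna", "class": "Mammalia"},
    {"classification_id": "c2", "name": "Echidna", "class": "Mammalia"},
]
```

Here `is_unique_key(rows, ["name", "class"])` is false because both rows collide, while `is_unique_key(rows, ["classification_id"])` is true, which is the point about classification_id being the only reliable key.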

disambiguate Aves (Class in phylum Chordata) from Avus

There's no need to do that in Arctos. (And there should be no need to do it anywhere!) The only reason you'd need to disambiguate is if you have two classifications in your preferred classification schema for a name.

or similar, in which case I'd suggest it's time to split classifications so we can remove one of them from each.

very few homonyms within a single phylum or class

Nah, there are at least tens of thousands of them. Fortunately we don't have to care - Echidna is unambiguously a fish in collections that prefer the Arctos classification because http://arctos.database.museum/name/Echidna#Arctos. For contrast, according to http://arctos.database.museum/name/Echidna#TheInterimRegisterofMarineandNonmarineGenera Echidna is a mammal, snake, fish, moth, mammal again for some reason, snake again, fish again, or maybe a moth, again. Given a spiky mammal and the need to pick a classification, I'm not terribly convinced that anyone's going to predictably pick the one that starts with "Biota" instead of the one that starts with "Animalia" (or vice versa).

I'm lost. These comments make me think that, in addition to iDigBio being broken (which my scripts should more or less fix), there's some other problem that you're trying to solve by picking classifications. Can you clarify?

Jegelewicz commented 6 years ago

iDigBio is not broken. They are making an attempt to bring together thousands of collections that send them data in as many schemas. I would like to solve the problem that this thread began with - getting a taxon rank to iDigBio and any other aggregator who wants to use it. I would like this to be a field that I can see and edit in Arctos, thus:

Taxon_Rank would go with Identification fields:

TAXON_NAME TAXON_RANK ID_MADE_BY_AGENT MADE_DATE NATURE_OF_ID IDENTIFICATION_REMARKS

with Dusty's definition:

Term: taxonRank

Definition: value to provide for DWC:taxonRank using whatever vocabulary necessary to get things right at iDigBio.

If my Aves had been passed to them with Taxon_Rank "class", they would not be changed to Avus by iDigBio scripts.

dustymc commented 6 years ago

@atrox10 does selecting a taxon_rank for each identification work for you? If there's a consensus I can add that and abandon the script idea.

For future reference, https://www.gbif.org/data-quality-requirements-occurrences#dcTaxonRank lists allowable taxon ranks as

kingdom, phylum, class, order, family, genus, species.
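If a per-identification taxon_rank field is added, the exported value could be checked against that list before it goes into DWC. This is only a sketch: the function name is hypothetical, and blanking out values that are not on the list (rather than passing them through) is an assumption, not something decided in this thread.

```python
# The seven ranks quoted above from the GBIF data quality requirements page.
GBIF_RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")


def normalize_taxon_rank(value):
    """Return the rank in lowercase if it is on the list, otherwise ''."""
    cleaned = (value or "").strip().lower()
    return cleaned if cleaned in GBIF_RANKS else ""
```

For example, " Class " normalizes to "class", while "subspecies" or a missing value exports as blank.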

anna-chinn commented 6 years ago

I think the issue that Mariel and I are trying to address is in the case of a snake called Echidna, for example. At this juncture, a snake specimen IDed as Echidna won't come up in a search for all vipers or squamates on Arctos or on GBIF/iDigBio (because we automatically populate higher taxonomy fields in our flat table and our Darwin Core Archives per the Arctos classification, right?).

At this point in time, is this kind of case only rectified by adding a second Arctos classification that recognizes Echidna the snake? And then do we accept that both classifications are displayed on a given record (like in the Heteroceras example)?

I like the idea of collection type-specific classifications to solve (at least part of) this problem of homonyms, as you mentioned, Dusty!

dustymc commented 6 years ago

a snake specimen IDed as Echidna won't come up in a search for all vipers or squamates on Arctos

That depends on how you search. If you search the field "family" then only "Muraenidae" will match specimens in collections that prefer "Arctos" taxonomy. If you search "any taxon" then anything on http://arctos.database.museum/name/Echidna will find the specimens (and probably some stuff you didn't want - the "local" classifications should make it easy to remove that though).
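The difference between the two searches can be sketched as follows. The data and function names are illustrative only: "Muraenidae" comes from the paragraph above, while the second classification and its "Viperidae" value are invented to stand in for a non-preferred classification of the same name.

```python
# Illustrative data, not real Arctos rows: classifications keyed by source.
NAME_CLASSIFICATIONS = {
    "Echidna": {
        "Arctos": {"family": "Muraenidae"},
        "OtherSource": {"family": "Viperidae"},  # invented for the example
    }
}


def matches_family(name, family, preferred="Arctos"):
    """Field search: only the preferred classification's family is consulted."""
    by_source = NAME_CLASSIFICATIONS.get(name, {})
    return by_source.get(preferred, {}).get("family") == family


def matches_any_taxon(name, term):
    """'Any taxon' search: the name itself or any term in any classification."""
    if term == name:
        return True
    by_source = NAME_CLASSIFICATIONS.get(name, {})
    return any(term in c.values() for c in by_source.values())
```

Under these assumptions a family search for the viper family misses the specimen, while an "any taxon" search for the same term finds it.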

GBIF/iDigBio

I can't speak to what they're doing.

fields in our flat table and our Darwin Core Archives per the Arctos classification

Correct, we provide ranked DWC terms from the preferred classification. (And that gets weird if there are !=1 classifications in the preferred classification schema for a name.)
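As a sketch of that export step, the function below fills ranked DWC columns from a name's preferred classification and reports the "weird" case instead of guessing. It is not the Arctos implementation; the function name, the `(term, rank)` pair layout, and the unranked example term are assumptions made for illustration.

```python
def ranked_dwc_terms(preferred_classifications):
    """Map rank -> term for the single preferred classification of a name.

    `preferred_classifications` is a list of classifications, each a list of
    (term, rank) pairs where rank is None for unranked terms.
    Returns (terms, problem); problem is None when exactly one was found.
    """
    count = len(preferred_classifications)
    if count != 1:
        return {}, f"expected 1 preferred classification, found {count}"
    # Unranked terms have no DWC column, so they are skipped.
    terms = {rank: term for term, rank in preferred_classifications[0] if rank}
    return terms, None


# Hypothetical classification for one name, including one unranked term.
echidna = [("Muraenidae", "family"), ("Echidna", "genus"), ("example clade", None)]
```

Calling `ranked_dwc_terms([echidna])` yields family and genus columns only; passing zero or two classifications returns an empty mapping plus a message that can be logged for cleanup.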

collection type-specific classifications

I don't really care how y'all partition things out. If you share classifications you'll automagically get updates from other users who use the same classification. If you don't then you won't. (So a new "things paleo folks use" classification would NOT get MVZ's recent updates to birds, for example - and would not need to worry about Aves-the-not-bird conflicting with anything at MVZ either.) I suspect that's something best addressed as a community, or at least a sub-community of similar collections, but there are no functional implications so with whom, if anyone, to share is left entirely up to the individual collections.

Jegelewicz commented 6 years ago

I like the idea of collection type-specific classifications to solve (at least part of) this problem of homonyms, as you mentioned, Dusty!

This would be OK, except for collections that cross higher taxa - a teaching or recent paleo collection for example. I would like to see us be the pioneers who come up with a way to handle sharing all taxa. It is an important hurdle that the community needs to figure out.

campmlc commented 6 years ago

Yes, I think ultimately we need to be able to have multiple possible classifications associated with a name and be able to choose or specify which one to assign to a particular record. If that is fundamentally not possible, then some collections will have to split off from using Arctos shared taxonomy. But that still doesn't solve Teresa's problem of needing different classifications within a single collection. For now, can we just try to solve the immediate problem of iDigBio by adding the Taxon rank field that Teresa has requested and iDigBio highly recommends? John W mentioned a January conference to address these issues in an earlier post. Do we know what resulted from that?

dustymc commented 6 years ago

collections that cross higher taxa

That's not really a problem, unless they have a strong reason to catalog different kinds of specimens under the same name (eg, they hold type material of Echidna-the-snake and Echidna-the-virus, which seems a bit unlikely).

sharing all taxa

There's an old model (https://pdfs.semanticscholar.org/80fe/a7efd0072bde6b640e8c93bddea813d4f436.pdf) which was intended to be a part of Arctos in the early days. It does just that, but it has two small problems:

1) It deals in taxon concepts, and we generally do not.

2) Building the transitive closure node requires more storage than we can readily access (10^15-ish rows, IIRC).

fundamentally not possible

Selecting classifications is possible (a bit of a simplification even) from my perspective, I just don't think you can use it.

split off from using Arctos shared taxonomy

There are already multiple local classifications in use - http://arctos.database.museum/info/ctDocumentation.cfm?table=CTTAXONOMY_SOURCE

Teresa's problem

http://arctos.database.museum/name/Heteroceras#Arctos? If so, @anna-chinn can you confirm that http://arctos.database.museum/guid/CHAS:Ento:4857 isn't just a typo? (I'm not sure there's an actual problem.)

can we just try to solve the immediate problem of iDigBio by adding the Taxon rank field

Yes, the question is only in how to do so. I think @atrox10 was looking for "hopefully-close-enough" scripts; that's somehow changed into selecting a taxon rank for every ID (I think).

Jegelewicz commented 6 years ago

"hopefully-close-enough" scripts

Is exactly what iDigBio is doing - I'd like to be better than that.

anna-chinn commented 6 years ago

If so, @anna-chinn can you confirm that http://arctos.database.museum/guid/CHAS:Ento:4857 isn't just a typo? (I'm not sure there's an actual problem.)

Our catalog book says CHAS:Ento:4857 is "Heteroceras angulatum (Meek & Hayden)" (which comes up in publications when Googled). Unfortunately, the specimen is missing, so I can't check any physical material.

EDIT: Though, now that I'm looking closer, it is a cephalopod, not the beetle, and was just assigned to the wrong collection at some point in time... So yes, a typo?

dustymc commented 6 years ago

Thanks!

https://repository.si.edu/bitstream/handle/10088/23079/SMC_7_Meek_1864_8_ii-40.pdf?sequence=1&isAllowed=y (and others) strongly suggests that Heteroceras angulatum (Meek & Hayden) is an ammonite. OK to delete the Arthropoda (phylum) classification?

anna-chinn commented 6 years ago

OK to delete the Arthropoda (phylum) classification?

Yes!