Closed dustymc closed 2 years ago
Wouldn't this mean that we might no longer need WoRMS (via Arctos)?
I think we could use this for TPT if they got their taxonomy loaded to Global Names. MSB Para for example might use both TPT and WoRMS. I don't suppose we have the ability yet to prefer a particular source for a particular taxonomic group? E.g. TPT for Arthropoda but WoRMS for Platyhelminthes? For just the TPT scenario, getting them to work with WoRMS would solve this problem.
On Thu, Dec 17, 2020 at 10:43 AM Teresa Mayfield-Meyer wrote:
Wouldn't this mean that we might no longer need WoRMS (via Arctos)?
no longer need WoRMS (via Arctos)
Exactly. What we're doing with WoRMS is a lot of work and still doesn't get the attention it deserves. This would mean we could focus on one API (GN's), instead of something completely different for every "source" that might eventually come along.
prefer a particular source for a particular taxonomic group
The source defines the "taxonomic group" - that's circular, and I don't think it has a technical solution. You can of course prefer whatever you want in whatever order you want, and that can certainly be arranged in such complex ways, but I think it would depend on you finding or maintaining carefully-ranked classifications, not me waving my magic wand around.
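A minimal sketch of what "prefer whatever you want in whatever order you want" could look like: a collection keeps an ordered list of sources, and the first source that has a classification for the name wins. The source names, the data layout, and the `pick_classification` helper are all hypothetical placeholders, not Arctos's actual schema or code, and the fall-through to the next source is an assumption.

```python
from typing import Dict, List, Optional, Tuple

Classification = List[Tuple[str, str]]  # (rank, term) pairs

# Hypothetical data: name -> {source -> classification}.
CLASSIFICATIONS: Dict[str, Dict[str, Classification]] = {
    "Aus bus": {
        "SourceA": [("family", "Examplidae"), ("genus", "Aus")],
        "SourceB": [("genus", "Aus")],
    },
    "Cus dus": {
        "SourceB": [("genus", "Cus")],
    },
}


def pick_classification(
    name: str, preferred_sources: List[str]
) -> Optional[Tuple[str, Classification]]:
    """Return (source, classification) from the first preferred source that has the name."""
    by_source = CLASSIFICATIONS.get(name, {})
    for source in preferred_sources:
        if source in by_source:
            return source, by_source[source]
    return None  # no preferred source knows this name
```

Note that the ranking here is per collection, not per taxonomic group: a name only falls through to the second source when the first one has nothing for it.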
no longer need WoRMS (via Arctos)
That could be superb. As you say what we're doing is a lot of work and isn't perfect. As I recall our data comes directly from WoRMS with constant refresh and GN's website says that the WoRMS refresh period is 60 days.
Still, today I had to add three (and more to come probably) taxon names that have aphia IDs in WoRMS but aren't in WoRMS (via Arctos). They were created Nov. 11, 2020. Would I still be able to get the classification by creating the name and refreshing the aphiaID as I did today? And can we still link directly to the WoRMS URL through the "go to arrow"?
Would this automatically match exactly what's in WoRMS? See https://arctos.database.museum/name/Helix%20redassiana for an example where the subgenus is in the WoRMS (via Arctos) classification but not in the name because I can't (or don't know how to) create the name with a subgenus - Helix (Xerophila) redassiana. And, therefore, it shows as "not found" in WoRMS.
Would their taxon status also override our taxon status? I'm still putting taxon inquirendum in the cloned Arctos remarks because we don't have that in our taxon status tables. Other examples: https://arctos.database.museum/name/Xerophila%20boiteli#WoRMSviaArctos and https://arctos.database.museum/name/Helix%20chadiana. Would I still be able to download directly from WoRMS?
I assume Arctos could still be second in line for taxon classifications that WoRMS doesn't have.
The potential "gotcha" is that this would be explicitly giving up control of classification data
Yes, we had issues at first with finding specimens with new genera or families but having taxonomy curated by experts is worth it for us.
Is GN stable and well funded?
Would I still be able to get the classification by creating the name
Just that.
and refreshing the aphiaID
That would no longer be part of the process.
still link directly to the WoRMS URL
That would be up to GN.
subgenus
GN's problem.
or don't know how to
https://github.com/ArctosDB/arctos/issues/988
Would their taxon status also override our taxon status?
I don't understand what you're asking, but I think the answer is that we'd just have whatever they provide to GN and GN passes on to us.
We'd probably need to talk about what's in some UIs - this would mean we no longer have predictable labels for nonclassification data - but that's pretty trivial.
Would I still be able to download directly from WoRMS?
That's up to WoRMS?
finding specimens with new genera or families
I was thinking more along the lines of "WoRMS decides Helix is a fungus and so all your clams start looking like mushrooms," but whatever a preferred source does would become associated with your records.
You've been living under that model so it's really no change for you, and WoRMS doesn't seem prone to that sort of thing anyway, but this is the main idea that needs understood and accepted before using a "remote" Source, whatever the mechanism. Not understanding this is going to result in frustration or worse, everything else is mostly details.
stable and well funded
Probably at least as much as Arctos? That doesn't concern me very much, it's not like they would or could take any data with them if NSF decides to off them. That would suck, but not in any way that might change anything in your records. I like to think that this sort of inter-dependency helps them avoid that fate, which obviously helps us, but IDK if that's how NSF actually works or not.
Would their taxon status also override our taxon status? I don't understand what you're asking, but I think the answer is that we'd just have whatever they provide to GN and GN passes on to us.
Right now, we only get taxon status if it's in our taxon status code table.
Beyond logistical questions, I don't understand the current or proposed procedure well enough to identify potential issues. If you think it will work better with less hassle, then it's fine to use WoRMS instead of WoRMS (via Arctos).
I am planning to push the taxonomic experts in the TPT to approach WoRMS and manage their stuff there, but that is a long term plan that may take a while and TPT members have to get cataloging NOW. So, I am about to get a move on the TPT taxonomy in Arctos. I have two choices for dealing with it:
Eventually, ALL TPT members will be using the source at WoRMS (I hope), but they could at least use Global Names until we can get that ball rolling. That doesn't really make any difference to Arctos members, though, unless we can prefer a Global Names taxonomy, because I would still have to enter the classifications into Arctos in order for them to be useful to Arctos collections.
For now, I am only going to be loading names that have successfully passed the Global Names validator to Arctos, but it won't be long before I need to get clarity on the associated classification handling. @dustymc @campmlc you guys need to be involved in the decision.
Is WoRMS interested in that sort of thing? Past interactions suggest they have little or no interest in anything that doesn't REALLY like saltwater. If they are, and depending on how things there work, maybe they'd be interested in essentially acting as a UI for other Arctos collections who want to manage some block of taxa for some reason?
For now I think that's irrelevant from here, it doesn't matter how the data gets to GN.
I haven't had time to play with this more, but I still don't see any major obstacles to allowing remote classifications. Anyone using those needs to fully understand what it means for their data, but that's just a social issue.
The situation with subgenus nicely illustrates the one technical(ish) issue I can see with remote classifications - I don't think I'll find things that don't match Arctos names, and the Arctos community has very wisely decided to keep names clean (https://github.com/ArctosDB/arctos/issues/1704). Again, that's something which GN could in theory address - they already do some fuzzy matching, I think extracting the "pure" name from some "traditional" format is just more of the same.
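As a rough illustration of extracting the "pure" name from a "traditional" format, the sketch below drops a parenthesised subgenus so that a string like the one in the Helix example can be matched against a clean name. This is only an assumption about what such a step might look like; it is not how GN or Arctos does it, and the `strip_subgenus` helper is hypothetical.

```python
import re

# Matches a capitalised genus followed by a parenthesised, capitalised subgenus,
# e.g. the "Helix (Xerophila) " in "Helix (Xerophila) redassiana".
_SUBGENUS = re.compile(r"^([A-Z][a-z-]+)\s+\([A-Z][a-z-]+\)\s+")


def strip_subgenus(name: str) -> str:
    """Return the name without a parenthesised subgenus, or unchanged if there is none."""
    return _SUBGENUS.sub(r"\1 ", name.strip())
```

With this, `strip_subgenus("Helix (Xerophila) redassiana")` gives `"Helix redassiana"`, and a name without a subgenus passes through untouched. Authorship strings, hybrids, and other not-quite-taxon-names are deliberately out of scope.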
The solution to any situation where Arctos can't find data in GN for some reason is and will remain to just manage your own local classification, or use something else that does so. If strange not-quite-taxon-names are uncommon (and I think they are in TPT) then that's probably fairly minimal.
Is WoRMS interested in that sort of thing?
Yes, I believe they are. Marine environments are not walled off from the rest of the living world.....
depending on how things there work, maybe they'd be interested in essentially acting as a UI for other Arctos collections who want to manage some block of taxa for some reason?
The only reason I think they would accept is TAXONOMIC EXPERTISE, but that is ostensibly what the TPT Taxonomy has...
The content of WoRMS is controlled by taxonomic and thematic experts, not by database managers. WoRMS has an editorial management system where each taxonomic group is represented by an expert who has the authority over the content, and is responsible for controlling the quality of the information. Each of these main taxonomic editors can invite several specialists of smaller groups within their area of responsibility to join them.
WoRMS has extended their molluscan taxonomy to terrestrial and freshwater species because the editors who were doing the marine Mollusca are now editing MolluscaBase, which is more comprehensive and feeds into WoRMS. The same thing could probably happen for other groups if they approach WoRMS with their expertise.
WoRMS is superior to GN because it is edited and GN seems to take in anything and send it back out without any review. That includes taxonomic messes we created before WoRMS was available.
Diptera
There are flies that like brackish water. I don't think WoRMS extends to all dipterans, but it might be cool if they did.
I assume that anyone wanting to curate their own taxonomy would bring expertise to the table, but I suppose that would be between them and WoRMS.
WoRMS is superior to GN
Apples are superior to oranges - unless you're trying to make orange juice...
WoRMS is a taxonomy database, or something like it. They're great, if you share their opinions. They're horrible and evil if you don't, and Arctos can't be in the position of trying to tell Curators what their taxonomic opinions should be.
GN is (from my perspective, which doesn't encompass everything) a very nice way to share data. They inject no opinions in that process, which is what makes them useful to us. We don't just use any of the big "taxonomy databases" (which was the plan for quite some time) because some Curator thinks they're wrong (or at least outdated) about something.
One's not better than the other, they're just very different things which serve very different purposes.
GN makes data from folks like WoRMS more accessible to us, and our current structure lets collections use what they want and override anything they don't agree with. Yay everybody!
Any chance we could start with PBDB as a test case for this? Taxonomy Committee has concerns about losing the link to WoRMS without knowing the stability of GN.
test case
I'm not sure there's such a thing - we turn it on or we don't.
stability of GN
I'm not concerned. We lose nothing if they disappear tomorrow, we'd just stop getting updates. Worst case from there, we'd re-do what we need of what they've done (which would likely take resources similar to what they've invested). Best case - perhaps even most realistic - someone scoops their code up (I think it's all on github) and we change one URL in a config file.
Estimate on how long this would take to set up in test?
Probably about a day, if my vague plan holds up to reality.
But we need a fair amount of testing time - that OK? I feel like we would need to prod people to go mess around with it and make sure they would be happy.
Vague plans again, but I think the only thing you'd see is a new column in a code table until it's used. (Then you'd just get different data magicked into flat.)
Do note that flat contains things like a column called "family" which holds "all the terms used as rank=family for the preferred classification" and PBDB ain't got no family - http://test.arctos.database.museum/name/Microtus%20oeconomus#ThePaleobiologyDatabase. Preferring that would mean a lot of NULLs and a big whoppin' mess in full_taxon_name. The old search fields....
would never find anything, but the new ones....
... would, as long as you don't try searching for things that don't exist (like "family according to PBDB").
Yeah and that's what I want people to mess with to make sure they won't be disappointed, although I guess they could always just return to the Arctos source or whatever if they don't like it....
My main concern for testing would be WoRMS and how it would compare to WoRMS (via Arctos) as it is now. @sharpphyl and I are concerned about how often GN updates from WoRMS. Once a year is not enough; once a month might work, but that also means we might be asking too much of GN. We just don't want to lose any of the current WoRMS (via Arctos) functionality in this process.
Do note that flat contains things like a column called "family" which holds "all the terms used as rank=family for the preferred classification" and PBDB ain't got no family
So the ranks that show up in the PBDB classifications aren't searchable by rank?
Well how about that! I think my brain drowned in the sea of 'unranked clade'.
Anything with a rank (including those 'unranked clade') is searchable by rank in the "new form."
Only stuff that lands in FLAT (and is most-preferred by the collection) is searchable by the "old form."
Flat looks like this.
arctosprod@arctosutf>> \d flat
Table "public.flat"
Column | Type | Collation | Nullable | Default
-------------------------------+-----------------------------+-----------+----------+---------
collection_object_id | bigint | | not null |
cat_num | character varying(40) | | |
accn_id | bigint | | not null |
collection_id | bigint | | not null |
institution_acronym | character varying(20) | | |
collection_cde | character varying(5) | | |
collection | character varying(50) | | |
collecting_event_id | bigint | | |
verbatim_date | character varying(60) | | |
last_edit_date | timestamp without time zone | | |
individualcount | bigint | | |
coll_obj_disposition | character varying(20) | | |
collectors | character varying | | |
field_num | character varying | | |
othercatalognumbers | character varying | | |
genbanknum | character varying | | |
relatedcatalogeditems | character varying | | |
typestatus | character varying | | |
sex | character varying | | |
parts | character varying(4000) | | |
encumbrances | character varying | | |
accession | character varying(81) | | |
geog_auth_rec_id | bigint | | |
higher_geog | character varying(255) | | |
continent_ocean | character varying(50) | | |
country | character varying(50) | | |
state_prov | character varying(75) | | |
county | character varying(50) | | |
feature | character varying(50) | | |
island | character varying(50) | | |
island_group | character varying(50) | | |
quad | character varying(30) | | |
sea | character varying(50) | | |
locality_id | bigint | | |
spec_locality | character varying(255) | | |
minimum_elevation | double precision | | |
maximum_elevation | double precision | | |
orig_elev_units | character varying(2) | | |
min_elev_in_m | double precision | | |
max_elev_in_m | double precision | | |
dec_lat | double precision | | |
dec_long | double precision | | |
datum | character varying(55) | | |
orig_lat_long_units | character varying(20) | | |
verbatimlatitude | character varying(127) | | |
coordinateuncertaintyinmeters | double precision | | |
identification_id | bigint | | |
scientific_name | character varying(255) | | |
identifiedby | character varying | | |
date_made_date | timestamp without time zone | | |
remarks | character varying | | |
habitat | character varying | | |
associated_species | character varying | | |
taxa_formula | character varying(25) | | |
full_taxon_name | character varying | | |
phylclass | character varying | | |
kingdom | character varying | | |
phylum | character varying | | |
phylorder | character varying | | |
family | character varying | | |
genus | character varying | | |
species | character varying | | |
subspecies | character varying | | |
author_text | character varying | | |
nomenclatural_code | character varying | | |
infraspecific_rank | character varying | | |
identificationmodifier | character(1) | | |
guid | character varying(67) | | |
basisofrecord | character varying(17) | | |
depth_units | character varying(20) | | |
min_depth | double precision | | |
max_depth | double precision | | |
min_depth_in_m | double precision | | |
max_depth_in_m | double precision | | |
collecting_method | character varying | | |
collecting_source | character varying(15) | | |
dayofyear | bigint | | |
age_class | character varying | | |
attributes | character varying | | |
verificationstatus | character varying(40) | | |
specimendetailurl | character varying(255) | | |
imageurl | character varying(121) | | |
fieldnotesurl | character varying(121) | | |
catalognumbertext | character varying(40) | | |
collectornumber | character varying | | |
verbatimelevation | character varying(84) | | |
year | bigint | | |
month | bigint | | |
day | bigint | | |
stale_flag | bigint | | not null | 0
lastuser | character varying(38) | | |
lastdate | timestamp without time zone | | |
partdetail | character varying | | |
began_date | character varying(22) | | |
ended_date | character varying(22) | | |
id_sensu | character varying(255) | | |
preparators | character varying | | |
verbatim_locality | character varying | | |
made_date | character varying(22) | | |
event_assigned_by_agent | character varying(255) | | |
event_assigned_date | timestamp without time zone | | |
specimen_event_remark | character varying | | |
specimen_event_type | character varying(60) | | |
coll_event_remarks | character varying | | |
verbatim_coordinates | character varying(255) | | |
collecting_event_name | character varying(255) | | |
georeference_source | character varying | | |
georeference_protocol | character varying(255) | | |
locality_name | character varying(255) | | |
enteredby | character varying(255) | | |
entereddate | timestamp without time zone | | |
flags | character varying(255) | | |
nature_of_id | character varying(255) | | |
cataloged_item_type | character varying(20) | | |
previousidentifications | character varying | | |
use_license_url | character varying | | |
identification_remarks | character varying | | |
locality_remarks | character varying | | |
formatted_scientific_name | character varying(255) | | |
subfamily | character varying(255) | | |
tribe | character varying(255) | | |
subtribe | character varying(255) | | |
ispublished | character varying(10) | | |
has_tissues | bigint | | |
taxon_rank | character varying(255) | | |
last_edited_table | character varying(255) | | |
locality_search_terms | character varying | | |
json_locality | character varying | | |
related_record_cache | character varying | | |
attributedetail | jsonb | | |
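To make the point about the rank columns in flat concrete, here is a small sketch of how a column like `family` would end up NULL when the preferred classification has no term with that rank. The classifications, the comma separator, and the `flat_rank_column` helper are hypothetical; this is not the code that actually populates flat.

```python
def flat_rank_column(classification, rank):
    """Join every term with the given rank, or return None (SQL NULL) if there is none."""
    terms = [term for r, term in classification if r == rank]
    return ", ".join(terms) if terms else None


# Hypothetical classifications for the same name from two sources.
ranked = [("kingdom", "Animalia"), ("family", "Examplidae"), ("genus", "Aus")]
unranked = [
    ("unranked clade", "Cladus one"),
    ("unranked clade", "Cladus two"),
    ("genus", "Aus"),
]

family_from_ranked = flat_rank_column(ranked, "family")      # a family term
family_from_unranked = flat_rank_column(unranked, "family")  # None, i.e. NULL in flat
```

A search on the `family` column can only find records whose preferred source supplies that rank, while a search by rank over the classification terms themselves would still find the 'unranked clade' entries.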
According to this: http://gni.globalnames.org/data_sources/12 it looks like GN updates WoRMS every 60 days?
@sharpphyl look at the previous post in this thread.
GN may only update WoRMS every 60 days, but what we're doing today with WoRMS (via Arctos) is by no means updated daily or more frequently. I've spent the past few days adding taxon names every few minutes - mostly terrestrial species which WoRMS (via MolluscaBase) is adding regularly. The date of the last revision in WoRMS doesn't really predict whether or not it's in WoRMS (via Arctos) as some names added in July 2020 still weren't in WoRMS (via Arctos). Tylotoechus tchehelensis was added on 2/14/21 so I'm not surprised it's not yet in WoRMS (via Arctos). But Cochlostyla solida was added on 10-1-2-20 so it should be in, but I added it this morning. Dusty probably has some magic SQL that would identify what's been added recently - and that's just what I need for the collection.
Additionally, the refresh isn't happening with any frequency. I have Isognomon in both Isognomonidae and in Pteriidae. If I manually refresh the species entries, it updates the family.
Additionally, some taxon names never make it to WoRMS (via Arctos).
Here's what's in WoRMS for Anostoma (13 direct children including 2 that are unaccepted)
Here's what's in WoRMS (via Arctos)
And the last update to this page was in 2017
But when I look in GNI for Anostoma, despite there being 52 entries, none of them show WoRMS as the source.
Weirdest of all, when I found a WoRMS link for Isognomon isognomum this is what it shows.
Can we confirm that GNI is getting their data directly from WoRMS or is it going through another source first?
WoRMS (via Arctos) is such a massive improvement over Arctos that I'm more than willing to deal with these problems, but other collections might not be. And if using GN to access WoRMS solves these issues and the others I've mentioned - such as subgenera and taxon status - then I'd like to try it in test or someplace where I can return to WoRMS (via Arctos) without disrupting anything.
Searching GN to see what they have in the database doesn't help much, so I think I'd have to experiment in test to know for sure.
I was in a Zoom with Dimus from GN on Monday - he says he only updates twice a year....
Twice a year isn't adequate, but what we're doing now seems to be several months delay.
This should probably be in a new thread - using data from GN instead of directly from WoRMS becomes possible if we open up the model, but opening up the model doesn't necessarily mean we have to do anything with the WoRMS API.
Conversely, allowing non-local sources may mean we have to allow "not ours" term types and that could affect how we "translate" (eg we might no longer be required to) from WoRMS API, so there is some overlap between discussions.
several months delay
That could just mean that the scripts are broken. The API link could probably use 6 months of uninterrupted work, then really should get a few hours per week to make sure it's happy (or have better monitoring scripts, or both, or SOMETHING). That's all greatly complicated by not having a fully-functional test environment. GN generally just works (in part because it's a simpler approach). That certainly doesn't mean we must pipe everything through GN, just that there are different resource requirements.
@dustymc I'm reluctant to be the test on this, especially if GN's refresh is less frequent than ours. I've manually added and updated the names that we are using. Their date of entry or update in WoRMS varied from October 2020 to February 2021.
Also, I have eliminated everything in WoRMS (via Arctos) without an aphiaID. And if I understand your comment above, the aphiaID would no longer be part of the process. That's how I access the current classification so losing that would be very significant. I also love that I can go from the WoRMS (via Arctos) directly to the WoRMS page.
Is there any way to refresh just the Mollusca (171,548 out of 698,362 entries) sometime, and let the rest of the classifications catch up over time?
@Jegelewicz Maybe you have another taxonomy that can be a better test. It sounds good in concept but some of the details are a concern.
Maybe you have another taxonomy that can be a better test.
The original impetus for pushing this was that it would be helpful for NMMNH to use the paleobiology database taxonomy as a source, but from what Dusty has said it is all GN or nothing.
it is all GN or nothing.
That's not at all what I'm trying to say.
There's an AWG Meeting on the 11th - can we get this on the agenda and talk about it there?
@dustymc should this be discussed in the issues meeting this Thursday instead?
I don't think so, the social issues are really what need discussed.
How's this breaking existing functionality?
Just making it critical so it will get onto the agenda...
Are we mis-using projects? Need another label? I have a lot of stuff set up to help me promptly respond to issues that break existing functionality, none of that can work as it should if my filters are being abused.
The issues agenda is taken directly from the list of issues with those marked as "critical" at the top - we've been doing that for many months now - not sure why it is just now an issue.
Maybe a Priority-Discussion Label?
Needs technical discussion with Dusty.
TPT use case - find issue and let's try it.
I have no idea what's going on here or why this is on the AWG agenda, this seems pretty straightforward to me - someone wants to use an external source and is willing to understand what that means functionally, or not. Moving to discussions pending some actionable request.
ref: https://github.com/ArctosDB/arctos/issues/1641#issuecomment-747064815
The potential "gotcha" is that this would be explicitly giving up control of classification data. If your collection prefers SomeSource, and SomeSource does something unpredictable, then your collection's data will follow. I believe this is just a matter of documentation.
Would anyone other than @Nicole-Ridgwell-NMMNHS use this?