ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum
Apache License 2.0
57 stars 13 forks source link

WoRMS taxa with two or more classifications #2007

Closed sharpphyl closed 5 years ago

sharpphyl commented 5 years ago

In issue #1841, Dusty created a list of WoRMS (via Arctos) taxa with two classifications temp_wrms_mid(1).csv.zip.

There are 1,124 taxon names on this list. I have run them through a WoRMS match for ambiguous names and worked on the others directly. The issues I'm finding fall into several categories.

  1. WoRMS HAS MORE THAN ONE CLASSIFICATION. WoRMS taxon name is "ambiguous." Example: Zospeum lautum with aphiaID's 1001477 and 1001476. (I don't see any difference so this may be an error in WoRMS.)

Screen Shot 2019-03-25 at 4 01 36 PM

Both aphiaIDs are transferred to Arctos. I assume this is because we are organized by taxon name and not by aphiaID.

Screen Shot 2019-03-25 at 4 05 25 PM

An example of two dissimilar classifications in WoRMS is Alveolina fusiformis

Screen Shot 2019-03-25 at 4 09 54 PM

Three aphiaIDs are associated with the name: 903849, 903848, 903850.

Two of them 903848 and 903850 uploaded to Arctos.

Screen Shot 2019-03-25 at 4 12 35 PM

I deleted one yesterday not remembering that we refresh WoRMS continually, so by morning both were back.

I don't see any solution to these ambiguous (multi-classification) taxa in WoRMS (via Arctos) until we have a solution through the "taxon concept" process. Fortunately, we aren't using most of these right now, though I'm only 1/10th of the way through the list. There are over 600 of these ambiguous taxa. One interim suggestion would be create one more metadata field for Taxon Name Status - yes or no, ambiguous.

  1. ARCTOS DUPLICATES THAT NOW HAVE WoRMS ENTRIES. Somehow back on 2018-12-12 many duplicates were created with an hour of each other. Here is Bulinus globosus .

Screen Shot 2019-03-25 at 4 22 15 PM

I suspect that when we first uploaded WoRMS (via Arctos) there was no entry for this species. The WoRMS entry is dated 2019-02-07 with an update today. AphiaID 1058683.

Screen Shot 2019-03-25 at 4 23 51 PM

As soon as I add the AphiaID and refresh the entry cleans up beautifully, but it's a very time-consuming process and I suspect we'll have to do this repeatedly as new taxa are added to WoRMS.

Screen Shot 2019-03-25 at 4 25 52 PM

Dusty, you indicated in the previous issue that you don't have a system to find new taxa (aphiaIDs) in WoRMS and overwrite the existing classification. Do you have the ability to add everything that WoRMS adds and have it be a second classification within WoRMS (via ARctos) and then we can manually delete the old classification by looking for new WoRMS classifications within the "multi-classification" list? Other ideas?

  1. ARCTOS DUPLICATES NOT IN WoRMS. There is a miscellaneous category where, again, two records for WoRMS (via Arctos) were created within one hour (or minute) of each other on 2018-12-12. In some cases, they are identical and need no editing; I just delete one of them. See Calocochlia pan

Screen Shot 2019-03-25 at 4 36 54 PM

In other cases, they are weirdly dissimilar.

Screen Shot 2019-03-25 at 4 33 43 PM

So far, enough of the records have needed clean-up (typically adding ICZN or author or source) that, while it's tempting to have Dusty do some type of magic to get rid of the second classification, it's better to follow the trail to the messes that need attention.

Overall everything is going very well, but these few bumps need attention.

Priority is ultimately high but not urgent as I can clean up the records we're using in a few weeks.

dustymc commented 5 years ago

WoRMS HAS MORE THAN ONE CLASSIFICATION. WoRMS taxon name is "ambiguous." Example: Zospeum lautum with aphiaID's 1001477 and 1001476. (I don't see any difference so this may be an error in WoRMS.)

Just pick one - their duplicates (of any flavor) aren't our problem (unless we bring both IDs in).

Both aphiaIDs are transferred to Arctos. I assume this is because we are organized by taxon name and not by aphiaID.

No, the worms link is by aphiaID. I just checked my code again, found a place where it's conceivable that I could have done this if they update in a certain way. That's patched.

An example of two dissimilar classifications in WoRMS is Alveolina fusiformis Three aphiaIDs are associated with the name: 903849, 903848, 903850. Two of them 903848 and 903850 uploaded to Arctos. I deleted one yesterday not remembering that we refresh WoRMS continually, so by morning both were back.

I'm not sure why that would happen. I deleted author_text: Stache, 1883

/remind me to check http://arctos.database.museum/name/Alveolina%20fusiformis#WoRMSviaArctos tomorrow.

I don't see any solution to these ambiguous (multi-classification) taxa in WoRMS (via Arctos) until we have a solution through the "taxon concept" process. Fortunately, we aren't using most of these right now, though I'm only 1/10th of the way through the list. There are over 600 of these ambiguous taxa. One interim suggestion would be create one more metadata field for Taxon Name Status - yes or no, ambiguous.

If you eg, have the holotype for both of them, then taxon concepts could do something moderately useful here. If you don't, just delete the one you're not using. (And if my scripts are creating these I'll figure that out and stop it eventually...)

Dusty, you indicated in the previous issue that you don't have a system to find new taxa (aphiaIDs) in WoRMS and overwrite the existing classification. Do you have the ability to add everything that WoRMS adds and have it be a second classification within WoRMS (via ARctos) and then we can manually delete the old classification by looking for new WoRMS classifications within the "multi-classification" list? Other ideas?

That's well within my definition of evil - it would make your currently-unambiguous data ambiguous. I should be able to somehow notify you if a new aphiaID pops up for an existing name that doesn't have one or something??

reminders[bot] commented 5 years ago

@dustymc set a reminder for Mar 26th 2019

reminders[bot] commented 5 years ago

:wave: @dustymc, check http://arctos.database.museum/name/Alveolina%20fusiformis#WoRMSviaArctos .

sharpphyl commented 5 years ago

Good. Just one classification so far.

On Tue, Mar 26, 2019 at 3:13 AM reminders[bot] notifications@github.com wrote:

šŸ‘‹ @dustymc https://github.com/dustymc, check http://arctos.database.museum/name/Alveolina%20fusiformis#WoRMSviaArctos .

ā€” You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ArctosDB/arctos/issues/2007#issuecomment-476536474, or mute the thread https://github.com/notifications/unsubscribe-auth/AOqArRvTvPsITnmz5ieD4H5Cf1MB7Fg9ks5vaeTMgaJpZM4cJzFS .

dustymc commented 5 years ago

Yea I'm still thinking that's my fault, and happens when a name with multiple classifications shows up in worms and I happen to get them both in the same "chunk" via the API. Should be fixed - I'll now just create the first one I get. On that note, I remade the >1 classification thing, then split them into names you use and those you don't.

drop table temp_wrms_tid;

create table temp_wrms_tid as select taxon_name_id, classification_id from taxon_term where source='WoRMS (via Arctos)' group by taxon_name_id,classification_id;

drop table temp_wrms_mid;

create table temp_wrms_mid as select taxon_name_id from temp_wrms_tid having count(*) > 1 group by taxon_name_id;

alter table temp_wrms_mid add sciname varchar2(255);

update temp_wrms_mid set sciname=(select scientific_name from taxon_name where taxon_name.taxon_name_id=temp_wrms_mid.taxon_name_id);

alter table temp_wrms_mid add aidlist varchar2(255);

declare 
  a varchar2(255);
  sp varchar2(255);
begin
  for r in (select * from temp_wrms_mid) loop
    a:=NULL;
    sp:='';
    for c in (select term from taxon_term where taxon_name_id=r.taxon_name_id and source='WoRMS (via Arctos)' and term_type='aphiaid') loop
        a:=a||sp||c.term;
        sp:=';';
    end loop;
    update temp_wrms_mid set aidlist=a where taxon_name_id=r.taxon_name_id;
  end loop;
end;
/

-- used by DMNS
create table temp_wrms_dup_usd as select taxon_name_id, scientific_name from taxon_name where
taxon_name_id in 
(
  select identification_taxonomy.taxon_name_id from identification_taxonomy,identification,cataloged_item,collection,temp_wrms_mid
  where
  temp_wrms_mid.taxon_name_id=identification_taxonomy.taxon_name_id and
  identification_taxonomy.identification_id=identification.identification_id and
  identification.collection_object_id=cataloged_item.collection_object_id and
  cataloged_item.collection_id=collection.collection_id and
  guid_prefix='DMNS:Inv'
);
-- not
create table temp_wrms_dup_notusd as select taxon_name_id, scientific_name from taxon_name where
taxon_name_id in (select taxon_name_id from temp_wrms_mid) and taxon_name_id not in (select taxon_name_id from temp_wrms_dup_usd)
;

temp_wrms_mid(2).csv.zip temp_wrms_dup_notusd.csv.zip temp_wrms_dup_usd.csv.zip

A person needs to deal with the used stuff but if you want I can go through the unused and just save a random aphiaid - for those that I made that's essentially what I should have done when I created them anyway.

sharpphyl commented 5 years ago

Thank you! I'll get the ones we use fixed first, then the rest of them. Then we can figure out how to deal with aphiaIDs added since the first upload.

On Tue, Mar 26, 2019 at 8:16 AM dustymc notifications@github.com wrote:

Yea I'm still thinking that's my fault, and happens when a name with multiple classifications shows up in worms and I happen to get them both in the same "chunk" via the API. Should be fixed - I'll now just create the first one I get. On that note, I remade the >1 classification thing, then split them into names you use and those you don't.

drop table temp_wrms_tid;

create table temp_wrms_tid as select taxon_name_id, classification_id from taxon_term where source='WoRMS (via Arctos)' group by taxon_name_id,classification_id;

drop table temp_wrms_mid;

create table temp_wrms_mid as select taxon_name_id from temp_wrms_tid having count(*) > 1 group by taxon_name_id;

alter table temp_wrms_mid add sciname varchar2(255);

update temp_wrms_mid set sciname=(select scientific_name from taxon_name where taxon_name.taxon_name_id=temp_wrms_mid.taxon_name_id);

alter table temp_wrms_mid add aidlist varchar2(255);

declare a varchar2(255); sp varchar2(255); begin for r in (select * from temp_wrms_mid) loop a:=NULL; sp:=''; for c in (select term from taxon_term where taxon_name_id=r.taxon_name_id and source='WoRMS (via Arctos)' and term_type='aphiaid') loop a:=a||sp||c.term; sp:=';'; end loop; update temp_wrms_mid set aidlist=a where taxon_name_id=r.taxon_name_id; end loop; end; /

-- used by DMNS create table temp_wrms_dup_usd as select taxon_name_id, scientific_name from taxon_name where taxon_name_id in ( select identification_taxonomy.taxon_name_id from identification_taxonomy,identification,cataloged_item,collection,temp_wrms_mid where temp_wrms_mid.taxon_name_id=identification_taxonomy.taxon_name_id and identification_taxonomy.identification_id=identification.identification_id and identification.collection_object_id=cataloged_item.collection_object_id and cataloged_item.collection_id=collection.collection_id and guid_prefix='DMNS:Inv' ); -- not create table temp_wrms_dup_notusd as select taxon_name_id, scientific_name from taxon_name where taxon_name_id in (select taxon_name_id from temp_wrms_mid) and taxon_name_id not in (select taxon_name_id from temp_wrms_dup_usd) ;

temp_wrms_mid(2).csv.zip https://github.com/ArctosDB/arctos/files/3008517/temp_wrms_mid.2.csv.zip temp_wrms_dup_notusd.csv.zip https://github.com/ArctosDB/arctos/files/3008518/temp_wrms_dup_notusd.csv.zip temp_wrms_dup_usd.csv.zip https://github.com/ArctosDB/arctos/files/3008519/temp_wrms_dup_usd.csv.zip

A person needs to deal with the used stuff but if you want I can go through the unused and just save a random aphiaid - for those that I made that's essentially what I should have done when I created them anyway.

ā€” You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ArctosDB/arctos/issues/2007#issuecomment-476665333, or mute the thread https://github.com/notifications/unsubscribe-auth/AOqArYB1iw3raUA40kC3JZACApGO8YY0ks5vaivRgaJpZM4cJzFS .

sharpphyl commented 5 years ago

I'm confused. The first (alphabetic) taxon on your "used" list is _Ampelita sepulcralis Arctos ID 10968435. There are two classifications which are identical except they were created about an hour apart on 2018-12-12.

Screen Shot 2019-04-01 at 9 55 30 AM

The WoRMS match tool shows an "exact" and "accepted" match with aphiaID 1253479. As soon as I add the WoRMS aphiaID and refresh I get the full classification.

Screen Shot 2019-04-01 at 9 56 18 AM

There are about 140 of the 333 used taxa with two classifications that have an accepted WoRMS classification. Do we know how this happened and is there any way to update them to the WoRMS classification more quickly than my slow manual approach?

sharpphyl commented 5 years ago

I tried to do a bulk upload of the 48 taxa that are being used and have an exact WoRMS match. I think I only need the SCIENTIFIC_NAME and the APHIAID (plus my name and the Source). I get this error message for 47 of them

Status Count

scientific_name mismatch@ ()

Is this because there are two classifications. So is there anyway to override this and get the aphiaID into these records via bulkload of some type or is it back to manually doing it one by one?

dustymc commented 5 years ago

bulk upload

Data please.

sharpphyl commented 5 years ago

USED taxon to add aphiaID 1 Apr 2019.xlsx

dustymc commented 5 years ago

Ah - thanks. Apparently the loader is paranoid about tossing that much data. If you'll attach CSV here I can delete anything in WoRMS (via Arctos) for those names and reseed with the aphiaid.

sharpphyl commented 5 years ago

It won't let me attach a CSV so I'll email it if that's ok with you. Thanks.

dustymc commented 5 years ago

You can ZIP anything and attach, or email is fine too.

sharpphyl commented 5 years ago

Let me know if it didn't come through the email link to the GitHub issue.

On Mon, Apr 1, 2019 at 4:10 PM dustymc notifications@github.com wrote:

You can ZIP anything and attach, or email is fine too.

ā€” You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ArctosDB/arctos/issues/2007#issuecomment-478765970, or mute the thread https://github.com/notifications/unsubscribe-auth/AOqAraVtjvI0p5eNp3kuY1brFw2PibE6ks5vcoPIgaJpZM4cJzFS .

dustymc commented 5 years ago

nope, no attachment

dustymc commented 5 years ago

The email did it. Here's the code in case you need this again.

declare
  tid number;
  cid varchar2(255);
begin
  for r in (select * from temp_aphiaids) loop
    select taxon_name_id into tid from taxon_name where scientific_name=r.SCIENTIFIC_NAME;
    -- nuke it from orbit
    delete from taxon_term where source='WoRMS (via Arctos)' and taxon_name_id=tid;
    -- new ID
    SELECT SYS_GUID() into cid FROM dual;
    -- new aphiaiD is all we need
    insert into taxon_term (
        TAXON_TERM_ID,
        TAXON_NAME_ID,
        CLASSIFICATION_ID,
        TERM,
        TERM_TYPE,
        SOURCE,
        LASTDATE
      ) values (
        sq_taxon_term_id.nextval,
        tid,
        cid,
        r.aphiaid,
        'aphiaid',
        'WoRMS (via Arctos)',
        sysdate
      );
  end loop;
end;
/
sharpphyl commented 5 years ago

Can you refresh all of them, or does it matter? The aphiaID is in but they haven't refreshed yet so there's no classification. There are only 48 so I can refresh manually if need be.

dustymc commented 5 years ago

Doesn't matter, but I moved them to the front of the queue.

update cf_worms_refreshed set status='needs_refreshed'  where APHIAID in (select APHIAID from temp_aphiaids);
insert into cf_worms_refreshed (APHIAID,NAME,STATUS,TAXON_NAME_ID,KEY) (select APHIAID, SCIENTIFIC_NAME,'needs_refreshed',(select taxon_name_id from taxon_name where taxon_name.scientific_name=temp_aphiaids.SCIENTIFIC_NAME), somerandomsequence.nextval from temp_aphiaids where  APHIAID not in (select APHIAID from cf_worms_refreshed));
sharpphyl commented 5 years ago

So I'm working through the taxa in WoRMS (via Arctos) with multiple classifications and I find that I need revise the records using the identification Muricodrupa fiscellum aphia_ID 408067 to Murichorda fiscellum AphiaID 1335428. The record for Murichorda fiscellum shows that it was entered on 2019-03-23 and it is not yet in WoRMS (via Arctos).

Screen Shot 2019-04-05 at 10 39 57 AM

Should I wait and let your regular addition of new WoRMS taxa create the entry? If I go ahead and create it with the aphiaID will that screw up your updates or create a duplicate classification? Not sure I totally understand how the process works to get taxa newly added to WoRMS into WoRMS (via Arctos). I think you come through a webservice (not directly from WoRMS?) so there is some delay. As long as everything WoRMS adds shows up eventually, I'm ok. I'll just delay my updates.

Would you prefer this be a separate issue? It's part of the overall clean-up and understanding of how WoRMS (via Arctos) works.

dustymc commented 5 years ago

The service IS WoRMS. AFAIK they use the same thing to build their front end. There should be very little delay, although if they add a million names or something it'll probably take a while to catch up.

I'm not sure why that name didn't get created in Arctos when it appeared in WoRMS. I had a quick glance at the scripts and don't see any obvious reason it'd fall out, but I didn't get it. I'll poke around some more. You can't break anything by manually creating it.

sharpphyl commented 5 years ago

Ok, I'll go ahead and create it so I don't forget to update the specimen record.

dustymc commented 5 years ago

https://github.com/ArctosDB/arctos/issues/1704

dustymc commented 5 years ago

Longer version: Arctos can deal with subgenera, just not as weird embedded "traditional" strings that cause functional duplicates and make sure nobody can find what they're looking for.

I went to http://arctos.database.museum/name/Cancellaria%20solida

deleted one of the classifications

edited the other to add..

Screen Shot 2019-04-05 at 11 58 41 AM

refresh/reload

Screen Shot 2019-04-05 at 11 59 31 AM
dustymc commented 5 years ago

I'm now sure why the name didn't get created.

http://www.marinespecies.org/rest/AphiaRecordByAphiaID/464689

"scientificname": "Cancellaria (Pyruclia) solida",

UAM@ARCTOS> select isvalidtaxonname('Cancellaria (Pyruclia) solida') from dual;

ISVALIDTAXONNAME('CANCELLARIA(PYRUCLIA)SOLIDA')
------------------------------------------------------------------------------------------------------------------------
Invalid characters.

Arctos is not going to pick up subgeneric names (or anything else with parentheses-n-such, which I think is not limited to subgenus in WoRMS).

sharpphyl commented 5 years ago

Let me make sure I understand. So the taxon name will be Cancellaria solida (no subgenus) and the WoRMS classification will end with Cancellaria (Pyruclia) solida?

dustymc commented 5 years ago

yup - http://arctos.database.museum/name/Cancellaria%20solida#WoRMSviaArctos

sharpphyl commented 5 years ago

Back to taxa with two or more classifications. Is it possible to create a report in "low quality data" that enables us to check within our Source for taxon names with more than one classification? Same as the data you've run for me in this issue but for ongoing cleanup.

dustymc commented 5 years ago

report

Not easily because of the cost of the queries involved. I'm happy to re-run manually as needed, and we can investigate a cache or something if that's not sufficient.

sharpphyl commented 5 years ago

That would be fine; I was just trying to not bother you for everything. I see some SQL code at the beginning of this thread. If that hasn't changed, could I run that occasionally and get the same info?

dustymc commented 5 years ago

The code is https://github.com/ArctosDB/arctos/issues/1841#issuecomment-473989098 but it's likely to time out in the web tools (and some of it's in a language the web tools don't speak).

It's no bother. I'd rather give you tools so you can do this whenever you want, but this isn't that trivial and hopefully is a short-term problem. Let me know if asking for data does become an issue and we'll figure something else out.

sharpphyl commented 5 years ago

Ok. It will be several weeks before I'm ready for a rerun of the code to see if I got everything. I'm knocking out the duplicates but it's a slow process as some of the duplicates need more than a bit of editing.

sharpphyl commented 5 years ago

Ok, I think I've eliminated all the duplicate classification in WoRMS (via Arctos). @dustymc can you run the SQL again and see if I missed anything or anything new has snuck in? If it looks good, we can close this issue.

Just for the record, of the 1,124 taxa on the initial list, 245 of them were not homonyms but a duplicate classification created on 12 Dec 2018 probably when we were doing the initial upload. They were mostly non-WoRMS taxa that I had added to WoRMS (via Arctos) that somehow got in twice.

An additional 212 had a WoRMS aphiaID (based on using the WoRMs match tool) but had some other issue such as being unaccepted or misspelled that might have caused a burp. If there was an aphiaID for an accepted taxon, I kept that one.

That left 667 that are honomyns in WoRMS, most of which no one is using. Of them more than half are Chromista, more than a quarter are Plantae, and the rest are a mixture of 9 other categories. If there was a date on the authority, I kept the earlier one. Otherwise, I just deleted one of them.

dustymc commented 5 years ago

temp_wrms_mid(3).csv.zip

sharpphyl commented 5 years ago

OK, let's try it one more time and see if I got everything. Thanks!

I'm thinking we should run this every 6-12 months. Do you have a queue for such reports?

dustymc commented 5 years ago

Yup, 0 rows updated. Yay!!

No, there's no such queue. I suppose I could somehow set one up, but I'm not sure how this could happen other than eg, trying to figure something entirely new out (which is where I think these all came from). Definitely new Issue if you want to go there.