Closed mvzhuang closed 4 months ago
Is there an easy way of loading this into Arctos?
1 would be good.
Yes (1) is great (and technically trivial on this end, I think) - but https://arctos.database.museum/name/Alnus isn't very promising, might be worth contacting them?
Dima at GN is easy to work with, but I can't imagine how this could be something he can deal with.
A million names is probably not going to make it through the UI/uploader either. I could (probably) pull directly into Arctos, but I don't really want another job if I can avoid it....
@mvzhuang how about we reduce the names to those currently in use - then classifications can be added for names as they are needed? This is what I have been doing for TPT Acari stuff.
Then I would just need to figure out a way to get the classifications for only those names...
Are people using R to manipulate stuff like this instead of Excel for these large sources?
I could try to do this and if I can make it work, I could then limit the classifications to the terms you are currently using - but then there are relationships between names to deal with (I have passed over this for TPT, but I am sure it will come up eventually)
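The subsetting idea discussed above could be sketched roughly as follows: keep only the backbone rows whose name is already in use locally. `scientificName` and `taxonID` are standard Darwin Core terms, but whether the WFO backbone file uses exactly these columns is an assumption here, and the function name, identifiers, and sample rows are hypothetical.

```python
import csv
import io


def filter_backbone(rows, names_in_use):
    """Keep only backbone rows whose scientificName is in use locally."""
    wanted = {name.strip() for name in names_in_use}
    return [row for row in rows if row.get("scientificName", "").strip() in wanted]


# Tiny stand-in for a tab-delimited backbone file (hypothetical rows and IDs).
sample = io.StringIO(
    "taxonID\tscientificName\n"
    "id-1\tAlnus\n"
    "id-2\tErigeron acris\n"
    "id-3\tPlantago major\n"
)
rows = list(csv.DictReader(sample, delimiter="\t"))

# Names currently used in identifications (hypothetical list).
subset = filter_backbone(rows, ["Alnus", "Plantago major"])
print([row["scientificName"] for row in subset])
```

This only matches exact name strings; relationships between names (synonyms and so on) would still need separate handling, as noted above.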
I'll help however I can!
Ideally, we have a documented process as the list is released twice a year and we will want to update then....
> help however I can!
By far the most productive would be getting someone to do something awesome involving GlobalNames, which was funded to do exactly what needs done here (and, near-uniquely, actually does what they claimed they'd do!).
Distant second would be figuring out how to move a million names around, which is probably going to run into all sorts of infrastructure problems, may or may not work the next time this comes up, etc., etc., etc.
I don't think there's any future in hacking a comprehensive resource up into bite-sized pieces, but I suppose it might come down to that.
> but https://arctos.database.museum/name/Alnus isn't very promising
I just refreshed Global Names and https://arctos.database.museum/name/Alnus#WorldFloraOnline
Seems promising enough?
@mvzhuang @dustymc I have been messing around with a download of the World Flora and I feel pretty confident that I can manipulate it in R to get an Arctos classification upload. However:
How or do we want to proceed with this request?
(1) and (2) - IDK, that's weird, if you modify it then it's not really their classification anymore, eh? (Sorta...) No idea....
(3) Good question, that needs a deep review. Echidna-the-eel ain't got nuthin' to do with spiky mammalish critters, our current relationship structure is wholly incapable of dealing with that and needs revised. Something about classifications, I suppose, but IDK needs an Issue.
(4) - yea, we really need to treat Authorities like Authorities; we should be adding names because someone has some need and knowledge, not because some apparently-very-limited resource barfed something out onto the interwebs.
> How or do we want to proceed
I was planning on adding an 'editor' flag to https://arctos.database.museum/info/ctDocumentation.cfm?table=cttaxonomy_source (or something like that, I need to experiment) so we can just use globalnames because https://github.com/ArctosDB/arctos/issues/6500#issuecomment-1626322464, but ??
> I was planning on adding an 'editor' flag to https://arctos.database.museum/info/ctDocumentation.cfm?table=cttaxonomy_source
I think this would be a fine place to start - BUT we need to be clear that GlobalNames has transformed this data - it is NOT the actual data supplied by the resource (World Flora) and GlobalNames does NOT add everything from the original source (as was demonstrated with WoRMS) so important information may be missing (like the World Flora name identifiers that get you back to their data in DES fashion).
http://www.worldfloraonline.org/taxon/wfo-4000001331
looks a lot more complex than what we are getting from GlobalNames
Also, passing that World Flora identifier makes it pretty clear what taxon you are talking about? It really sucks that GlobalNames doesn't provide these identifiers along with the rest of their data.
> GlobalNames has transformed this data
I believe that this is a case of GIGO and the DWC-A (or whatever) that's been provided to globalnames has just been passed on. If that's the case (@dimus might confirm) then the appropriate thing to do would be to ask the 'providers' to do better. (I assume that they assume that nobody's going to USE this stuff and just do whatever's easiest, and might be happy to do more if given a reason, but thatsa lotta assumin' so....).
Orders and family data is in another castle...: http://www.worldfloraonline.org/resource/32768 except that the link is broken..
uh maybe i'll email them and ask if they have a magical file somewhere... or ...a rec from them.
@dustymc yes, the data from WFO is coming from their DwCA file https://www.worldfloraonline.org/downloadData;jsessionid=0C2429124EB6BCA3822B00CD824AA02D
Looks like they did not make a new version yet for 2023?
Hello, @dimus told me about this thread, so let me start by summarizing what we are providing in World Flora Online (WFO):
We generate a static copy of the names and taxonomic data every six months and store it in Zenodo at https://doi.org/10.5281/zenodo.7460141. Each name has a WFO-ID: "wfo-{10digits}". This is the source of the Taxonomic Backbone of the WFO Portal and the World Flora Online Plant List. The information in this file is stored in several different redundant formats and previous versions are kept related. Our latest version of the Taxonomic Backbone is June 2023 and we are currently updating the WFO Portal with this version (current Portal version is March 2023).
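As a small illustration of the "wfo-{10digits}" identifier format, the sketch below checks an identifier and builds a taxon link in the same shape as the WFO taxon URL quoted elsewhere in this thread. The function names are hypothetical, and treating the identifier as strictly lowercase is an assumption.

```python
import re

# "wfo-" followed by exactly ten digits, per the format described above.
WFO_ID_PATTERN = re.compile(r"wfo-[0-9]{10}")


def is_wfo_id(value):
    """Return True if value looks like a WFO-ID such as wfo-4000001331."""
    return WFO_ID_PATTERN.fullmatch(value) is not None


def wfo_taxon_url(wfo_id):
    """Build a taxon link in the same shape as the URLs quoted in this thread."""
    if not is_wfo_id(wfo_id):
        raise ValueError(f"not a WFO-ID: {wfo_id!r}")
    return f"http://www.worldfloraonline.org/taxon/{wfo_id}"


print(is_wfo_id("wfo-4000001331"))
print(wfo_taxon_url("wfo-4000001331"))
```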
From what I read above, you might be interested in the following files:
_DwC_backbone_R.zip This file contains (not "deleted") names in a single Darwin Core Archive file.
We also keep a copy of the latest version of this Taxonomic Backbone file.
families_dwc.tar.gz is a big compressed file with all 718 Plant families in WFO, each family zipped itself in an individual Darwin Core Archive file.
As we update the Portal with the latest Taxonomic Backbone, we also keep copies of these files at: https://files.worldfloraonline.org/files/WFO_Backbone/{FamilyName}, where {FamilyName} is the name of the folder where the corresponding DwCA file for each family is stored. You'll find there is not only the single {FamilyName}.zip DwCA file, but also the expanded component files extracted, so you can see the classification file and the metadata definition - meta.xml - as well.
OrdersAndFamilies.zip. We also keep a DwCA file with the list of names in the higher taxonomy (above families). Although the information is already included in other files, this file in particular has not been included in Zenodo...yet! But you can find all the contents stored at: https://files.worldfloraonline.org/files/WFO_Backbone/WFO_Consortium/.
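For the per-family files described above, the URL pattern could be assembled along these lines. This is a sketch only: the base URL and the {FamilyName} folder come from the description above, but the assumption that {FamilyName}.zip sits directly inside that folder has not been checked, and the function names are hypothetical.

```python
BASE_URL = "https://files.worldfloraonline.org/files/WFO_Backbone"


def family_folder_url(family_name):
    """Folder holding the DwCA file and its extracted components for one family."""
    return f"{BASE_URL}/{family_name}"


def family_archive_url(family_name):
    # Assumption: the {FamilyName}.zip archive sits directly inside the family folder.
    return f"{family_folder_url(family_name)}/{family_name}.zip"


print(family_folder_url("Asteraceae"))
print(family_archive_url("Asteraceae"))
```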
One clarification: we never delete names, we "deprecate" them, so no WFO-ID will be reused for anything different once used. A couple of the files (one of them stored in Zenodo) contain these deprecated names too, in case that you might need them. We also keep a folder archive with all the previous information used, so no file is actually deleted.
Finally, all the WFO taxonomic information is CC0, so please, go ahead and use it and if we can help with some more information or a new format, let us know! For this issue, in particular, if it's better/easier for you, I'll be glad to coordinate with @dimus on what it is that GlobalNames would need in order to address this issue.
Pura Vida
@WUlate thank you - I did download the latest _DwC_backbone_R.zip file and that is what is being discussed here. Our issue is mostly that we don't currently need the entire dataset and the 'manual' process of downloading and transforming some subset will need to be completed by someone on our end twice a year. I am not sure we have the resources to make that happen plus combine the _DwC_backbone_R.zip and the OrdersAndFamilies.zip to make the classifications complete.
We are able to use GlobalNames, but what is missing for us are the WFO identifiers, which I don't see in GlobalNames (this is also true for WoRMS).
@WUlate yes thank you, and @dimus thanks for getting us all together.
Arctos has been automagically pulling data from GlobalNames for a while now, and it's very useful for discovery or occasionally cloning into a "local" source, but now we'd like to just use what's in GlobalNames for cataloging. A collection preferring World Flora Online as a taxonomic source, when identifying a record to "Alnus," would be (through some Arctos magic) using the taxonomic information in https://arctos.database.museum/name/Alnus#WorldFloraOnline, for example. I'll let the collections folks address the details of what's necessary for them to use that, but I think the short version is that they'd like it to contain everything it possibly can.
(The missing piece for me, outlier I may be, is whatever we're calling Plantae these days. Arctos contains insects and motorcycles and meteorites and gemstones - all with their own taxonomies - in addition to plants, and high-level ways of separating those things taxonomically are valuable, even if a botanist working in a plant collection would never have use for them.)
@dustymc and @Jegelewicz did you see the `recordId` field in the JSON format of gnverifier output? It should provide the WFO ID.
https://verifier.globalnames.org/?capitalize=on&ds=196&format=json&names=Plantago+major
and for a link back to WFO there is also "outlink": "http://www.worldfloraonline.org/taxon/wfo-0000486544"
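A minimal sketch of how the query URL above could be assembled and how the two fields mentioned (`recordId` and `outlink`) might be read from one result record. No request is made here. The record is a hypothetical, trimmed stand-in whose `recordId` value is assumed to match the ID in the outlink quoted above; where such a record sits inside the full JSON response is not shown in this thread, so that part is left out.

```python
from urllib.parse import urlencode

VERIFIER_URL = "https://verifier.globalnames.org/"


def verifier_query_url(name, data_source_id=196):
    """Rebuild the query URL quoted above (ds=196 is the value used there for WFO)."""
    params = {
        "capitalize": "on",
        "ds": data_source_id,
        "format": "json",
        "names": name,
    }
    return VERIFIER_URL + "?" + urlencode(params)


def wfo_identifiers(result):
    """Return (recordId, outlink) from one result record; None where a field is absent."""
    return result.get("recordId"), result.get("outlink")


# Hypothetical, trimmed record containing only the two fields named in the thread.
sample_result = {
    "recordId": "wfo-0000486544",
    "outlink": "http://www.worldfloraonline.org/taxon/wfo-0000486544",
}

print(verifier_query_url("Plantago major"))
print(wfo_identifiers(sample_result))
```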
@dustymc Could @mvzhuang use the GN World Flora Online as a preferred source yet? Seeing it in action might help us close this or figure out what else needs doing....
Yep recipe in https://github.com/ArctosDB/arctos/issues/6671
@mvzhuang see the revised code table request - are you willing to test drive?
revised
"created with a name matching that which comes from GlobalNames" is a functional requirement.
"created with a name matching that which comes from GlobalNames" is a functional requirement.
Not sure what I am supposed to do with that?
Agh sorry will put this on my calendar for next week. I have to clear out the fossil/taxidermy room by end of month so haven’t been paying attention. But yes will look!
From: Teresa Mayfield-Meyer Sent: Thursday, November 9, 2023 1:54 PM To: ArctosDB/arctos Cc: Zhuang, Mingna; Mention Subject: Re: [ArctosDB/arctos] Code Table Request - New Taxonomy Source World Flora Online (via GlobalNames) (Issue #6500)
I added this, someone use it.
Working on it!!!!!!
On Mar 1, 2024 7:00 AM, dustymc wrote:
@dustymc
> I added this, someone use it.
Just made WFO rank 1 for our 3 collections. Yay! Thanks
@camwebb (and whoever) - let me know if you want me to force a refresh of the cache, changing the collection's preferences won't do that.
You can
wait for that to clear (reload until the button returns to that state) then click the eyeball...
to see what happens.
And of course I picked a terrible example that does just what you brought up in https://github.com/ArctosDB/arctos/issues/7478...
https://arctos.database.museum/name/Erigeron%20acris#WorldFloraOnline
and the record is https://arctos.database.museum/guid/UAM:Herb:22271
I think there are still some problems with the WFO import and am reopening the issue.
-- Distinct identification names in collections 6, 40 and 106 that have World Flora Online terms.
SELECT DISTINCT A.scientific_name
FROM (
    SELECT DISTINCT identification.scientific_name
    FROM identification
    INNER JOIN flat
        ON flat.collection_object_id = identification.collection_object_id
    WHERE flat.collection_id IN (6, 40, 106)
) AS A
LEFT JOIN taxon_name
    ON A.scientific_name = taxon_name.scientific_name
LEFT JOIN taxon_term
    ON taxon_name.taxon_name_id = taxon_term.taxon_name_id
    AND taxon_term.source = 'World Flora Online'
-- Requiring a non-NULL taxon_term row keeps only names that matched a WFO term.
WHERE taxon_term.taxon_name_id IS NOT NULL
small part of WFO was imported
Yea, it's not exactly an import, just Arctos pulling from GN when it can (which is a not-so-quick cycle). I can/will set some stuff to prioritize, you can always manually force that with..
import error?
There's some links on eg https://arctos.database.museum/name/Erigeron%20acris
(I'll add the one I actually use, https://resolver.globalnames.org/name_resolvers.json?names=Erigeron%20acris)
Looks like that's what's being sent. The contacts for both the source and the 'packager' are above; they've always been super easy to work with and I'm sure would appreciate feedback from real-world users if you know how that might be better handled. (I would anyway, always seems to me that we build this super-cool infrastructure and then, most of the time - crickets. Seeing stuff ACTUALLY do what we've been talking about forever is great!)
Got it!
`current_taxon_id`, but should not 'overwrite' the valid classification of a valid name. I'll contact them as you suggest. Thanks (please close issue again if this seems dealt with)
@camwebb I tried
And this is what I am getting:
does it look reasonable, or something is not right?
@dimus Thanks for picking up on this. I was just about to contact you by email.
I checked what is coming in from WFO and they do not include a hierarchical classification in the record for wfo-0000067521 (Erigeron acris C.B.Clarke), but point to it being a synonym of wfo-0000009039 (Erigeron pulchellus Michx.). So the classification at GN must be added by GN.
My concern is this (and please forgive my bold text and exclamation mark above): the (JSON) `classification_path` for Erigeron acris C.B.Clarke is "Plantae|Pteridobiotina|Angiosperms|Asterales|Asteraceae|Asteroideae|Astereae|Conyzinae|Erigeron|Erigeron pulchellus". The `current_taxon_id` and `current_name_string` are given, indicating that the name is a synonym. So would it not be better/more correct for the `classification_path` for Erigeron acris C.B.Clarke to be "Plantae|Pteridobiotina|Angiosperms|Asterales|Asteraceae|Asteroideae|Astereae|Conyzinae|Erigeron|Erigeron acris"? It's not a huge deal, but once it has been imported into Arctos it appears that the species of name Erigeron acris is Erigeron pulchellus, which is confusing.
So, to summarize, would GN reconsider the method of generation of classification paths for names that are synonyms? But there may be a reason it was decided to do it the way that it is. Thanks!
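The situation described above can be detected mechanically. As a rough sketch, the check below splits a pipe-delimited classification path and tests whether its last element is the name being looked up; when it is not, the path belongs to a different (accepted) name. The function names are hypothetical, and the path and name strings are the ones quoted above.

```python
def path_leaf(classification_path):
    """Return the last (most specific) element of a pipe-delimited path."""
    return classification_path.split("|")[-1]


def leaf_matches_name(classification_path, name_string):
    """True if the path ends in the canonical form of name_string.

    The name string may carry an author suffix, so accept either an exact
    match or the leaf followed by a space.
    """
    leaf = path_leaf(classification_path)
    return name_string == leaf or name_string.startswith(leaf + " ")


path = (
    "Plantae|Pteridobiotina|Angiosperms|Asterales|Asteraceae|Asteroideae|"
    "Astereae|Conyzinae|Erigeron|Erigeron pulchellus"
)

print(path_leaf(path))
print(leaf_matches_name(path, "Erigeron acris C.B.Clarke"))
print(leaf_matches_name(path, "Erigeron pulchellus Michx."))
```

A False result for the looked-up name is the signal that the classification shown is that of the accepted name, not of the synonym itself.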
@camwebb, here was my way of thinking about classification paths:
Assuming that a classification contains names of taxa, I felt that a synonym name would be out of place, so I put the actual name of the taxon instead (according to the data source). This approach is consistent across gnverifier and, to my knowledge, some people depend on it. So it can be modified, but that would be costly for some people.
@dimus Thanks for explaining your thinking. I can now see pros and cons both ways. So no change is requested - we can work with it as is. It's great to have WFO in Arctos :tada:
> Yea, it's not exactly an import, just Arctos pulling from GN
And we should make that VERY clear - GN is NOT the current WFO - it could be a year or more out of date and it DOES NOT have all of the information available in WFO. GN is great, but not doing what I think we really need (which is why WoRMS (via Arctos) exists as it does).
@camwebb I think WFO has probably refreshed by now, here's a slightly different current view of your data:
-- For each taxon name used in identifications in the three UAM collections, count its World Flora Online terms.
create table temp_uamherbwfo as select
    taxon_name.taxon_name_id,
    taxon_name.scientific_name,
    count(taxon_term.taxon_name_id) as has_wfo
from
    collection
    inner join cataloged_item on collection.collection_id=cataloged_item.collection_id
    inner join identification on cataloged_item.collection_object_id=identification.collection_object_id
    inner join identification_taxonomy on identification.identification_id=identification_taxonomy.identification_id
    inner join taxon_name on identification_taxonomy.taxon_name_id=taxon_name.taxon_name_id
    -- the outer join keeps names with no WFO terms, which get has_wfo = 0
    left outer join taxon_term on taxon_name.taxon_name_id=taxon_term.taxon_name_id and taxon_term.source='World Flora Online'
where
    guid_prefix in ('UAM:Herb','UAM:Alg','UAMb:Herb')
group by
    taxon_name.taxon_name_id,
    taxon_name.scientific_name
;
SELECT 13417
select count(*) from temp_uamherbwfo where has_wfo > 0;
count
-------
8067
(and thanks for the ticticticsql trick!)
Maybe you can find some fillable gaps somewhere in there.
Your record cache won't have refreshed from that, lemme know if you want me to start that.
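To put the two counts above in proportion, a quick worked calculation (the numbers are the ones reported by the queries above; nothing else is assumed):

```python
total_names = 13417      # rows in temp_uamherbwfo (one per taxon name)
names_with_wfo = 8067    # rows where has_wfo > 0

missing = total_names - names_with_wfo
coverage = names_with_wfo / total_names

print(f"{names_with_wfo} of {total_names} names have WFO terms ({coverage:.1%}); {missing} do not")
```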
@Jegelewicz @camwebb currently GN updates many 'popular' data-sources 2 times a year, and usually it happens after biannual updates of WFO, so it would be usually up to date with WFO dumps.
I am exploring a faster way to get data in, if it works out, some resources could be updated more often.
@dustymc This is looking very promising, except for one key issue: as it is currently imported, the 'name string' term (non-rank) in the current Arctos source WFO is derived from the GN JSON `current_name_string`, not `name_string`, so there is no way for the reader of the classification to determine the correct original->related relationship based on the original Author string of a specimen. Using the Erigeron acris example above, what we need is two entries tied to the canonical name 'Erigeron acris':
It would be great to change this and re-import when you get a chance.
@dimus As I better understand the 'related taxa' search algorithms in Arctos, I now absolutely agree with your choice to give the classification of the accepted name, not the original name. In Arctos, this means that the synonymy relationship is contained within the classification, and does not have to be added to the `taxon_relation` table (which is problematic anyway, because this table contains only canonical name mappings and could not handle the case of Erigeron acris above).
> change this and re-import
@camwebb I'm not manipulating anything, I'm just writing whatever I get from GN to the DB. You can debug-run the import script with eg https://arctos.database.museum/ScheduledTasks/globalnames_refresh.cfm?debug=true&name=Erigeron%20acris - maybe you can see something I'm doing wonky in there??
> added to the taxon_relation
FWIW, Arctos cannot properly handle relationships (or common names or any other legacy thing that assumes a taxon name means one thing), I've been waiting (not very eagerly!) for someone to request moving them to classifications, something like we get from GN. (I've also been writing stuff to Arctos Relationships to better support search, but there's not really anything that might be mistaken for a relationship in there either.) In the current setup, there's absolutely no way of knowing if https://arctos.database.museum/name/Epeorus is a synonym of a bug, a wet mess, an element, laundry equipment, etc. Wonder if there's some proposal to figure that out and maybe better sync something up with @dimus somewhere in there?
> I'm just writing whatever I get from GN to the DB.
But is it possible to add a term to the import? If you look at Array 17, WFO for Erigeron acris C.B. Clarke, there is an entry: matchedName --- string --- Erigeron acris C.B.Clarke. This is needed as a non-rank term in the WFO classification for the user to understand the synonymy. Maybe it could be called 'original name string' to accompany the existing 'name string', which is 'Erigeron pulchellus Michx.'. Better yet if the current field title 'name string' is switched to 'current name string'.
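The proposal above could look something like the following sketch, which emits both non-rank terms from one record. The term labels are the ones proposed in this comment; the `name_string` and `current_name_string` keys are the GN JSON fields mentioned earlier in the thread; the record itself and the function name are hypothetical, and this is not how the Arctos import script is actually written.

```python
def name_string_terms(record):
    """Build (term type, value) pairs for the non-rank name-string terms."""
    terms = []
    original = record.get("name_string")
    current = record.get("current_name_string")
    if original:
        terms.append(("original name string", original))
    if current:
        terms.append(("current name string", current))
    return terms


# Hypothetical record using the Erigeron acris example from this thread.
record = {
    "name_string": "Erigeron acris C.B.Clarke",
    "current_name_string": "Erigeron pulchellus Michx.",
}

for term_type, value in name_string_terms(record):
    print(f"{term_type}: {value}")
```

With both terms present, a reader of the classification can see which original name led to which current name.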
> I've been waiting (not very eagerly!) for someone to request moving them to classifications, something like we get from GN.
Yes. This would be great. I will make that unwelcome request!
@camwebb @mvzhuang Can we close this?
Yes! I think we’re good, just waiting for the rest to pull from GlobalNames slowly!
From: Teresa Mayfield-Meyer Sent: Thursday, May 9, 2024 10:19 AM To: ArctosDB/arctos Cc: Zhuang, Mingna; Mention Subject: Re: [ArctosDB/arctos] Code Table Request - New Taxonomy Source World Flora Online (via GlobalNames) (Issue #6500)
Instructions
This is a template to facilitate communication with the Arctos Code Table Committee. Submit a separate request for each relevant value. This form is appropriate for exploring how data may best be stored, for adding vocabulary, or for updating existing definitions.
Reviewing documentation before proceeding will result in a more enjoyable experience.
Initial Request
Goal
Create new taxonomy source that reflects current plant taxonomy better
Context
Arctos plant taxonomy for a few families is outdated, and World Flora Online appears to be more up to date.
Table
(https://arctos.database.museum/info/ctDocumentation.cfm?table=cttaxonomy_source)
Proposed Value
Source: ~~World Flora Online (via Arctos)~~ World Flora Online (via GlobalNames)
Documentation:
~~Data from World Flora Online Taxonomic Backbone and must be manually updated.~~ Classifications from World Flora Online as retrieved from GlobalNames. This source is not editable locally and may be one or more versions behind the original source (World Flora Online).
Collection type
Herb
Taxonomic backbone is here: 10.5281/zenodo.7460141 Is there an easy way of loading this into Arctos?
Helpful Actions
[x] Add the issue to the Code Table Management Project.
[x] Please reach out to anyone who might be affected by this change. Leave a comment or add this to the Committee agenda if you believe more focused conversation is necessary.
@ArctosDB/arctos-code-table-administrators
Approval
All of the following must be checked before this may proceed.
_The How-To Document should be followed. Pay particular attention to terminology (with emphasis on consistency) and documentation (with emphasis on functionality). No person should act in multiple roles; the submitter cannot also serve as a Code Table Administrator, for example._
Rejection
If you believe this request should not proceed, explain why here. Suggest any changes that would make the change acceptable, alternate (usually existing) paths to the same goals, etc.
Implementation
Once all of the Approval Checklist is appropriately checked and there are no Rejection comments, or in special circumstances by decree of the Arctos Working Group, the change may be made.
[ ] Review everything one last time. Ensure the How-To has been followed. Ensure all checks have been made by appropriate personnel.
[ ] Add or revise the code table term/definition as described above. Ensure the URL of this Issue is included in the definition.
Close this Issue.
DO NOT modify Arctos Authorities in any way before all points in this Issue have been fully addressed; data loss may result.
Special Exemptions
In very specific cases and by prior approval of The Committee, the approval process may be skipped, and implementation requirements may be slightly altered. Please note here if you are proceeding under one of these use cases.