Closed dschwilk closed 10 years ago
There are big differences between this new list and the Zanne taxon list, https://github.com/Fireandplants/plant_gbif/blob/master/query_names/Zanne_GBIF_taxa_list_Aug2011.txt
See the code in 53248d9. Pasted below
## Check this list against the new one created using synonymize.py and TPL1.1
## data
tanknames.expanded <- read.table("../query_names/taxa_for_bigphylo_gbif_query_04_08_14.txt", sep = ",")
tanknames.expanded <- as.character(tanknames.expanded[ , 1])
## all_sp is defined earlier in the script (presumably the Zanne list names)
not.in.new <- all_sp[!(all_sp %in% tanknames.expanded)]
length(not.in.new)
## What? There are a lot of taxa in the Zanne list that are not in the tank
## tree and don't have synonyms according to TPL1.1. Perhaps because the Zanne
## et al. list included potential gender changes? E.g. "Abarema adenophora" is in
## the Zanne list, is not in the tank tree, and is not a synonym or "sister
## synonym" of anything in the tank tree. It is an accepted name, as is "Abarema
## adenophorum" -- perhaps the Zanne list included this as a synonym based on
## matching? I'm not sure.
Edit: OK, nope. According to TPL1.1, no synonym of A. adenophorum should be there either. If the difference were a couple of thousand names I would not worry, but I'm puzzled by the huge (50k!) discrepancy. It suggests no concordance between my use of TPL1.1 and the earlier use of TPL1.0 in Zanne et al., and I want to understand it before trusting my expanded names list, even though I have checked it against the synonymy table Beth provided and done some manual lookups on TPL.
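One way to test the "gender change" guess above would be to split the missing names into those that differ from a name in the new list only by the epithet's gender ending and those that are missing outright. A minimal Python sketch, assuming plain binomials and only the -a/-us/-um endings; the function names and the toy lists are made up for illustration and are not part of synonymize.py:

```python
# Hypothetical helper: swap common Latin gender endings on the epithet.
GENDER_ENDINGS = ("a", "us", "um")


def gender_variants(name):
    """Return the names formed by swapping the epithet's gender ending."""
    genus, _, epithet = name.partition(" ")
    variants = set()
    for ending in GENDER_ENDINGS:
        if epithet.endswith(ending):
            stem = epithet[: -len(ending)]
            for other in GENDER_ENDINGS:
                if other != ending:
                    variants.add(f"{genus} {stem}{other}")
    return variants


def split_missing(old_names, new_names):
    """Return (all missing names, missing names that match after a gender swap)."""
    new_set = set(new_names)
    missing = [n for n in old_names if n not in new_set]
    gender_only = [n for n in missing if gender_variants(n) & new_set]
    return missing, gender_only


# Toy lists, invented for illustration only.
old_list = ["Abarema adenophora", "Quercus alba", "Acer rubrum"]
new_list = ["Abarema adenophorum", "Quercus alba"]
missing, gender_only = split_missing(old_list, new_list)
```

If `gender_only` turned out to be a small fraction of `missing` on the real lists, gender variants alone could not explain a 50k gap.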
Thanks Dylan. Treating things as sister, as in your first email, mostly makes sense. However, sometimes mappings can change families, etc. I wonder whether it makes sense to treat as sister only when you remain in close tree space?
Dan and I were trying to figure out the differences in the lists. I think we sent GBIF a much bigger list than what we were able to match to GenBank. We sent GBIF anything we could get records for and then any synonyms we could find. The matching with GenBank came later. My guess is that this explains the discrepancies you found. For instance, I think we sent GBIF ~85K names but only had 32K names in the Tank tree. Does this make sense, or do you think there is more to it than that?
Best, Amy
Dr. Amy Zanne
Department of Biological Sciences
2023 G St. NW
George Washington University
Washington, DC 20052
Office: 352 Lisner Hall
Office Phone: (202) 994-8751
Lab: 409 Bell Hall
Lab Phone: (202) 994-9613
Fax: (202) 994-6100
Website: http://www.phylodiversity.net/azanne/
Hi Amy,
Regarding the Zanne list and my script based on TPL1.1: that makes a lot of sense to me! I kept finding names in the list you sent whose origin I could not figure out (at least according to TPL1.1). We are working in the other direction on this bigphylo project: starting with the Tank tree names.
Regarding "sister synonyms": I need to check that and see. Yes, perhaps my algorithm "over-matches". But at least using the Tank tree, it never expands out to another existing name, so I don't think we can check tree distance. If it did, it would not be perfectly reversible. In theory, one could hand my program a names list that was not reversible (expand then merge), but the Tank et al. tree works: the expand and merge steps result in identical lists. Each name expands out, but the resulting synonym list never contains another Tank tree canonical name. We could restrict mappings based on taxonomy, say, never move families?
-Dylan
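The reversibility condition described above (expand, then merge, and get the same list back) can be sketched roughly as follows. This is only an illustration of the idea, not the actual synonymize.py code; the function names, the dict layout of the synonym table, and the placeholder names are all assumptions:

```python
def expand(canonical_names, synonyms):
    """Map each canonical name to itself plus its listed synonyms."""
    return {name: {name} | set(synonyms.get(name, ())) for name in canonical_names}


def merge(expanded):
    """Invert an expansion: map every expanded name to its canonical name(s)."""
    back = {}
    for canonical, names in expanded.items():
        for name in names:
            back.setdefault(name, set()).add(canonical)
    return back


def is_reversible(canonical_names, synonyms):
    """True if every expanded name merges back to exactly one canonical name."""
    back = merge(expand(canonical_names, synonyms))
    return all(len(targets) == 1 for targets in back.values())


# Placeholder names, not real taxa.
tree_names = ["Aus bus", "Cus dus"]
clean_table = {"Aus bus": ["Aus ba"], "Cus dus": ["Cus da"]}
# Here one canonical name expands onto another canonical name.
clashing_table = {"Aus bus": ["Cus dus"]}

ok = is_reversible(tree_names, clean_table)
clash = is_reversible(tree_names, clashing_table)
```

The check fails both when a synonym list contains another canonical name and when two canonical names share a synonym, which are the two ways a merge could become ambiguous.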
Hi Dylan, Great, glad that makes sense. It took us a while to winnow down to a stable list of names, and even to settle on the database we could use to map to. We didn't have access to The Plant List until later, so we had to start with IPNI.
I don't think we should be super fussed about under/over matching. At the scale we are working, it's all pretty gross, but we want to be consistent. It would be good to flag and investigate those sister synonyms that map outside of, say, a family (genera are too unstable anyway) and then decide if we believe it. My guess is there will be examples but not a ton. For instance, if we are working around in the area of the flacourts or euphorbs, my guess is we will pick up family switches, since those were a mess until a decade or so ago. Does that make sense?
Best, Amy
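Flagging synonym mappings that cross family boundaries could look something like the sketch below. It assumes a family lookup is available for both the synonym and the accepted name; `flag_family_switches`, the data layout, and the placeholder names are hypothetical, not something that exists in the repo:

```python
def flag_family_switches(mappings, family_of):
    """Return (synonym, accepted, synonym_family, accepted_family) for each
    mapping whose two names are assigned to different families."""
    flagged = []
    for synonym, accepted in mappings:
        fam_syn = family_of.get(synonym)
        fam_acc = family_of.get(accepted)
        # Skip names with no known family rather than guessing.
        if fam_syn is None or fam_acc is None:
            continue
        if fam_syn != fam_acc:
            flagged.append((synonym, accepted, fam_syn, fam_acc))
    return flagged


# Placeholder data, not real taxonomy.
mappings = [("Aus ba", "Aus bus"), ("Xus ya", "Cus dus"), ("Zus za", "Cus dus")]
family_of = {
    "Aus ba": "FamilyA",
    "Aus bus": "FamilyA",
    "Xus ya": "FamilyB",
    "Cus dus": "FamilyC",
}

flagged = flag_family_switches(mappings, family_of)
```

Names with no family on record are skipped here; whether those should instead be reported for manual review is a separate decision.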
Sounds good. I'll think about adding an option to the code to flag synonyms that switch families during the "merge" action. But no need to wait on that for our GBIF query.
The one outstanding issue is that this whole process doesn't deal with fuzzy matching because that would require complete access to both sides of the lookup. So I am ignoring that for GBIF records. But for trait databases we can add that step and I'll start on some clean tools for that.
I agree that we should ignore any fuzzy matching for GBIF (but we could always do the GBIF query ourselves, which would enable us to do this).
I think the code takes a bit to run, so should we get these synonyms sent off so we can get cleaned GBIF records off to Sally soon?
Happy Sunday (and Easter if it pertains)!
Beth
So folks, I think we are good to go on this. @dmcglinn : do you need more info from me before we submit the GBIF query using this name list?
Hi folks, just an update:
I am working on this and I have the full GBIF Plantae download. The reason: having the full GBIF names makes fuzzy matching possible -- synonym expansion helps, but we need all the GBIF names to match against misspellings, dropped hyphenated parts, etc. I'll push some results soon. I'm working with Will Pearse on the fuzzy matching problem itself in another repo, which is why this repo has had no commits or updates.
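To show why the full name list matters, here is a rough sketch of fuzzy matching query names against a complete reference list. It uses `difflib.get_close_matches` from the Python standard library as a stand-in; it is not the approach being developed in the other repo, and the function name, the cutoff value, and the toy names are assumptions:

```python
import difflib


def fuzzy_match(query_names, reference_names, cutoff=0.9):
    """Map each query name to an exact or closest reference name, or None."""
    reference = sorted(set(reference_names))
    reference_set = set(reference)
    matches = {}
    for name in query_names:
        if name in reference_set:
            matches[name] = name  # exact match, no fuzzy lookup needed
            continue
        close = difflib.get_close_matches(name, reference, n=1, cutoff=cutoff)
        matches[name] = close[0] if close else None
    return matches


# Toy lists for illustration; the real reference would be the full GBIF names.
gbif_names = ["Quercus alba", "Acer rubrum"]
queries = ["Quercus albba", "Acer rubrum", "Pinus ponderosa"]
result = fuzzy_match(queries, gbif_names)
```

A linear scan like this would be slow against the whole Plantae name list, so a real tool would probably need indexing or blocking by genus first.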
Current workflow updated, see #8
Hi folks,
So last night I took time to write expansion/merging code to use the TPL1.1 data @ejforrestel provided.
-Dylan