Fireandplants / plant_gbif

This repository is for data and scripts related to plant species distribution across the globe using the Global Biodiversity Information Facility (GBIF) dataset.

New taxa list based on TPL1.1 synonym expansion #7

Closed dschwilk closed 10 years ago

dschwilk commented 10 years ago

Hi folks,

So last night I took time to write expansion/merging code to use the TPL1.1 data @ejforrestel provided.

  1. In TPL, the statement "there is ONE accepted name for each unique name" is only true when one includes authors. When authorship is ignored, some synonyms match multiple accepted names (e.g. "Amaryllis dubia"). This made it tricky to reverse (merge) an expanded list perfectly. The code now handles that problem. The solution: traverse those multiple accepted names and treat every synonym of each of them as a sister synonym of the others. See the code at https://github.com/Fireandplants/plant_gbif/blob/master/scripts/synonymize.py and 15f2dab. I wrote a general script that should be of use to others working with this TPL1.1 data; the expansion and merging work with any user-specified "canonical names" list. If anyone disagrees with the logic of this algorithm, let me know. "expand_names" and "merge_names" (followed by removing duplicates) are inverse functions.
  2. Running an expansion on our Tank tree names (https://github.com/Fireandplants/bigphylo/blob/master/species/big-phylo-leaves.txt), I produced a new taxa list: https://github.com/Fireandplants/plant_gbif/blob/master/query_names/taxa_for_bigphylo_gbif_query_04_08_14.txt. It expanded to 228168 taxa, about 7 output names per input on average. The commands to produce this are at https://github.com/Fireandplants/plant_gbif/blob/master/scripts/expand-tanknames.sh. See also the commented-out bash one-liner in that shell script, which demonstrates that expand and merge are inverse functions. (Well, that is not strictly true in all cases: some possible "canonical names" lists include synonym combinations that make expand lose information that cannot be recovered. The Tank tree tips are not a problem.)
  3. Beth's theplantlist1.1 folder and my synonymize.py could probably be moved to their own repository eventually, with further fuzzy matching and name scrubbing utilities created there, since this is a more general set of utilities and not just for GBIF use. But I have left them in place for now.
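
To make the expand/merge idea in the list above concrete, here is a minimal sketch. It is not the code in synonymize.py: the name pairs are invented, the function bodies are my own, and it groups names by connected components, which may be broader than the traversal the script actually performs. Only the names `expand_names` and `merge_names` come from the comment above.

```python
from collections import defaultdict

# Hypothetical (synonym, accepted) pairs with authorship stripped.
# These rows are invented for illustration; they are not TPL1.1 data.
PAIRS = [
    ("Aus albus", "Aus niger"),
    ("Aus pallidus", "Aus niger"),
    ("Aus albus", "Bus ruber"),   # one synonym, two accepted names
    ("Bus roseus", "Bus ruber"),
    ("Cus minor", "Cus major"),
]


def build_groups(pairs):
    """Map each name to the set of names linked to it through shared accepteds."""
    neighbors = defaultdict(set)
    for synonym, accepted in pairs:
        neighbors[synonym].add(accepted)
        neighbors[accepted].add(synonym)
    groups = {}
    for start in neighbors:
        if start in groups:
            continue
        group, stack = set(), [start]
        while stack:  # depth-first walk over linked names
            name = stack.pop()
            if name not in group:
                group.add(name)
                stack.extend(neighbors[name] - group)
        for name in group:
            groups[name] = group
    return groups


def expand_names(canonical, groups):
    """Return (output_name, canonical_name) rows for every name in each group."""
    rows = []
    for name in canonical:
        for member in sorted(groups.get(name, {name})):
            rows.append((member, name))
    return rows


def merge_names(names, canonical, groups):
    """Map each name back to a canonical name in its group, if there is one."""
    canon = set(canonical)
    merged = []
    for name in names:
        hits = groups.get(name, {name}) & canon
        merged.append(min(hits) if hits else name)
    return merged


def unique(names):
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(names))


GROUPS = build_groups(PAIRS)
canonical = ["Aus niger", "Cus major", "Dus solus"]
expanded = [member for member, _ in expand_names(canonical, GROUPS)]
round_trip = unique(merge_names(expanded, canonical, GROUPS))
print(len(expanded), round_trip == canonical)
```

With this toy input, expanding and then merging (plus de-duplication) returns the original canonical list, which is the round-trip property the one-liner in expand-tanknames.sh is described as checking.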

-Dylan

dschwilk commented 10 years ago

There are big differences between this new list and the Zanne taxon list, https://github.com/Fireandplants/plant_gbif/blob/master/query_names/Zanne_GBIF_taxa_list_Aug2011.txt

See the code in 53248d9, pasted below:

  ## Check this list against the new one created using synonymize.py and TPL1.1
  ## data. `all_sp` is not defined in this excerpt; see 53248d9.

  tanknames.expanded <- read.table("../query_names/taxa_for_bigphylo_gbif_query_04_08_14.txt", sep = ",")
  tanknames.expanded <- as.character(tanknames.expanded[ , 1])

  ## Names in all_sp that are absent from the new expanded list
  not.in.new <- all_sp[!(all_sp %in% tanknames.expanded)]
  length(not.in.new)

  ## What? There are a lot of taxa in the Zanne list that are not in the Tank
  ## tree and don't have synonyms according to TPL1.1. Perhaps because the Zanne
  ## et al. list included potential gender changes? E.g. "Abarema adenophora" is
  ## in the Zanne list, is not in the Tank tree, and is not a synonym or "sister
  ## synonym" of anything in the Tank tree. It is an accepted name, as is
  ## "Abarema adenophorum" -- perhaps the Zanne list included this as a synonym
  ## based on matching? I'm not sure.

Edit: OK, nope, no synonym of A. adenophorum should be there either according to TPL1.1. Hm, if it were a couple of thousand names I would not worry, but I'm puzzled by the huge (50k!) discrepancy. It suggests no concordance between my use of TPL1.1 and the earlier use of TPL1.0 in Zanne et al., and I want to understand it before trusting my expanded names list, although I have checked against the synonymy table Beth provided and done some manual lookups on TPL to test things.

AmyZanne commented 10 years ago

Thanks Dylan. Treating things as sister, as in your first email, mostly makes sense. However, sometimes mappings can change families, etc. I wonder whether it makes sense to treat names as sister only when you remain in close tree space?

Dan and I were trying to figure out the differences in the lists. I think we sent GBIF a much bigger list than what we were able to match to GenBank. We sent to GBIF anything we could get records for and then any synonyms we could find. The matching with GenBank came later. My guess is that this explains the discrepancies you found. For instance, I think we sent GBIF ~85K names but only had 32K names in the Tank tree. Does this make sense, or do you think there is more to it than that?

Best, Amy

Dr. Amy Zanne
Department of Biological Sciences
2023 G St. NW
George Washington University
Washington, DC 20052

Office: 352 Lisner Hall
Office Phone: (202) 994-8751
Lab: 409 Bell Hall
Lab Phone: (202) 994-9613
Fax: (202) 994-6100
Website: http://www.phylodiversity.net/azanne/

dschwilk commented 10 years ago

Hi Amy,

Regarding the Zanne list and my script based on TPL1.1: That makes a lot of sense to me! I kept finding names in the list you sent whose origin I could not figure out (at least according to TPL1.1). We are working in the other direction on this bigphylo project: starting with the Tank tree names.

Regarding "sister synonyms": I need to check that and see. Yes, perhaps my algorithm "over-matches". But at least using the Tank tree, it never expands out to another existing name, so I don't think we can check tree distance. If it did, it would not be perfectly reversible. In theory, one could hand my program a names list that was not reversible (expand then merge), but the Tank et al. tree works: the expand and merge steps result in identical lists. Each name expands out, but the resulting synonym list never contains another Tank tree canonical name. We could restrict mappings based on taxonomy, say never move families?
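
The reversibility condition described above (no expanded synonym set contains a second canonical name) can be checked directly. The sketch below is only an illustration: the synonym groups and the `collisions` helper are hypothetical and are not taken from synonymize.py.

```python
# Hypothetical synonym groups: each set holds names treated as synonyms or
# "sister synonyms" of one another. Invented for illustration only.
SYNONYM_GROUPS = [
    {"Aus niger", "Aus albus", "Bus ruber"},
    {"Cus major", "Cus minor"},
]


def collisions(canonical, synonym_groups):
    """Return the sets of canonical names that share a synonym group.

    If any group holds more than one canonical name, expanding both names
    yields overlapping output, and merge cannot tell which canonical name
    an expanded name came from.
    """
    canon = set(canonical)
    clashes = []
    for group in synonym_groups:
        hits = sorted(group & canon)
        if len(hits) > 1:
            clashes.append(hits)
    return clashes


# A list with one canonical name per group round-trips cleanly.
safe = collisions(["Aus niger", "Cus major"], SYNONYM_GROUPS)

# Adding "Bus ruber" puts two canonical names in the same group.
unsafe = collisions(["Aus niger", "Bus ruber", "Cus major"], SYNONYM_GROUPS)

print(safe)
print(unsafe)
```

An empty result corresponds to the Tank tree case described above; a non-empty result identifies the names for which expand would lose information.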

-Dylan

AmyZanne commented 10 years ago

Hi Dylan, Great, glad that makes sense. It took us a while to winnow down to a stable list of names and even to settle on the database we could map to. We didn't have access to The Plant List until later, so we had to start with IPNI.

I don't think we should be super fussed about under/over matching. At the scale we are working, it's all pretty gross, but we want to be consistent. It would be good to flag and investigate those sister synonyms that map outside of, say, a family (genera are too unstable anyway) and then decide if we believe it. My guess is there will be examples but not a ton. For instance, if we are working in the area of the flacourts or euphorbs, my guess is we will pick up family switches, since those were a mess until a decade or so ago. Does that make sense?
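
As a rough sketch of the flagging step suggested above, the snippet below compares the family of each expanded name with the family of its canonical name. Everything here is hypothetical: the family lookup, the rows, and the `flag_family_switches` helper are invented, and a real run would need family assignments from a source such as the TPL1.1 table.

```python
# Hypothetical name -> family lookup. All entries are invented.
FAMILY = {
    "Aus niger": "Aaceae",
    "Aus albus": "Aaceae",
    "Bus ruber": "Baceae",
    "Cus major": "Caceae",
    "Cus minor": "Caceae",
}

# (expanded_name, canonical_name) rows as an expansion step might emit them.
EXPANDED = [
    ("Aus albus", "Aus niger"),
    ("Bus ruber", "Aus niger"),
    ("Cus minor", "Cus major"),
    ("Dus solus", "Cus major"),  # no family on record for "Dus solus"
]


def flag_family_switches(rows, family):
    """Split out rows whose expanded and canonical names differ in family.

    Rows with a missing family are returned separately instead of guessed.
    """
    switched, unknown = [], []
    for name, canonical in rows:
        fam_name, fam_canon = family.get(name), family.get(canonical)
        if fam_name is None or fam_canon is None:
            unknown.append((name, canonical))
        elif fam_name != fam_canon:
            switched.append((name, canonical, fam_name, fam_canon))
    return switched, unknown


switched, unknown = flag_family_switches(EXPANDED, FAMILY)
for row in switched:
    print("family switch:", row)
for row in unknown:
    print("family unknown:", row)
```

Reporting the flagged rows rather than dropping them keeps the decision with a person, which matches the "flag, investigate, then decide" approach proposed above.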

Best, Amy

dschwilk commented 10 years ago

Sounds good. I'll think about adding an option to the code to flag synonyms that switch families during the "merge" action. But no need to wait on that for our GBIF query.

The one outstanding issue is that this whole process doesn't deal with fuzzy matching because that would require complete access to both sides of the lookup. So I am ignoring that for GBIF records. But for trait databases we can add that step and I'll start on some clean tools for that.

ejforrestel commented 10 years ago

I agree that we should ignore any fuzzy matching for GBIF (but we could always do the GBIF query ourselves which would enable us to do this).

I think the code takes a bit to run, so should we get these synonyms sent off so we can get cleaned GBIF records off to Sally soon?

Happy Sunday (and Easter if it pertains)!

Beth

dschwilk commented 10 years ago

So folks, I think we are good to go on this. @dmcglinn : do you need more info from me before we submit the GBIF query using this name list?

dschwilk commented 10 years ago

Hi folks, just an update:

I am working on this and I have the full GBIF Plantae download. The reason: having the full set of GBIF names makes fuzzy matching possible. Synonym expansion helps, but we need all the GBIF names to match against misspellings, dropped hyphenated parts, etc. I'll push some results soon. I'm working with Will Pearse on the fuzzy matching problem itself in another repo, which is why this repo has had no commits or updates.
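
To illustrate why the full list of GBIF names matters, here is a minimal fuzzy matching sketch using Python's standard `difflib`. It is not the approach being developed in the other repo; the candidate list, the `fuzzy_match` helper, and the 0.85 cutoff are all assumptions chosen for the example.

```python
import difflib

# Hypothetical stand-in for the full list of names in a GBIF download.
GBIF_NAMES = [
    "Abarema adenophora",
    "Quercus alba",
    "Quercus rubra",
    "Arctostaphylos uva-ursi",
]


def fuzzy_match(query, candidates, cutoff=0.85):
    """Return the candidate closest to `query`, or None if nothing is close.

    Uses difflib's similarity ratio. The cutoff is an arbitrary starting
    point, not a tuned value.
    """
    matches = difflib.get_close_matches(query, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


misspelled = fuzzy_match("Quercus albba", GBIF_NAMES)            # extra letter
dropped_hyphen = fuzzy_match("Arctostaphylos uva ursi", GBIF_NAMES)
absent = fuzzy_match("Pinus ponderosa", GBIF_NAMES)              # not in list

print(misspelled, dropped_hyphen, absent, sep="\n")
```

The match is only possible because the candidate side of the lookup is fully available, which is the point made above about needing all the GBIF names rather than only an expanded query list.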

dschwilk commented 10 years ago

Current workflow updated, see #8