Fireandplants / plant_gbif

This repository is for data and scripts related to plant species distribution across the globe using the Global Biodiversity Information Facility (GBIF) dataset.

fire evol project needs new gbif data #26

Closed dschwilk closed 7 years ago

dschwilk commented 7 years ago

I have time to run a new gbif extraction. If I have enough time, I'll make a new synonym table; if not, I'll use the old lookup table but catch any new records. I'm downloading the gbif Plantae data (~180 million records) now.

dschwilk commented 7 years ago

This is almost complete. Steps accomplished:

  1. Download new gbif data dump (180 million records)
  2. Extract all names
  3. Fuzzy match against expanded Tank et al tree names
  4. Manually clean up fuzzy matches (all day today until just now :))
  5. Extract gbif occurrences based on new names list (running now)
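
Step 5 above can be sketched roughly as follows. This is only an illustration, not the script used for this run: the function name `extract_occurrences`, the tab-delimited layout, the `scientificName` column, and the added `canonical_name` column are all assumptions. The idea is to stream the dump one record at a time and keep only records whose name appears in the lookup table of synonyms and misspellings.

```python
import csv
import io


def extract_occurrences(infile, outfile, lookup, name_field="scientificName"):
    """Stream tab-delimited records and keep those whose name is in `lookup`.

    `lookup` maps every known name (canonical, synonym, or misspelling)
    to its canonical name. Returns (records_scanned, matches_found).
    """
    reader = csv.DictReader(infile, delimiter="\t")
    writer = csv.DictWriter(
        outfile,
        fieldnames=list(reader.fieldnames) + ["canonical_name"],
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",  # skip stray extra fields on malformed rows
    )
    writer.writeheader()
    scanned = matched = 0
    for row in reader:
        scanned += 1
        # Short rows yield None for missing fields, hence the `or ""`.
        canonical = lookup.get((row[name_field] or "").strip())
        if canonical is None:
            continue
        row["canonical_name"] = canonical
        writer.writerow(row)
        matched += 1
    return scanned, matched


# Tiny made-up example; the real dump has ~180 million records.
lookup = {
    "Quercus alba": "Quercus alba",
    "Quercus abla": "Quercus alba",  # hypothetical misspelling
}
sample = io.StringIO(
    "gbifID\tscientificName\n"
    "1\tQuercus alba\n"
    "2\tQuercus abla\n"
    "3\tPinus ponderosa\n"
)
out = io.StringIO()
scanned, matched = extract_occurrences(sample, out, lookup)
print(f"Total records scanned = {scanned} Total matches found = {matched}")
```

Because only one row is held in memory at a time, the same loop should scale to a very large file, with run time dominated by reading the dump.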

Note that the new gbif data dump actually has fewer unique names (426,632 vs 448,585), although many more records. I'm not sure why. I am more confident in the fuzzy matching this time: I added a few data checking steps to avoid false matches and was a bit more conservative overall. We end up with 48,767 matches, of which 3,154 are fuzzy matches. This is fewer than the 65,366 we had before, so we'll see what happens to the number of records. Fewer names is not necessarily a bad thing. We only have 27K canonical names, so most of the matches are synonyms and about 3K are misspellings. Fewer names is most likely a consequence of gbif data being cleaned and names updated to accepted names in the last couple of years. Now, if we hit fewer occurrences, then something is wrong.
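
As an illustration of what "conservative" fuzzy matching with extra checks might look like, here is a minimal sketch using Python's standard `difflib`. The specific checks (a high similarity cutoff, rejecting ambiguous candidates, requiring the same number of words and the same first letter of the genus) and the name `conservative_fuzzy_match` are assumptions for the example; the thread does not say which checks were actually added.

```python
import difflib


def conservative_fuzzy_match(name, canonical_names, cutoff=0.9):
    """Return the single close match for `name`, or None if any check fails.

    All checks here are hypothetical examples of ways to avoid false matches.
    """
    candidates = difflib.get_close_matches(name, canonical_names, n=2, cutoff=cutoff)
    if len(candidates) != 1:
        # No candidate above the cutoff, or more than one (ambiguous).
        return None
    best = candidates[0]
    name_parts, best_parts = name.split(), best.split()
    # Check: same number of words (e.g. genus + epithet on both sides).
    if len(name_parts) != len(best_parts):
        return None
    # Check: genus must start with the same letter.
    if name_parts[0][0].lower() != best_parts[0][0].lower():
        return None
    return best


tree_names = ["Quercus alba", "Quercus robur", "Pinus ponderosa"]

# A transposition typo is close enough to be accepted.
print(conservative_fuzzy_match("Quercus abla", tree_names))
# A different species in the same genus falls below the cutoff.
print(conservative_fuzzy_match("Quercus rubra", tree_names))
```

Raising the cutoff or adding checks trades recall for precision, which fits with ending up with fewer matched names than before.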

dschwilk commented 7 years ago

DONE!

Total records scanned = 180344517
Total matches found = 107705783
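
For a quick sense of scale, the share of scanned records that matched a name can be computed directly from the two totals above (the variable names are just for illustration):

```python
scanned = 180_344_517  # total records scanned
matched = 107_705_783  # total matches found

fraction = matched / scanned
print(f"{fraction:.1%} of scanned records matched a name in the list")
```

That works out to a bit under 60% of the Plantae records.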

@dmcglinn, I'll compress this file and get it to you somehow. Maybe Dropbox? Another server?

dmcglinn commented 7 years ago

awesome great job!

(In reply to Dylan Schwilk's comment of Thu, Apr 20, 2017 at 3:43 PM.)


dschwilk commented 7 years ago

Done. Now these data can be cleaned according to revised cleaning steps (#22).