dschwilk closed 7 years ago
This is almost complete. Steps accomplished:
Note that the new GBIF data dump actually has fewer unique names (426,632 vs. 448,585), although many more records. I'm not sure why. I am more confident in the fuzzy matching this time: I added a few data checking steps to avoid false matches, so it is a bit more conservative. We end up with 48,767 matches, of which 3,154 are fuzzy matches. This is fewer than the 65,366 we had before, so we'll see what happens to the number of records. Fewer names is not necessarily a bad thing. We only have 27K canonical names, so most matches are synonyms and 3K are misspellings. Fewer names is most likely a consequence of the GBIF data being cleaned and names being updated to accepted names in the last couple of years. Now, if we hit fewer occurrences, then something is wrong.
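For anyone following along, here is a minimal sketch of what "conservative" fuzzy matching with extra checks could look like. This is not the code used for this extraction: the function name `match_name`, the 0.92 cutoff, and the two checks (same genus, exactly one close candidate) are all assumptions chosen for illustration, using only Python's standard `difflib`.

```python
import difflib


def match_name(name, canonical, cutoff=0.92):
    """Match one name against a set of canonical names.

    Returns (matched_name, kind), where kind is "exact", "fuzzy", or None.
    The cutoff and both checks below are illustrative assumptions.
    """
    if name in canonical:
        return name, "exact"

    parts = name.split()
    if not parts:
        return None, None

    # Check 1 (assumed): only compare against names in the same genus,
    # so a typo can never jump to a different genus.
    genus = parts[0]
    candidates = [c for c in canonical if c.split()[0] == genus]

    # Ask for up to two close candidates so ambiguity can be detected.
    close = difflib.get_close_matches(name, candidates, n=2, cutoff=cutoff)

    # Check 2 (assumed): accept a fuzzy match only if it is unambiguous.
    if len(close) == 1:
        return close[0], "fuzzy"
    return None, None


canonical = {"Quercus alba", "Quercus rubra", "Pinus ponderosa"}
for query in ["Quercus alba", "Pinus ponderossa", "Qercus alba"]:
    print(query, "->", match_name(query, canonical))
```

With checks like these a misspelled epithet can still match, but a misspelled genus is rejected outright, which trades some recall for fewer false matches.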
DONE!
Total records scanned = 180344517
Total matches found = 107705783
@dmcglinn , I'll compress this file and get it to you somehow. Maybe dropbox? Other server?
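The scan totals above work out to roughly 60% of records matching. A rough sketch of how such a streaming scan could be counted is below. It is not the actual extraction script: the function `scan_records`, the tab-delimited layout, and the `scientificName` column are assumptions, and the lookup table is a plain dict from name variant to canonical name.

```python
import csv
import io


def scan_records(lines, lookup, name_field="scientificName"):
    """Stream tab-delimited records and count how many names are in lookup.

    `lines` is any iterable of text lines (for example an open file), so
    the whole dump never has to fit in memory. Column name is assumed.
    """
    scanned = 0
    matched = 0
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        scanned += 1
        name = row.get(name_field)
        if name is not None and name in lookup:
            matched += 1
    return scanned, matched


# Tiny made-up sample standing in for the real dump.
sample = io.StringIO(
    "gbifID\tscientificName\n"
    "1\tQuercus alba\n"
    "2\tUnknown sp\n"
    "3\tQuercus albaa\n"
)
lookup = {"Quercus alba": "Quercus alba", "Quercus albaa": "Quercus alba"}
scanned, matched = scan_records(sample, lookup)
print(f"Total records scanned = {scanned} Total matches found = {matched}")
```

Passing an open file handle instead of the `io.StringIO` sample keeps memory use flat regardless of the number of records.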
awesome great job!
(Sent by email in reply to Dylan Schwilk's comment of Thu, Apr 20, 2017 at 3:43 PM.)
-- Daniel J. McGlinn
Done. Now these data can be cleaned according to revised cleaning steps (#22).
I have time to run a new gbif extraction. If I have time, I'll make a new synonym table, if not I'll use the old lookup table but catch any new records. I'm downloading gbif plantae data (~180 million records) now.