andrewxhill / MOL

The Map of Life
mol.colorado.edu/

Duplicate (or two separate) ecoregions for a single taxon #82

Closed by robgur 12 years ago

robgur commented 13 years ago

I have found a couple of cases of duplicate (or two separate sets of) ecoregions for a single taxon. One example is a search on "Melanoptila glabrirostris".

eightysteele commented 13 years ago

Totally. Assigned the issue to myself and changed the milestone to Demo.

eightysteele commented 13 years ago

Update: I'm running a MapReduce job over MasterSearchIndex and have discovered over 1000 duplicates so far. I'm also finding entities whose parent references an OccurrenceSet that is not in the datastore. I'll know more after the job is complete.
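For readers following along, the two checks in that job can be sketched in plain Python. This is only an illustration, not the actual MapReduce code: `audit_index` is a hypothetical helper, entities are modeled as `(entity_key, parent_key, term)` tuples, and it assumes that two index entities count as duplicates when they share the same parent key and search term.

```python
from collections import defaultdict


def audit_index(entities, known_parent_keys):
    """Scan index entities for duplicates and orphaned parents.

    entities: iterable of (entity_key, parent_key, term) tuples
        (a hypothetical in-memory stand-in for MasterSearchIndex rows).
    known_parent_keys: set of OccurrenceSet keys that exist in the datastore.
    """
    by_identity = defaultdict(list)
    orphans = []
    for entity_key, parent_key, term in entities:
        # Assumption: (parent, term) identifies a logical index entry.
        by_identity[(parent_key, term)].append(entity_key)
        # An orphan points at a parent that is not in the datastore.
        if parent_key not in known_parent_keys:
            orphans.append(entity_key)
    # Keep only identities that occur more than once.
    duplicates = {
        ident: keys for ident, keys in by_identity.items() if len(keys) > 1
    }
    return duplicates, orphans
```

Calling `audit_index(rows, set_of_occurrence_set_keys)` returns a dict of duplicated identities mapped to their entity keys, plus a list of orphaned entity keys.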

@tucotuco: As a sanity check, can you see whether the bulkloader checks for an existing MasterSearchIndex (MSI) entity before putting a new one? That might be part of what's happening here, and it should be a fairly painless fix. Maybe?
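A minimal sketch of that kind of check, assuming nothing about how the bulkloader is actually written: a plain dict stands in for the datastore, and `index_key` and `load_rows` are hypothetical names. The idea is to derive the key deterministically from the parent and term, so that loading the same row twice cannot create a second entity.

```python
def index_key(parent_key, term):
    """Build a deterministic key so the same (parent, term) maps to one entity."""
    return "{}|{}".format(parent_key, term.strip().lower())


def load_rows(store, rows):
    """Put each row only if no entity with the same key already exists.

    store: dict acting as a stand-in for the datastore (an assumption).
    rows: iterable of (parent_key, term) tuples.
    Returns (added, skipped) counts.
    """
    added = skipped = 0
    for parent_key, term in rows:
        key = index_key(parent_key, term)
        if key in store:
            # Existing entity found: skip instead of writing a duplicate.
            skipped += 1
            continue
        store[key] = {"parent": parent_key, "term": term}
        added += 1
    return added, skipped
```

Against a real datastore, the existence check and the put would need to happen together (for example in a transaction) to be safe when several loaders run at once.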

Changed issue labels to Priority-High and Status-Started.

eightysteele commented 13 years ago

Of 343,075 MasterSearchIndex entities, 1,189 are duplicates and 27,158 have non-existent OccurrenceSet parents. Here's the full list:

https://gist.github.com/1116099

Will update when I know more.

eightysteele commented 13 years ago

A couple of options here. We can write code to fix the duplicates and parent key errors, or we can harden the bulkloading code and process, then clear and reload the datastore where needed. I think the latter is the better solution, especially as we ramp up adding new layers to MoL.