fasterthanlime / cs322

CS-323 project

Removing duplicates #34

Closed greut closed 12 years ago

greut commented 12 years ago

In the import script, many players are created multiple times.

ABDULKA01    Abdul-jabbar    Kareem
ABDULKA01    Alcindor           Lew

DANTMI01    Mike D'antoni
DANTMI01    Mike DAntoni

VANBRJA01 Jan van Breda Kloff
VANBRJA01 Jan van breda kolff

…

Find a way to clean that up. ILKID must be unique, which may require some cleanup (DECODE could be useful here for case-by-case fixes).
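A minimal Ruby sketch of the kind of cleanup asked for here. It assumes the import reads each player as a hash with the hypothetical keys `:ilkid`, `:last_name` and `:first_name`; the real import.rake and its column names are not shown in this thread. It keeps the first row seen for each ILKID and returns the skipped rows so the alternate spellings can be reviewed rather than silently dropped.

```ruby
# Hypothetical rows as the import might read them; the key names are
# assumptions for illustration, not taken from import.rake.
ROWS = [
  { ilkid: "ABDULKA01", last_name: "Abdul-jabbar", first_name: "Kareem" },
  { ilkid: "ABDULKA01", last_name: "Alcindor",     first_name: "Lew" },
  { ilkid: "DANTMI01",  last_name: "D'antoni",     first_name: "Mike" },
  { ilkid: "DANTMI01",  last_name: "DAntoni",      first_name: "Mike" },
].freeze

# Keep the first row for each ILKID. The rows that were skipped are
# returned too, so name variants can be inspected afterwards.
def dedupe_by_ilkid(rows)
  seen = {}
  skipped = []
  rows.each do |row|
    key = row[:ilkid].to_s.strip.upcase
    if seen.key?(key)
      skipped << row
    else
      seen[key] = row
    end
  end
  [seen.values, skipped]
end

kept, skipped = dedupe_by_ilkid(ROWS)
puts "kept #{kept.length} players, skipped #{skipped.length} duplicate rows"
```

Which spelling wins is decided purely by row order in this sketch; a case-by-case override table would be needed to prefer, say, "Abdul-jabbar" over "Alcindor" regardless of order.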

SeZuo commented 12 years ago

In fact, ILKIDs are unique. It seems that all players who made it past the draft have one, and if there were an "Abdula Kareem", the resulting ILKID would be ABDULKA02, since Abdul-jabbar Kareem is #01.

We only need to generate ILKIDs for drafted players without one.

Tell me if you see a case proving otherwise.

greut commented 12 years ago

> We only need to generate ILKIDs for drafted players without one.

We won't! Not duplicating ILKIDs should be enough. Ping me or email me if you need help with import.rake.

greut commented 12 years ago

If it's beyond what you can do, tell me soon enough. Cheers,

SeZuo commented 12 years ago

The thing is, we'll miss 5217 entries by just ignoring the drafted ones with a null ILKID. That is a lot of data to lose.

If we import them with their null ILKIDs, we have no real way to compare them, which introduces new duplicates (most of the ones I spotted in the first place).

I'm trying to use the Levenshtein distance to better avoid duplicates, but it doesn't seem possible to define a function inside import.rake.

Where did you define things such as people_seq?
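For reference, a .rake file is ordinary Ruby, so a helper can in principle be defined there as a module method. The sketch below is a plain-Ruby Levenshtein distance; `NameMatch` and `normalize` are hypothetical names chosen for illustration, and nothing here reflects how the project actually solved this.

```ruby
# NameMatch is a hypothetical module name; it could sit at the top of a
# .rake file or in a separate file that the rake task requires.
module NameMatch
  # Number of single-character insertions, deletions and substitutions
  # needed to turn a into b, computed one row of the table at a time.
  def self.levenshtein(a, b)
    a = a.to_s
    b = b.to_s
    return b.length if a.empty?
    return a.length if b.empty?

    prev = (0..b.length).to_a
    a.each_char.with_index(1) do |ca, i|
      curr = [i]
      b.each_char.with_index(1) do |cb, j|
        cost = ca == cb ? 0 : 1
        curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
      end
      prev = curr
    end
    prev.last
  end

  # Lowercase the name and drop everything that is not a letter, so that
  # "D'antoni" and "DAntoni" compare as equal.
  def self.normalize(name)
    name.to_s.downcase.gsub(/[^a-z]/, "")
  end
end

a = NameMatch.normalize("van Breda Kloff")
b = NameMatch.normalize("van breda kolff")
puts NameMatch.levenshtein(a, b)
```

Normalizing first means differences in case and punctuation cost nothing, so only real spelling differences (such as "Kloff" versus "kolff") contribute to the distance.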

SeZuo commented 12 years ago

OK, I've found an Oracle function to help me out (http://psoug.org/reference/utl_match.html), but it's too CPU-intensive: the query has not completed in the 15 minutes it has been running on my machine so far.

I'm open to suggestions. In the meantime, I'll commit the ILKID-duplicate-free import.rake.
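One common way to reduce the cost of pairwise fuzzy matching is blocking: compare only names that share a cheap key, instead of every name against every other. Comparing 5217 entries pairwise among themselves would already mean about 13.6 million distance computations. The sketch below is only an illustration of the idea, not what was done in this project; `candidate_pairs` and the prefix length of 3 are assumptions.

```ruby
# Group names by a short normalized prefix and return only the pairs
# that fall in the same group. The prefix length is an arbitrary
# assumption that would need tuning against the real data.
def candidate_pairs(names, prefix_len = 3)
  blocks = names.group_by { |n| n.downcase.gsub(/[^a-z]/, "")[0, prefix_len] }
  blocks.values.flat_map { |group| group.combination(2).to_a }
end

NAMES = [
  "D'antoni", "DAntoni",
  "Abdul-jabbar", "Alcindor",
  "van Breda Kloff", "van breda kolff",
].freeze

pairs = candidate_pairs(NAMES)
puts "#{pairs.length} candidate pairs instead of #{NAMES.combination(2).count}"
# prints "2 candidate pairs instead of 15"
```

The trade-off is that names whose prefixes differ are never compared, so "Abdul-jabbar" and "Alcindor" would not be paired here; that kind of duplicate can only be caught through a shared ILKID or a manual fix.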

greut commented 12 years ago

Okay, I think I'll do it.