MScientistCM / CMDuplicatesFinder

A tool to find duplicate entries in a cm0102 database.

Duplicate Finder 0.4 Beta, feedback from Andrea #9

Closed MScientistCM closed 5 years ago

MScientistCM commented 5 years ago

I'm having a look at the new beta of the duplicate finder.

I ran it against v198, because I have a full run of my own duplicator available for that version and can compare the outputs.

My first impression is positive: the projection is about 2000 duplicates. Despite the Levenshtein implementation, speed is OK, though it looks like the tool no longer checks the nation field (or does it?).
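For context, Levenshtein distance counts the single-character insertions, deletions and substitutions needed to turn one string into another, which is why it is noticeably more expensive than an exact name match. The following is only a minimal sketch of the technique, not the tool's actual code: the function names, the lowercase comparison and the `max_distance=2` threshold are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def is_probable_duplicate(name_a: str, name_b: str, max_distance: int = 2) -> bool:
    """Hypothetical helper: flag two names as duplicates if they are close enough."""
    return levenshtein(name_a.lower(), name_b.lower()) <= max_distance
```

Each comparison costs time proportional to the product of the two name lengths, so comparing every pair of entries in a large database adds up quickly.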

Unfortunately, after 2 hours and 88.65% progress, at 14:19, the tool stopped updating the duplicates list (both on screen and in the file) and stopped updating the pause data. The CPU load stayed at maximum, though, apparently in a loop. At 14:44 I decided to stop the program via the close (X) button.

I started it again, but after resuming it got stuck in the loop again, so I closed it a second time.

One hour later I discovered that neither instance had actually closed: both were still running at full load (in the Task Manager)!

MScientistCM commented 5 years ago

Thanks, Andrea, for your feedback. I've run it against the October 2018 db and it finished 100% without issues.

Can you send me exactly the same CSV file and pause file you used so I can easily debug it? Currently I have no idea what could cause it; maybe a special character in a name or something. You can upload them to this ticket if you prefer, provided the database isn't confidential.

About nation: the tool never checked this field. That will be implemented in a future version.

MScientistCM commented 5 years ago

I fixed this bug. Thanks, Andrea, for the great feedback. I tested with the pause file Andrea provided and it works fine now. The issue was an infinite loop caused by changing the iterator inside the loop.
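As a general illustration of this class of bug (not the actual fix, which isn't shown in this ticket), here is a minimal sketch of a loop that never modifies the collection it is iterating over. The function name `split_duplicates` and the `is_duplicate` callback are hypothetical.

```python
from typing import Callable, List, Tuple

# Hazardous pattern (shown only as a comment, do not run):
#     for record in records:
#         if looks_suspicious(record):
#             records.append(record)   # the list grows while being iterated,
#                                      # so the loop may never reach the end
#
# Safer pattern: read from the input and write to separate output lists.


def split_duplicates(
    records: List[str],
    is_duplicate: Callable[[str, str], bool],
) -> Tuple[List[str], List[str]]:
    """Return (kept, duplicates) without mutating `records` during iteration."""
    kept: List[str] = []
    duplicates: List[str] = []
    for record in records:  # `records` is only read here, never changed
        if any(is_duplicate(record, other) for other in kept):
            duplicates.append(record)
        else:
            kept.append(record)
    return kept, duplicates
```

For example, `split_duplicates(["Smith", "smith", "Jones"], lambda a, b: a.lower() == b.lower())` keeps `"Smith"` and `"Jones"` and reports `"smith"` as a duplicate, and the loop always terminates because its input length is fixed.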

Will be included when I release v0.5-beta.