We should do this. Could you post a short example CSV file that dedupe currently cannot handle?
@davechan said:
I'm new to GitHub, so I'm not sure how to post the CSV file to the repo. I've attached a CSV with Chinese names and known duplicates marked in the cluster column.
I was able to load the Unicode file by replacing the csv library with unicodecsv; however, the program then failed at blocking. I suppose we may also need another distance function for non-English words?
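A minimal sketch of that swap, assuming Python 2 (as dedupe then targeted) and a hypothetical file name; unicodecsv mirrors the stdlib csv interface but decodes fields to unicode strings:

```python
import unicodecsv

# unicodecsv exposes the same interface as the stdlib csv module,
# but decodes each field to a unicode string as it reads.
with open('chinese_names.csv', 'rb') as f:  # hypothetical file name
    reader = unicodecsv.DictReader(f, encoding='utf-8')
    records = list(reader)
```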
Thanks @davechan.
You are absolutely right. The current distance function for strings, a variant of the Levenshtein distance, is unlikely to work as well on a non-English alphabet, and really won't work well at all on ideographic systems. That's one reason we don't have good Unicode support right now.
The key to using distance functions with Mandarin is tokenisation. It's a non-trivial problem to solve, particularly across domains. Fortunately, lots of people are focused on this problem right now; see for reference LingPipe: http://alias-i.com/lingpipe/demos/tutorial/chineseTokens/read-me.html and Datasift: http://dev.datasift.com/csdl/tokenization-and-csdl-engine/mandarin-chunking
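For a concrete sense of what tokenisation produces, here is a quick sketch using jieba, an open-source Mandarin tokenizer (not mentioned above; used purely for illustration):

```python
# -*- coding: utf-8 -*-
import jieba  # open-source Mandarin tokenizer, for illustration only

# Chinese text does not mark word boundaries; the tokenizer infers them.
print(jieba.lcut(u"这是一个句子"))
# likely output: [u'这是', u'一个', u'句子']
```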
For the purposes of de-duplication, we don't actually need to understand the sentence structure or the meaning of the words. Can we simply take the Unicode string, run list() on it, and then run the distance function on the resulting characters? Would the Levenshtein distance treat each Unicode character as one character?
list(u"这是一个句子")
Yes, we can (or we can with some modest modifications to the code).
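A quick sketch bears this out (stdlib only; dedupe's real string metric is the Levenshtein variant mentioned above, not this plain version):

```python
# -*- coding: utf-8 -*-
# list() splits a unicode string into code points, so a character-level
# edit distance treats each Chinese character as one unit.
chars = list(u"这是一个句子")
print(len(chars))  # 6 -- one entry per character, not per UTF-8 byte

def levenshtein(a, b):
    """Plain edit distance over the characters of two unicode strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

print(levenshtein(u"这是一个句子", u"这是一句话"))  # counts character edits
```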
Reading Unicode data can be handled by replacing csv with unicodecsv, but the blocking function seems to choke on Unicode. Any ideas on how to make it work for non-ASCII input?
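One possible direction (a sketch only, not dedupe's actual predicate API): have blocking predicates operate on decoded unicode strings rather than raw bytes, so multi-byte UTF-8 characters are never split:

```python
# -*- coding: utf-8 -*-
# Sketch (Python 2): a character-bigram blocking key over unicode.
# Slicing a decoded unicode string keeps each Chinese character whole;
# slicing raw UTF-8 bytes can cut a character in half.
def bigram_keys(field):
    text = field if isinstance(field, unicode) else field.decode('utf-8')
    return set(text[i:i + 2] for i in range(len(text) - 1))

print(bigram_keys(u"这是一个句子"))  # keys: 这是, 是一, 一个, 个句, 句子
```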