dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.
https://docs.dedupe.io
MIT License

Handle unicode input #257

Closed · davechan closed this 10 years ago

davechan commented 10 years ago

Reading unicode data can be resolved by replacing csv with unicodecsv, but the blocking function seems to choke on unicode. Any ideas how to make it work for non-ASCII input?
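For context, this thread predates Python 3 becoming the default; there, the standard csv module reads unicode directly when the file is opened in text mode with an explicit encoding, so no third-party library is needed. A minimal sketch (the column names and sample Chinese names here are invented for illustration, and io.StringIO stands in for a real file opened with encoding='utf-8'):

```python
import csv
import io

# Assume a UTF-8 encoded CSV with a 'name' column of Chinese names
# and a 'cluster' column marking known duplicates.
data = u"name,cluster\n张伟,1\n张偉,1\n李娜,2\n"

# In Python 3: open('names.csv', encoding='utf-8') and pass the file
# object straight to csv.DictReader; StringIO simulates that here.
reader = csv.DictReader(io.StringIO(data))
rows = list(reader)

print(rows[0]["name"])
print(len(rows))
```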

fgregg commented 10 years ago

We should do this. Could you post a short example csv file that dedupe currently cannot handle?

fgregg commented 10 years ago

@davechan said:

I’m new to GitHub so I'm not sure how to post the csv file to the repo. I’ve attached a csv with Chinese names and known duplicates in the cluster column.

I was able to load the Unicode file by replacing csv library with unicodecsv, however the program failed at blocking. Also we may need another distance function for non-English words I suppose?

https://gist.github.com/fgregg/26852357cb851787eb54

fgregg commented 10 years ago

Thanks @davechan.

You are absolutely right. The current distance function for strings, a variant of the Levenshtein distance, is unlikely to work as well on a non-English alphabet, and really won't work well at all on ideographic systems. That's one reason why we don't have good support for unicode now.

hudgeon commented 10 years ago

The key to using distance functions with Mandarin is tokenisation. It's a non-trivial problem to solve, particularly across domains. Fortunately, lots of people are focused on this problem right now. See for reference LingPipe: http://alias-i.com/lingpipe/demos/tutorial/chineseTokens/read-me.html and Datasift: http://dev.datasift.com/csdl/tokenization-and-csdl-engine/mandarin-chunking
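A common baseline when no proper word segmenter is available is character n-gram tokenisation, which is much simpler than the segmentation LingPipe or Datasift perform. A minimal sketch (the function names and Jaccard comparison are illustrative, not part of dedupe's API):

```python
def char_ngrams(text, n=2):
    """Overlapping character n-grams.

    For Chinese, character bigrams are a standard crude proxy for
    words when no trained segmenter is available.
    """
    text = text.replace(" ", "")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def jaccard(a, b, n=2):
    """Set overlap of n-grams: 1.0 for identical token sets, 0.0 for disjoint."""
    ta, tb = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / float(len(ta | tb))


print(char_ngrams(u"这是一个句子"))
print(jaccard(u"这是一个句子", u"这是一个短句"))
```

Token-set measures like this sidestep word boundaries entirely, at the cost of ignoring token order.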

davechan commented 10 years ago

For the purposes of de-duplication, we don't actually need to understand the sentence structure or the meaning of the words. Can we simply take the unicode string, run list() on it, and then run the distance function on the resulting characters? Would the Levenshtein distance treat each unicode character as a single character?

list(u"这是一个句子")
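To illustrate the point: a textbook Levenshtein implementation operating on unicode strings does treat each CJK character as one unit, because Python unicode strings are sequences of characters rather than bytes. A minimal sketch (this is a generic dynamic-programming edit distance, not dedupe's actual distance function):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance over two sequences.

    Python unicode strings index by character, not byte, so each
    Chinese character contributes exactly one unit of distance.
    """
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]


print(list(u"这是一个句子"))
print(levenshtein(u"这是一个句子", u"这是一个短句"))  # 2
```

Note this only answers the character-counting question; whether per-character edit distance is a *meaningful* similarity measure for Chinese names is the separate tokenisation issue raised above.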


fgregg commented 10 years ago

Yes, we can (or we can with some modest modifications to the code).