anhaidgroup / py_stringmatching

A comprehensive and scalable set of string tokenizers and similarity measures in Python
https://sites.google.com/site/anhaidgroup/projects/py_stringmatching
BSD 3-Clause "New" or "Revised" License

Error if the string is not encoded in utf-8 #39

Open SrujithPoondla opened 7 years ago

SrujithPoondla commented 7 years ago

The current code doesn't handle strings that are not in Unicode format. If a string is not encoded in UTF-8, it raises a UnicodeDecodeError.

Error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xd1 in position 8: invalid continuation byte

Dev Analysis: String: "sharkey_„Žs cafe". I used convert_to_unicode in utils.py to convert non-Unicode strings to Unicode, but it still could not parse the non-Unicode characters. For now we are ignoring non-Unicode strings, but they need to be handled.
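
One possible way to handle such input, sketched here in Python 3, is to try UTF-8 first, then fall back to a legacy single-byte encoding, and finally decode with replacement characters instead of raising. The helper name `decode_lenient`, the choice of cp1252 as the fallback, and the sample bytes are all assumptions made for illustration. They are not part of py_stringmatching, and the sample is not the original data from this report; it was only chosen because it triggers the same "invalid continuation byte" error at position 8.

```python
def decode_lenient(raw, encodings=("utf-8", "cp1252"), errors="replace"):
    """Hypothetical helper (not part of py_stringmatching).

    Returns a str for either str or bytes input. Each encoding is tried
    strictly, in order; if all of them fail, the first encoding is used
    with the given error handler so that no exception escapes.
    """
    if isinstance(raw, str):
        return raw  # already text, nothing to decode
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    # Last resort: keep what can be decoded and mark the rest with U+FFFD.
    return raw.decode(encodings[0], errors=errors)


# Assumed sample: byte 0xd1 at position 8 is not valid UTF-8 here,
# because the byte after it is not a continuation byte.
sample = b"sharkey_\xd1s cafe"
text = decode_lenient(sample)
```

Guessing a fallback encoding can silently produce the wrong characters, so when the real encoding of the data is known it should be listed first in `encodings`. Passing `errors="ignore"` instead of the default drops undecodable bytes rather than replacing them.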