elyase / geotext

Geotext extracts country and city mentions from text
MIT License

bug fix: encoding missing from open() #10

Open DevinCharles opened 7 years ago

DevinCharles commented 7 years ago

The encoding option that is passed to read_table was never passed on to the open() call.

Should fix ISSUE #8: UnicodeDecodeError: 'gbk' codec can't decode byte 0xbf in position 2: illegal multibyte sequence

elyase commented 7 years ago

Hey, thanks for the fix! The tests are failing because the built-in open() does not accept an encoding argument in Python 2. Would it be too much to ask that you use something from:

https://stackoverflow.com/questions/10971033/backporting-python-3-openencoding-utf-8-to-python-2

so that the library can stay Python 2 compatible?
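One commonly used option for this kind of backport is io.open, which accepts an encoding argument on both Python 2 and Python 3. The sketch below is only an illustration of that approach: read_rows and the tab-separated demo file are hypothetical and are not geotext's actual code or data.

```python
import io
import os
import tempfile


def read_rows(path, encoding="utf-8"):
    # Hypothetical helper, not geotext's read_table.
    # io.open takes `encoding` on Python 2 and 3, unlike Python 2's built-in open().
    with io.open(path, "r", encoding=encoding) as f:
        return [line.rstrip("\r\n").split("\t") for line in f if line.strip()]


# Write a small UTF-8 demo file as raw bytes, then read it back as text.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(demo_path, "wb") as f:
    f.write(u"BR\tS\u00e3o Paulo\nDE\tK\u00f6ln\n".encode("utf-8"))

rows = read_rows(demo_path)
os.remove(demo_path)
print(rows)
```

Because the encoding is stated explicitly, the result should not depend on the platform's default codec (such as 'gbk' in ISSUE #8).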

DevinCharles commented 7 years ago

There appears to be a difference in the way Python 3 and Python 2 handle .decode('utf-8'), so my most recent commit fails test_cities.

Python 3

['\ufeffSao Paulo é a capital do estado de Sao Paulo. As cidades de Barueri\r\n',
 'e Carapicuíba fazem parte da Grade Sao Paulo. O Rio de Janeiro\r\n',
 'continua lindo. No carnaval eu vou para Salvador. No reveillon eu \r\n',
 'quero ir para Santos.']

Python 2

[u'\ufeffS\xe3o Paulo \xe9 a capital do estado de S\xe3o Paulo. As cidades de Barueri\r\n',
 u'e Carapicu\xedba fazem parte da Grade S\xe3o Paulo. O Rio de Janeiro\r\n',
 u'continua lindo. No carnaval eu vou para Salvador. No reveillon eu \r\n',
 u'quero ir para Santos.']

I'll have to think about this... or wait for someone who actually knows what they're doing to fix it :)
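A note on the two listings above: Python 2's repr escapes non-ASCII characters (for example \xe9) while Python 3 prints them directly, so part of the visible difference is display only. Both listings also start with '\ufeff', a byte order mark that the 'utf-8' codec keeps as a character. Whether that is what breaks test_cities is not established in this thread; the sketch below only shows, on a made-up byte string, how the standard 'utf-8-sig' codec differs from 'utf-8' in that respect.

```python
import codecs

# Made-up sample bytes: a UTF-8 BOM followed by UTF-8 encoded text.
text = u"S\u00e3o Paulo \u00e9 a capital"
raw = codecs.BOM_UTF8 + text.encode("utf-8")

plain = raw.decode("utf-8")         # keeps the BOM as a leading U+FEFF
stripped = raw.decode("utf-8-sig")  # drops a leading BOM if one is present

# repr() output differs between Python 2 and 3, but the strings compare equal.
print(repr(plain))
print(repr(stripped))
print(plain[1:] == stripped)
```

io.open also accepts encoding="utf-8-sig", so the same codec can be used when reading a file rather than decoding bytes by hand.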