Encoding problems - Githubissues

jpfairbanks commented 6 years ago

In the file ref_lexicons/vader_words there are some emoticons that have encoding problems:

For example: :-Þ

What encoding is this file in?

cc: @cjhutto

cjhutto commented 6 years ago

I thought everything was in UTF-8.

jpfairbanks commented 6 years ago

I get an error when I read the file with file.readlines() and pass the result to json.dumps():

Traceback (most recent call last):
  File "reflists.py", line 21, in <module>
    print(json.dumps(wordsets, indent=2))
  File "/usr/local/lib/python2.7/json/__init__.py", line 251, in dumps
    sort_keys=sort_keys, **kw).encode(obj)
  File "/usr/local/lib/python2.7/json/encoder.py", line 209, in encode
    chunks = list(chunks)
  File "/usr/local/lib/python2.7/json/encoder.py", line 434, in _iterencode
    for chunk in _iterencode_dict(o, _current_indent_level):
  File "/usr/local/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/usr/local/lib/python2.7/json/encoder.py", line 313, in _iterencode_list
    yield buf + _encoder(value)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xde in position 2: invalid continuation byte

jpfairbanks commented 6 years ago

It is latin-1.... 😡
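
Editor's sketch of why the traceback points to Latin-1: in Latin-1 the character Þ is the single byte 0xde, while UTF-8 treats 0xde as the first byte of a two-byte sequence and expects a continuation byte after it. The sample bytes below are an assumption chosen to mirror the ":-Þ" emoticon from the report; they are not copied from the actual ref_lexicons/vader_words file.

```python
# Hypothetical sample: the emoticon ":-Þ" stored as Latin-1 bytes, followed by a newline.
raw = b":-\xde\n"

# Decoding as UTF-8 fails: 0xde opens a two-byte sequence, but the next
# byte (0x0a, the newline) is not a valid continuation byte.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# Decoding as Latin-1 always succeeds, because every byte maps to one character.
latin1_text = raw.decode("latin-1")

print(utf8_error)
print(repr(latin1_text))
```

The failing byte sits at index 2 of the sample, which is consistent with "position 2" in the error message above, though the real file may differ.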

jpfairbanks commented 6 years ago

This is how you fix it: iconv -f ISO8859-1 -t UTF8
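
Editor's sketch of the same conversion in Python, for cases where iconv is not available. The function name latin1_to_utf8 and the temporary file names are hypothetical; the thread does not say which input and output paths were used with iconv, so the demo writes its own small sample file rather than touching the real lexicon.

```python
import os
import tempfile


def latin1_to_utf8(src_path, dst_path):
    """Re-encode a Latin-1 (ISO 8859-1) text file as UTF-8 (hypothetical helper)."""
    # newline="" keeps the original line endings untouched in both directions.
    with open(src_path, "r", encoding="latin-1", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Demo on a throwaway file containing the ":-Þ" emoticon as Latin-1 bytes.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "sample_latin1.txt")
dst_path = os.path.join(tmp_dir, "sample_utf8.txt")

with open(src_path, "wb") as f:
    f.write(b":-\xde\n")

latin1_to_utf8(src_path, dst_path)

with open(dst_path, "rb") as f:
    converted = f.read()

print(converted)
```

After conversion the single byte 0xde becomes the two-byte UTF-8 sequence 0xc3 0x9e, so the file can be read as UTF-8 without a decode error.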

cjhutto commented 6 years ago

Corrected in the latest pull, yes? I think we can close this issue, @jpfairbanks, @scottagt.

scottagt commented 6 years ago

If it's been fixed and committed, then sounds good.

cjhutto / bsd

Encoding problems #8