Computational-Content-Analysis-2020 / frequently-asked-questions

Repo to ask questions and see answers

UnicodeDecodeError #15

Open tzkli opened 4 years ago

tzkli commented 4 years ago

Hi,

When reading CSV files, for example with redditDf = pandas.read_csv('data/reddit.csv', index_col=0), I've run into this error multiple times:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 3304: invalid continuation byte

For my own data set, I Googled the error and fixed it with

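import codecs
import pandas as pd

# errors='ignore' silently drops any bytes that are not valid UTF-8
with codecs.open('xxx.csv', 'r', encoding='utf-8', errors='ignore') as data2:
    # Skip rows that fail to parse (pandas >= 1.3; older versions use error_bad_lines=False)
    df = pd.read_csv(data2, on_bad_lines='skip')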

which simply skips the bytes and rows that cause errors. That works well for my own data, where the problem is minor and very few observations are lost, but for the reddit data the same code leaves only one row in the data frame. How can I fix this?

Thanks!

bhargavvader commented 4 years ago

This is a common problem in text processing with Python; I dealt with it just today. The Python Unicode HOWTO explains how it happens: https://docs.python.org/3/howto/unicode.html.
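
To make the mechanism concrete, here is a minimal sketch that reproduces the same kind of error. It assumes, purely for illustration, that the offending text was written as Latin-1; the actual encoding of the reddit file is not known from this thread.

```python
# In Latin-1, 'ï' is stored as the single byte 0xEF.
raw = "naïve".encode("latin-1")  # b'na\xefve'

# In UTF-8, 0xEF announces a three-byte character, so the decoder
# expects continuation bytes next and fails when it sees plain 'v'.
try:
    raw.decode("utf-8")
    message = None
except UnicodeDecodeError as err:
    message = str(err)

# Decoding with the encoding that was actually used recovers the text.
recovered = raw.decode("latin-1")

print(message)
print(recovered)
```

The point is that the bytes are not corrupt; they are being read with a different encoding from the one they were written in.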

Sometimes, sadly, if you don't have the right encoding, you lose data. I also end up googling to find the solution that works best for the case I'm working on. Some of the upcoming homework assignments also deal with this.
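
One approach that often avoids losing data is to try a few likely encodings before falling back to dropping bytes. The sketch below is only an illustration: the helper name decode_with_fallback and the candidate list are assumptions, not something from the course materials.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes raw without error."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: keep every row and mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


# Hypothetical sample standing in for the bytes of a CSV file.
sample = "id,text\n0,caf\u00e9\n".encode("cp1252")
text, used = decode_with_fallback(sample)
print(used)
print(text)
```

For a real file, the bytes would come from open(path, 'rb').read(), and the encoding that worked can then be passed to pandas.read_csv through its encoding argument. A clean decode does not prove the guess is right, since cp1252 and latin-1 accept almost any byte, so it is worth checking a few rows with accented characters by eye.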