Open tzkli opened 4 years ago
This is a common problem in text processing with Python; I literally dealt with it today. The documentation explains how it happens: https://docs.python.org/3/howto/unicode.html.
Sometimes, sadly, if you don't have the right encoding, you lose data. I also end up googling to see what solution works best for the case I am working on. Some of the upcoming HWs also deal with this.
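To make the "you lose data" point concrete, here is a minimal sketch using only the standard library. The sample bytes are made up for illustration (they are not taken from the reddit file): they hold Latin-1 text, so the byte 0xef is not valid UTF-8 at that position.

```python
# Hypothetical sample: "naïve café" encoded as Latin-1, not UTF-8.
raw = b"na\xefve caf\xe9"

try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc  # strict decoding fails at the 0xef byte

# Decoding as UTF-8 while tolerating errors loses the accented characters.
ignored = raw.decode("utf-8", errors="ignore")    # bad bytes are dropped silently
replaced = raw.decode("utf-8", errors="replace")  # bad bytes become U+FFFD

# Decoding with the encoding the bytes were actually written in loses nothing.
correct = raw.decode("latin-1")

print(error)
print(repr(ignored), repr(replaced), repr(correct))
```

Skipping or replacing bytes keeps the code running but changes the text, which is why finding the right encoding is usually the better fix when it is possible.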
Hi,
When reading CSV files, for example,
redditDf = pandas.read_csv('data/reddit.csv', index_col = 0)
I've run into this error multiple times:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 3304: invalid continuation byte
For my own data set, I Googled the error and fixed it with
which simply skips the parts that cause errors. That works well for my own data set because very few observations are lost, but for the reddit data the same code leaves only one row in the data frame. How can I fix this?
Thanks!
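One thing that may be worth trying before skipping anything is to check which encodings can decode the file at all. The sketch below is an assumption-laden illustration, not a guaranteed fix: the helper name `first_working_encoding`, the candidate list, and the sample bytes are all hypothetical, and a successful decode does not prove the encoding is the right one (`latin-1` in particular accepts any byte sequence, so it is only a last resort).

```python
def first_working_encoding(raw, candidates=("utf-8-sig", "cp1252", "latin-1")):
    """Return the first candidate that decodes raw without errors, else None.

    Hypothetical helper: the candidate list is a guess at common encodings
    and should be adjusted for the data at hand.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes; try the next
        return name
    return None


# Made-up sample standing in for the raw bytes of a CSV file.
sample = "id,text\n0,naïve café\n".encode("cp1252")
guess = first_working_encoding(sample)
print(guess)
```

For a real file, the bytes could come from opening it in binary mode (`open(path, "rb").read()`), and the resulting name can then be passed to `pandas.read_csv` through its `encoding` parameter. Recent pandas versions (1.3 and later) also accept `encoding_errors="replace"`, which keeps every row and marks undecodable bytes instead of dropping them.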