ariddell / tatom

Quantitative Text Analysis for the digitale Geisteswissenschaften
https://de.dariah.eu/tatom/

UnicodeDecodeError #11

Open dkltimon opened 9 years ago

dkltimon commented 9 years ago

Hi Allen,

https://de.dariah.eu/tatom/preprocessing.html#every-1-000-words

    def split_text(filename, n_words):
        """Split a text into chunks approximately n_words words in length."""
        input = open(filename, 'r')
        words = input.read().split(' ')
        input.close()

The problem is at the line "input = open(filename, 'r')".

I don't know whether "input = open(filename, 'r', encoding='UTF-8')" would be better.

Otherwise you may get the error message: "UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 10: character maps to <undefined>".
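
For illustration, here is a minimal sketch of the suggested change. It is not the tutorial's code: read_words is a hypothetical helper, and the sample text and the use of cp1252 to stand in for a Windows 'charmap' default are assumptions chosen to reproduce the same kind of error.

```python
import os
import tempfile


def read_words(filename, encoding="utf-8"):
    """Hypothetical helper: read a file with an explicit encoding, split on spaces."""
    with open(filename, "r", encoding=encoding) as f:
        return f.read().split(" ")


# Write a small UTF-8 file. 'Ï' is encoded as the bytes 0xC3 0x8F,
# so the file contains the byte 0x8f mentioned in the error message.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write("NAÏVE text")

# Reading with the encoding stated explicitly works on any platform.
words = read_words(path)

# Decoding the same bytes as cp1252 (assumed here as a typical Windows
# default) fails, because 0x8f is not mapped in that code page.
try:
    read_words(path, encoding="cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

os.remove(path)
```

Passing the encoding explicitly makes the result independent of the platform default, which is the point of the suggestion above.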

ariddell commented 9 years ago

You're completely right. Thanks for the report.

I'm very used to Linux and OS X, where the default encoding is usually UTF-8, so you don't need to specify it under Python 3. For the longest time I assumed that UTF-8 was actually the fixed default for Python 3.
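
A quick way to see which encoding open() falls back to on a given machine is sketched below. The variable name is only illustrative, and the printed value depends on the operating system and locale settings.

```python
import locale
import sys

# When encoding= is omitted, open() in text mode uses the locale's
# preferred encoding, which differs between platforms and configurations.
default_encoding = locale.getpreferredencoding(False)

print("platform:", sys.platform)
print("default text encoding:", default_encoding)
```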