deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 40 #309

Open ghost opened 5 years ago

ghost commented 5 years ago

Hey, I tried using textract to extract text from a document that is in Japanese. I used this code:

text = textract.process(".txt",encoding="utf8")

I also passed the extra option encoding=, but it looks like textract just ignores it and keeps trying to detect the encoding itself. I then got this error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 40: character maps to <undefined>

Another thing to note is that it works perfectly when I use open() and read() to get the text, but whenever I use textract it gives me the error above. Is a fix coming for this? It looks like a lot of people are having trouble with this encoding issue. Are there any alternatives to textract?
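
A minimal sketch of where this kind of traceback typically comes from, assuming the file is UTF-8 and that something along the way decoded it with a Windows code page such as cp1252 (Python reports those codecs as 'charmap'). The sample string and the choice of cp1252 are assumptions for illustration, not taken from the file in this report.

```python
# Japanese text encoded as UTF-8 contains many 0x81 bytes:
# "あ" (U+3042) is the three bytes E3 81 82.
data = "あいう".encode("utf-8")

# cp1252 has no character assigned to byte 0x81, so decoding fails.
try:
    data.decode("cp1252")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)

# Decoding with the encoding the bytes were written in succeeds.
text = data.decode("utf-8")
print(text)
```

This would also be consistent with open() and read() working when they happen to pick an encoding that matches the file.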

cagriaslan commented 4 years ago

I have the same problem with PDF files that involve math equations. I tried latin1 and utf-8 but still no luck.

Actually, I tried to skip the non-UTF-8 characters by using errors="ignore", like in str.decode, but couldn't find where to put it.

jpweytjens commented 4 years ago

Can you provide the file that you're trying to extract as well as the following information please?

cagriaslan commented 4 years ago

The file is at https://github.com/cagriaslan/PdfParser/blob/master/articles/makale4.pdf and all of the code can be found on my GitHub.

jpweytjens commented 4 years ago

This is one of the most difficult problems with parsing files: finding or guessing the correct encoding. Right now, textract relies completely on chardet to do this. I'm working on an update of textract where other, hopefully more robust, methods such as UnicodeDammit are available, along with the option to manually specify the encoding.
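
To illustrate the general idea of falling back through several guesses, here is a stdlib-only sketch. It is not chardet's or UnicodeDammit's algorithm and not textract's code; the helper name `decode_with_fallback` and the candidate list are made up for this example.

```python
def decode_with_fallback(raw, candidates=("utf-8", "shift_jis", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: the order matters, because permissive
    single-byte encodings accept almost any input and should come last.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a character, so it never
    # fails, but the result may be mojibake.
    return raw.decode("latin-1"), "latin-1"


utf8_bytes = "日本語のテキスト".encode("utf-8")
sjis_bytes = "日本語のテキスト".encode("shift_jis")

print(decode_with_fallback(utf8_bytes))
print(decode_with_fallback(sjis_bytes))
```

A clean decode only shows that the bytes are valid in that encoding, not that the guess is right, which is why a manual override is still useful.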

The current encoding kwarg of process confusingly specifies the desired output encoding, not the original encoding of the parsed file. In the meantime, you can install the latest version of textract from git, which includes a small fix: I've added the option to manually specify the encoding of the parsed file. I could correctly parse your PDF file with input_encoding="utf8".

The command now works as follows:

textract.process(filename, input_encoding=None, output_encoding="utf8")
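
The difference between the two keyword arguments can be pictured with plain byte strings. This sketch is not textract's implementation; `transcode` is a hypothetical helper showing that an input encoding is needed to read the bytes, while an output encoding only controls how the result is written back out.

```python
def transcode(raw, input_encoding, output_encoding="utf8"):
    """Decode raw bytes with input_encoding, re-encode with output_encoding."""
    text = raw.decode(input_encoding)    # interpret the source bytes
    return text.encode(output_encoding)  # choose the form of the result


source = "日本語".encode("shift_jis")
result = transcode(source, input_encoding="shift_jis", output_encoding="utf8")
print(result.decode("utf8"))
```

Setting only the output encoding cannot help if the source bytes were misread in the first step.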

cagriaslan commented 4 years ago

Thanks, I will try it out. Another solution could be an option to skip problematic characters or replace them, as in str.decode.
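
For reference, this is how the error handlers behave on a plain decode call in Python 3 (`bytes.decode`; the keyword is `errors`). Only the standard library is used here, and the sample bytes are invented; whether and how textract exposes such an option is not shown in this thread.

```python
# UTF-8 text with one invalid byte (0x81 on its own) in the middle.
raw = b"abc \x81 def"

strict_failed = False
try:
    raw.decode("utf-8")  # errors="strict" is the default
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode("utf-8", errors="ignore")    # drop undecodable bytes
replaced = raw.decode("utf-8", errors="replace")  # substitute U+FFFD

print(strict_failed)
print(repr(ignored))
print(repr(replaced))
```

Both handlers lose information silently, so they are a workaround for when the real encoding cannot be determined.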