deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

Encoding ignored #519

Open mcp292 opened 4 months ago

mcp292 commented 4 months ago

**Describe the bug**
The `encoding` argument seems to be ignored.

**To Reproduce**

  1. Save the attached image as `file.png`.

  2. Run:

    
    import textract

    print(textract.process("file.png"), "\n")
    print(textract.process("file.png", encoding="ascii"), "\n")
    print(textract.process("file.png", encoding="nonexistent encoding"), "\n")



**Expected behavior**
I would expect the three outputs above to differ. I would expect the `ascii` result not to be printed as a byte string. I would expect `nonexistent encoding` either to raise an error or to produce a warning and be ignored.
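
For illustration, here is a minimal sketch of the behavior I have in mind, done on the caller's side. It assumes `textract.process` returns a byte string, as the output above suggests. `check_encoding` and `to_text` are hypothetical helper names of my own, not part of textract.

```python
import codecs


def check_encoding(name: str) -> str:
    """Return the canonical codec name, or raise ValueError if Python does not know it."""
    try:
        return codecs.lookup(name).name
    except LookupError as exc:
        # Hypothetical policy: fail loudly instead of silently ignoring the argument.
        raise ValueError(f"unknown encoding: {name!r}") from exc


def to_text(raw: bytes, encoding: str = "utf_8") -> str:
    """Decode a byte string (such as the assumed return value of textract.process) to str."""
    return raw.decode(check_encoding(encoding))
```

With textract installed, this could be applied as `to_text(textract.process("file.png"), "utf_8")`. The encoding passed to `to_text` has to match whatever encoding the returned bytes actually use, which is the part that is unclear here.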

**Desktop (please complete the following information):**
 - OS: Fedora 40
 - Textract version: 1.6.5
 - Python version: 3.12.4
 - Virtual environment: no

**Additional context**
I haven't tested the command-line interface.