deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

text parsers don't support encodings other than 'utf-8' #426

Open davidorlov12 opened 2 years ago

davidorlov12 commented 2 years ago

Describe the bug When parsing files with textract, specifically '.txt' files, the input_encoding and output_encoding arguments simply don't work: text in any encoding other than 'utf-8' fails to parse.

To Reproduce Steps to reproduce the behavior:

  1. Create a .txt file encoded as Latin-1 (e.g. latin_encoding.txt)
  2. textract.process('latin_encoding.txt', input_encoding='latin_1', output_encoding='latin_1')
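The two steps above can be sketched as a standalone script. This is only a minimal reproduction sketch, assuming Python 3; the sample text and the helper names `write_latin1_file` and `decodes_as` are hypothetical and not part of textract. The textract call is guarded so the script still runs when textract is not installed, and it makes no claim about which error textract raises.

```python
import os
import tempfile

# Sample text containing non-ASCII characters (an assumption; any Latin-1 text works).
SAMPLE = "café déjà vu"


def write_latin1_file(directory):
    """Write SAMPLE to latin_encoding.txt encoded as Latin-1 and return its path."""
    path = os.path.join(directory, "latin_encoding.txt")
    with open(path, "w", encoding="latin_1") as handle:
        handle.write(SAMPLE)
    return path


def decodes_as(path, encoding):
    """Return True if the file's raw bytes decode cleanly with the given encoding."""
    with open(path, "rb") as handle:
        raw = handle.read()
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Step 1: create the Latin-1 encoded file in a temporary directory.
workdir = tempfile.mkdtemp()
path = write_latin1_file(workdir)

# Step 2: call textract exactly as in the report (skipped if textract is missing).
try:
    import textract  # third-party package
except ImportError:
    textract = None

if textract is not None:
    try:
        print(textract.process(path, input_encoding="latin_1", output_encoding="latin_1"))
    except Exception as exc:  # report whatever textract raises, without assuming its type
        print(f"textract failed: {exc!r}")
```

The `decodes_as` helper is there only to confirm that the test file really is valid Latin-1 and not valid UTF-8, so that a failure can be attributed to encoding handling rather than to a malformed input file.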

Expected behavior The file is parsed successfully.

Screenshots [image: Screen Shot 2022-06-30 at 10 53 52]