deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License
3.84k stars 585 forks

adding encoding options for pdftotext #469

Open Enzodtz opened 1 year ago

Enzodtz commented 1 year ago

Hi,

I'm trying to use this tool to extract text from a PDF file, but it doesn't seem to support passing the encoding directly to pdftotext.

This causes problems with characters that aren't in the default encoding, such as ã, à, and á: they are not saved correctly in the output.

To fix this, I added a shell_encoding kwarg that lets the caller choose the encoding for the shell parser (pdftotext, in this case).

To do that, I also needed to refactor the argument parsing code a little.
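
For illustration, here is a rough sketch of the idea rather than the actual diff. It assumes the encoding is forwarded through pdftotext's `-enc` option; the helper names `build_pdftotext_command` and `extract_pdf_text` are hypothetical and are not part of textract. It also assumes the chosen name (for example `UTF-8`) is accepted both by pdftotext and by Python's decoder.

```python
import subprocess


def build_pdftotext_command(filename, shell_encoding=None):
    """Build the argv list for pdftotext (hypothetical helper, not textract API)."""
    command = ["pdftotext"]
    if shell_encoding is not None:
        # -enc tells pdftotext which encoding to use for its text output.
        command += ["-enc", shell_encoding]
    # "-" as the output file makes pdftotext write the text to stdout.
    command += [filename, "-"]
    return command


def extract_pdf_text(filename, shell_encoding="UTF-8"):
    """Run pdftotext and decode its stdout with the same encoding.

    Assumption: the encoding name is valid for both pdftotext and Python.
    """
    result = subprocess.run(
        build_pdftotext_command(filename, shell_encoding),
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode(shell_encoding)
```

Keeping the command construction separate from the subprocess call makes the flag handling easy to check without pdftotext installed.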

Thanks.