Hi,
I'm trying to use this tool to extract text from a PDF file, but it doesn't seem to support passing an encoding through to `pdftotext`. This causes problems with letters outside the default encoding, such as ã, à and á: they are saved as �.

To fix this, I added a `shell_encoding` kwarg that lets the caller choose the encoding used by the shell parser (`pdftotext`, in this case). To do that, I also had to refactor the argument parsing code a little.
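
To illustrate the idea, here is a minimal sketch, not the actual diff. It assumes the kwarg is forwarded to `pdftotext` through its `-enc` option; the helper names `build_pdftotext_command` and `extract_text` are hypothetical and only serve the example.

```python
import subprocess


def build_pdftotext_command(pdf_path, shell_encoding=None):
    """Build the argv list for pdftotext (hypothetical helper for this sketch)."""
    command = ["pdftotext"]
    if shell_encoding is not None:
        # -enc selects the encoding pdftotext uses for its text output.
        command += ["-enc", shell_encoding]
    # "-" as the output file tells pdftotext to write to stdout.
    command += [pdf_path, "-"]
    return command


def extract_text(pdf_path, shell_encoding=None):
    """Run pdftotext and return its raw output bytes (hypothetical helper)."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, shell_encoding),
        check=True,
        stdout=subprocess.PIPE,
    )
    # Decoding is left to the caller, who knows which encoding was requested.
    return result.stdout
```

With this shape, a call such as `extract_text("document.pdf", shell_encoding="UTF-8")` would run `pdftotext -enc UTF-8 document.pdf -`, and omitting the kwarg keeps the current behaviour.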
Thanks.