Closed: o90o closed this 4 months ago
I just fixed this in master and will make a new release soon. With JSON output, it will always keep hyphens. With plain text output, it will remove hyphens and wrap words unless you pass the keep_hyphens flag.
Thanks for finding the issue!
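For readers who want to see the plain text behaviour described above, here is a minimal sketch in Python. It is not pdftext's actual implementation: the function name `join_hyphenated` and the sample string are made up for illustration, and only the `keep_hyphens` parameter name is borrowed from the flag mentioned above.

```python
import re


def join_hyphenated(text: str, keep_hyphens: bool = False) -> str:
    """Hypothetical helper: rejoin words that were split by a line-end hyphen.

    This only illustrates the idea; pdftext may do it differently.
    """
    if keep_hyphens:
        # Leave the text untouched, hyphen and line break included.
        return text
    # Drop a hyphen that sits between a word character at the end of one
    # line and a word character at the start of the next line.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


# Made-up sample, not taken from taxation.pdf.
sample = "this is not per-\nmitted under the act"
print(join_hyphenated(sample))
print(join_hyphenated(sample, keep_hyphens=True))
```

The first call prints the sentence with "permitted" rejoined on one line, and the second prints the sample unchanged. A simple rule like this cannot tell a line-break hyphen from a real compound such as "well-\nknown", which is one reason an opt-out flag is useful.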
Firstly, thanks for publishing this tool! It's already better than pdftotext from poppler, which is awesome!
There is a problem with lost hyphenations in certain documents.
Please see this PDF file:
taxation.pdf
On page 2 there are two lines:
Notice the hyphenation of the word "permitted". The output of pdftext is missing the hyphen:
How could this be fixed? I checked the source code, but I could not find anything that obviously removes hyphens.