VikParuchuri / pdftext

Extract structured text from pdfs quickly
Apache License 2.0
291 stars 26 forks

Hyphens lost #1

Closed o90o closed 4 months ago

o90o commented 4 months ago

Firstly, thanks for publishing this tool! It's already better than pdftotext from poppler, which is awesome!

There is a problem with hyphens being lost in certain documents.

Please see this PDF file:

taxation.pdf

On page 2 there are two lines:

 The deduction of special professional expenses under paragraphs 1 and 2 is permit-
ted ...

Notice the hyphenation of the word permitted.

The output of pdftext is missing the hyphen:

3 The deduction of special professional expenses under paragraphs 1 and 2 is permit
ted 

How could this be fixed? I checked the source code, but there is nothing obviously removing hyphens.

VikParuchuri commented 4 months ago

I just fixed this in master and will make a new release soon. With JSON output, it will always keep hyphens. With plain text output, it will remove hyphens and wrap words unless you pass the keep_hyphens flag.

Thanks for finding the issue!
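
For readers who want to see the plain-text behaviour described above, here is a minimal sketch of the idea. It is not pdftext's actual implementation: the function name `join_hyphenated_lines` and the regex are assumptions made for illustration, and only the `keep_hyphens` name comes from the comment above.

```python
import re

# Hypothetical helper for illustration only; not taken from pdftext's source.
# Matches a word character, a hyphen at the end of a line, and the first
# word character of the next line.
_LINE_END_HYPHEN = re.compile(r"(\w)-[ \t]*\n[ \t]*(\w)")


def join_hyphenated_lines(text: str, keep_hyphens: bool = False) -> str:
    """Rejoin words that were split across a line break by a trailing hyphen.

    With keep_hyphens=True the text is returned unchanged, so the
    hyphen and the line break both survive.
    """
    if keep_hyphens:
        return text
    # Drop the hyphen and the line break, keeping the two word halves.
    return _LINE_END_HYPHEN.sub(r"\1\2", text)


sample = "paragraphs 1 and 2 is permit-\nted ..."
print(join_hyphenated_lines(sample))                     # hyphen removed, word rejoined
print(join_hyphenated_lines(sample, keep_hyphens=True))  # text left as-is
```

Note that a simple rule like this cannot tell a line-break hyphen from a real compound such as "well-\nknown", which is one reason a flag for keeping hyphens is useful.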