documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs
http://documentcloud.github.com/docsplit/

Docsplit.extract_text generates a String with a null byte #152

Open cedricpim opened 5 years ago

cedricpim commented 5 years ago

Hello,

First of all, thank you for the gem.

Second, I have a PDF that, when put through Docsplit.extract_text, produces a text file containing a null byte character. Shouldn't this be handled by TextCleaner#clean? Or do you think the issue lies within pdftotext/tesseract?

Unfortunately, the PDF that I am using is from a client and I can't provide it. I also haven't been able to manually create one that reproduces the problem.
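
As a possible stopgap until the cause is found, the extracted text could be post-processed to drop null bytes before it is used. This is only a sketch: `strip_null_bytes` is a hypothetical helper, not part of Docsplit, and it assumes the text has already been read into a Ruby String with a valid encoding.

```ruby
# Hypothetical helper (not part of Docsplit): removes every NUL byte
# from a string and returns a new string, leaving the input untouched.
def strip_null_bytes(text)
  text.delete("\u0000")
end

# Stand-in for text read back from an extracted file; the real input
# would come from whatever file Docsplit.extract_text wrote.
sample = "first page\u0000 second page"
cleaned = strip_null_bytes(sample)

puts cleaned.include?("\u0000") # false
```

This only hides the symptom. It does not say whether the null byte comes from pdftotext, tesseract, or somewhere in between.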