Hello,
First of all, thank you for the gem.
Second, I currently have a PDF that, when put through `Docsplit.extract_text`, produces a file containing a null byte character. Shouldn't this be handled by `TextCleaner#clean`? Or do you think that the issue is within `pdftotext`/`tesseract`?

Unfortunately, the PDF that I am using is from a client and I can't provide it. I also haven't been able to manually create one that causes this.
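
For reference, the kind of check and stopgap I have in mind looks roughly like this. It is only a sketch: `contains_null_byte?` and `strip_null_bytes` are hypothetical helpers, not part of Docsplit, and they assume the extracted text has already been read into a validly encoded Ruby string.

```ruby
# Hypothetical helper (not part of Docsplit): true if the text contains a NUL byte.
def contains_null_byte?(text)
  text.include?("\0")
end

# Hypothetical helper (not part of Docsplit): returns a copy of the text
# with every NUL byte removed; the original string is left untouched.
def strip_null_bytes(text)
  text.delete("\0")
end
```

In practice this would be applied to the contents of each extracted text file (for example via `File.read` and `File.write`), but it only treats the symptom and does not say whether the null byte originates in `pdftotext`, `tesseract`, or the cleaner.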