tbk303 closed this issue 9 years ago
Hi @tbk303
Docsplit has two settings that you can use to disable text cleaning. The first is the "--no-clean" flag which will explicitly disable cleaning.
There's also the "--language" setting. It controls which Tesseract language pack is used for OCR. If it is specified, Docsplit also sets "clean" to false. If you're working with German PDFs, I'd recommend installing the Tesseract German language data and then specifying "--language deu" as part of the docsplit command line.
Please give one or both of those options a spin and let us know if they don't work for you.
Cheers!
Hi! I am using the language setting with a value of "deu" and I have the tesseract-deu package installed (I see docsplit correctly calling tesseract with that option in my process list), but it still replaces the umlauts.
If I explicitly disable cleaning, then it works (although it would be great to have cleaning for non-English documents as well).
I am using the docsplit library via Ruby code, not the command line utility. Maybe that makes a difference?
Aha. Yes, that'd be why. The "--language" flag only sets "clean" to false on the command line. If you use Docsplit as a library, you'll have to set "clean" to false yourself.
I'll update #117 to make sure we document that as well.
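For anyone reading along, here is a minimal sketch of what that could look like from Ruby. It assumes the library accepts the same options as the command line flags (:ocr, :language, :clean, :output) through Docsplit.extract_text; the constant and the helper method names are made up for illustration, so please check them against the Docsplit version you have installed.

```ruby
# Hypothetical constant: option keys are assumed to mirror the
# command line flags discussed above (--language, --no-clean).
GERMAN_OCR_OPTIONS = {
  ocr: true,        # extract text via Tesseract OCR
  language: 'deu',  # needs the Tesseract German language data installed
  clean: false      # the library does not switch this off for you
}.freeze

# Hypothetical helper: runs OCR on the given PDFs and writes the
# resulting text files into output_dir.
def extract_german_text(pdf_paths, output_dir)
  require 'docsplit' # loaded lazily so the options can be inspected without the gem
  Docsplit.extract_text(pdf_paths, GERMAN_OCR_OPTIONS.merge(output: output_dir))
end
```

Calling extract_german_text(['scan.pdf'], 'out') should then leave the umlauts alone, since cleaning is switched off explicitly.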
Ok, thanks for the help. Nevertheless, is there any chance to see cleaning for non-English OCR'ed text as well?
@tbk303 We'd be happy to take a pull request, but I'm not sure where on our schedule non-English cleaning is going to surface.
If text is extracted by OCR, it will be cleaned afterwards by Docsplit::TextCleaner which starts by converting the text to ASCII, replacing all non-ASCII characters with '?'. Therefore all German umlauts (and probably special characters from other languages, too) are lost.
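The effect can be reproduced in plain Ruby, without Docsplit. This is only a sketch of the kind of ASCII conversion described above, not the actual Docsplit::TextCleaner code; ascii_only is a hypothetical helper name.

```ruby
# Hypothetical helper: converts text to ASCII and replaces every
# character that has no ASCII equivalent with '?'.
def ascii_only(text)
  text.encode('ASCII', invalid: :replace, undef: :replace, replace: '?')
end

puts ascii_only("M\u00FCller wohnt in K\u00F6ln") # prints "M?ller wohnt in K?ln"
```

Each umlaut is a single non-ASCII character, so it is turned into one '?' and the original letter cannot be recovered afterwards.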