documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs
http://documentcloud.github.io/docsplit/

German umlauts are replaced by ? after OCR #116

Closed tbk303 closed 9 years ago

tbk303 commented 10 years ago

If text is extracted by OCR, it will be cleaned afterwards by Docsplit::TextCleaner which starts by converting the text to ASCII, replacing all non-ASCII characters with '?'. Therefore all German umlauts (and probably special characters from other languages, too) are lost.
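The effect described above can be reproduced in plain Ruby. This is only a minimal sketch of a lossy ASCII conversion, not Docsplit's actual cleaning code; the helper name `ascii_only` is hypothetical and exists just for illustration.

```ruby
# Illustration only: a hypothetical helper, not taken from Docsplit.
# Transcodes a UTF-8 string to ASCII, substituting "?" for every
# character that has no ASCII equivalent.
def ascii_only(text)
  text.encode("ASCII", invalid: :replace, undef: :replace, replace: "?")
end

sample = "M\u00FCller"   # "Müller", written with an escape for the umlaut
puts ascii_only(sample)  # prints "M?ller"
```

Any cleaning step that begins with a conversion like this loses umlauts before the later cleanup rules ever see them.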

nathanstitt commented 10 years ago

Hi @tbk303

Docsplit has two settings that you can use to disable text cleaning. The first is the "--no-clean" flag which will explicitly disable cleaning.

There's also the "--language" setting. It's used to control which Tesseract language pack should be used for OCR. If it is specified, then Docsplit also sets "clean" to false. If you're working with German PDFs, I'd recommend installing the Tesseract German language data and then specifying "--language deu" as part of the docsplit command line.

Please give one or both of those options a spin and let us know if they don't work for you.

Cheers!

tbk303 commented 10 years ago

Hi! I am using the language setting with a value of "deu" and I have the tesseract-deu package installed (I see docsplit correctly calling tesseract with that option in my process list), but it still replaces the umlauts.

If I explicitly disable cleaning, then it works (although it would be great to have cleaning for non-English documents as well).

tbk303 commented 10 years ago

I am using the docsplit library via Ruby code, not the command line utility. Maybe that makes a difference?

nathanstitt commented 10 years ago

Aha. Yes, that'd be why. The "--language" flag only sets "clean" to false in the command line. If you use Docsplit as a library, you'll have to set "clean" to false yourself.

I'll update #117 to make sure we document that as well.
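For library users, the advice above might look roughly like the following. This is a sketch under assumptions: the option keys `:ocr`, `:language`, `:clean`, and `:output` are assumed to match what `Docsplit.extract_text` accepts in your installed version, and `ocr_options`, `scan.pdf`, and `out` are hypothetical names used only for illustration.

```ruby
# Hypothetical helper: builds an options hash that sets a language and,
# as the command line is described as doing, also turns cleaning off.
# The option keys are assumptions; check them against your Docsplit version.
def ocr_options(language, output_dir)
  { ocr: true, language: language, clean: false, output: output_dir }
end

opts = ocr_options("deu", "out")

# Guarded so the sketch still loads when the docsplit gem is not required.
if defined?(Docsplit)
  Docsplit.extract_text("scan.pdf", opts)
end
```

Passing `clean: false` explicitly is the part that matters here, since the library call does not infer it from the language option.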

tbk303 commented 10 years ago

Ok, thanks for the help. Nevertheless, is there any chance of seeing cleaning for non-English OCR'ed text as well?

knowtheory commented 9 years ago

@tbk303 We'd be happy to take a pull request, but I'm not sure where on our schedule non-English cleaning is going to surface.