creatale / node-dv

A node.js library for processing and understanding scanned documents

tesseract.tessedit_char_whitelist not working with umlauts #24

Closed HansHammel closed 9 years ago

HansHammel commented 9 years ago

German umlauts and ß (ö, ä, ü, ß, Ö, Ä, Ü) seem to be ignored by the tessedit_char_whitelist option.

WolfgangFellger commented 9 years ago

That is pretty much our use case, and it's working here... Can you give a complete example?

Shots in the dark: Are you using the 'deu' language file? Apart from that, I could imagine a hiccup if your source file is not UTF-8-encoded; please check that.

Edit: Just tried again with our application, setting tessedit_char_whitelist = 'ßÄÖÜ' does give me a nice set of gibberish containing those characters.
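
For the encoding point, here is a minimal sketch of how one might check whether a file's bytes are valid UTF-8 in plain Node.js. It does not use node-dv; `isValidUtf8` is a hypothetical helper name, and the round-trip check is just one simple way to detect a Latin-1 file being read as UTF-8.

```javascript
// Hypothetical helper (not part of node-dv): returns true if the buffer
// survives a decode/encode round trip as UTF-8. Invalid byte sequences are
// decoded to U+FFFD, which re-encodes to different bytes.
function isValidUtf8(buf) {
  return Buffer.from(buf.toString('utf8'), 'utf8').equals(buf);
}

// 'ßÄÖÜ' written with escapes so this snippet itself does not depend on
// the encoding of the file it is saved in.
const whitelist = '\u00DF\u00C4\u00D6\u00DC';

const asUtf8 = Buffer.from(whitelist, 'utf8');     // 2 bytes per character
const asLatin1 = Buffer.from(whitelist, 'latin1'); // 1 byte per character

console.log(asUtf8.length, isValidUtf8(asUtf8));     // 8 true
console.log(asLatin1.length, isValidUtf8(asLatin1)); // 4 false
```

To test a real script, something like `isValidUtf8(fs.readFileSync(path))` should work the same way. If it returns false, the umlauts in the whitelist literal are probably not reaching Tesseract as the characters you typed.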