ContentMine / phylotree

A repository for ami-phylotree development

Test the effect of using character whitelists in tesseract #31

Open rossmounce opened 9 years ago

rossmounce commented 9 years ago

We should try improving our OCR output by restricting tesseract to a whitelist of characters. This StackOverflow post appears to describe a simple way to do this: http://stackoverflow.com/questions/2363490/limit-characters-tesseract-is-looking-for

I think we should NOT include these characters in the whitelist: \ / $ % ^ & # ! ~ £
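
As a rough illustration of the idea, here is a minimal Python sketch that builds a whitelist without the characters listed above and assembles a tesseract command line that passes it through the `tessedit_char_whitelist` config variable. The choice of ASCII letters, digits and punctuation as the starting set is an assumption, as are the file names `tree.png` and `tree_whitelisted`; whether the whitelist is honoured may also depend on the tesseract version and OCR engine in use.

```python
import string

# Characters proposed above for exclusion from the whitelist.
EXCLUDED = set("\\/$%^&#!~£")


def build_whitelist(excluded=EXCLUDED):
    """Return ASCII letters, digits and punctuation minus the excluded set.

    The starting character set is an assumption, not something the issue specifies.
    """
    candidates = string.ascii_letters + string.digits + string.punctuation
    return "".join(ch for ch in candidates if ch not in excluded)


def tesseract_command(image_path, output_base, whitelist):
    """Build an argument list suitable for subprocess.run (no shell quoting needed)."""
    return [
        "tesseract", image_path, output_base,
        "-c", "tessedit_char_whitelist=" + whitelist,
    ]


whitelist = build_whitelist()
# Hypothetical input image and output base name.
command = tesseract_command("tree.png", "tree_whitelisted", whitelist)
```

Passing `command` to `subprocess.run` as a list avoids having to escape the quote characters that remain in the whitelist.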

Of course, we'd need to test the effect of this change. I will try to find example files whose raw, unmodified tesseract output contains these characters, and then compare that output with the output produced when the whitelist is applied.
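
One possible way to make that comparison concrete is sketched below: count how often each unwanted character appears in the raw output and in the whitelisted output. The function names are hypothetical and the two sample strings are invented for illustration; they are not real tesseract output. Counting these characters is only a proxy, since it does not show whether the characters that replace them are correct.

```python
from collections import Counter

# Characters we would prefer not to see in the OCR output.
UNWANTED = "\\/$%^&#!~£"


def count_unwanted(text, unwanted=UNWANTED):
    """Count occurrences of each unwanted character in an OCR output string."""
    return dict(Counter(ch for ch in text if ch in unwanted))


def compare_outputs(raw_text, whitelisted_text):
    """Return unwanted-character counts for the raw and whitelisted outputs."""
    return {
        "raw": count_unwanted(raw_text),
        "whitelisted": count_unwanted(whitelisted_text),
    }


# Invented sample strings, standing in for the contents of two output files.
raw_sample = "Homo sapiens 0.98 $ Pan trog!odytes #"
whitelisted_sample = "Homo sapiens 0.98 S Pan troglodytes"
report = compare_outputs(raw_sample, whitelisted_sample)
```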