Test the effect of using character whitelists in tesseract

We should try improving our OCR output by restricting tesseract to a whitelist of characters. This StackOverflow post appears to show that this is simple to do: http://stackoverflow.com/questions/2363490/limit-characters-tesseract-is-looking-for

I think we should NOT include these characters in the whitelist: \ / $ % ^ & # ! ~ £
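
A rough sketch of how the whitelist could be built and passed to tesseract. It assumes the `tessedit_char_whitelist` config variable described in the StackOverflow post, set through the `-c` command-line option. The function names, the file names and the choice of "all ASCII letters, digits and punctuation minus the excluded characters" as the starting set are my assumptions, not something we have agreed on.

```python
import string

# Characters proposed for exclusion in this issue.
EXCLUDED = set("\\/$%^&#!~£")


def build_whitelist(excluded=EXCLUDED):
    """Return ASCII letters, digits and punctuation minus the excluded set."""
    # Assumption: we start from plain ASCII; other symbols we need would
    # have to be added by hand.
    candidates = string.ascii_letters + string.digits + string.punctuation
    return "".join(ch for ch in candidates if ch not in excluded)


def build_command(image_path, output_base, whitelist):
    """Build the argument list for a whitelist-restricted tesseract run."""
    # Hypothetical helper: it only assembles the arguments, it does not run them.
    return [
        "tesseract", image_path, output_base,
        "-c", "tessedit_char_whitelist=" + whitelist,
    ]
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. Passing it as a list rather than a shell string avoids having to quote the punctuation in the whitelist.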

Of course, we'd need to test the effect of this change. I will try to find example files whose 'raw', unmodified tesseract output contains these types of characters, then compare that output with the output produced using the whitelist.
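
One possible way to do the comparison, as a sketch only: count how often the suspect characters appear in each output and measure how similar the two texts are overall. The helper names are hypothetical, and the similarity score from `difflib` is just a crude indicator of how much the text changed, not a measure of OCR accuracy.

```python
from collections import Counter
from difflib import SequenceMatcher

# Characters proposed for exclusion in this issue.
SUSPECTS = "\\/$%^&#!~£"


def count_suspect_chars(text, suspects=SUSPECTS):
    """Count occurrences of each suspect character that appears in text."""
    return dict(Counter(ch for ch in text if ch in suspects))


def similarity(raw_text, whitelisted_text):
    """Return a 0..1 ratio of how similar the two OCR outputs are."""
    return SequenceMatcher(None, raw_text, whitelisted_text).ratio()
```

For each example file, the two output texts would be read in and passed to these helpers. A whitelisted run that reports no suspect characters while staying close to the raw output elsewhere would be the result we are hoping for.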

ContentMine / phylotree

Test the effect of using character whitelists in tesseract #31