We should try improving our OCR output by restricting tesseract to a whitelist of characters. This StackOverflow post appears to show how this can be done fairly easily: http://stackoverflow.com/questions/2363490/limit-characters-tesseract-is-looking-for
I think we should NOT include these characters in the whitelist: \ / $ % ^ & # ! ~ £
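As a rough sketch of what this could look like, the snippet below builds a whitelist string with those characters left out and assembles a tesseract command line that passes it through the `tessedit_char_whitelist` config variable. The candidate character set (ASCII letters, digits and punctuation), the helper names and the file names are assumptions for illustration only, not something we have agreed on, and whether our installed tesseract version honours the whitelist is one of the things the testing would need to confirm.

```python
import string

# Characters proposed for exclusion in this issue.
EXCLUDED = set("\\/$%^&#!~£")


def build_whitelist(candidates=None):
    """Return the candidate characters minus the excluded ones, order preserved.

    The default candidate set is an assumption: ASCII letters, digits and
    punctuation. Adjust it to whatever our documents actually need.
    """
    if candidates is None:
        candidates = string.ascii_letters + string.digits + string.punctuation
    return "".join(ch for ch in candidates if ch not in EXCLUDED)


def tesseract_command(image_path, output_base, whitelist):
    """Build an argument list suitable for subprocess.run (no shell quoting needed)."""
    return [
        "tesseract",
        image_path,
        output_base,
        "-c",
        "tessedit_char_whitelist=" + whitelist,
    ]


if __name__ == "__main__":
    # "page.png" and "page_whitelisted" are placeholder names.
    print(tesseract_command("page.png", "page_whitelisted", build_whitelist()))
```

Passing the command as a list avoids shell-escaping problems with characters such as quotes and backticks that remain in the whitelist.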
Of course, we'd need to test the effect of this change. I will try to find example files whose 'raw', unmodified tesseract output contains these characters, and then compare that output with the output produced with the whitelist applied.
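One possible way to do that comparison, sketched with the standard library only: count how often each excluded character appears in each output, and list the lines that differ between the two. The function names and the sample strings are made up for illustration; real input would be the text files tesseract writes for the same image with and without the whitelist.

```python
import difflib
from collections import Counter

# Characters proposed for exclusion in this issue.
EXCLUDED = "\\/$%^&#!~£"


def excluded_counts(text):
    """Count occurrences of each excluded character found in the text."""
    return dict(Counter(ch for ch in text if ch in EXCLUDED))


def changed_lines(raw_text, whitelisted_text):
    """Return the lines that differ, prefixed '- ' (raw) or '+ ' (whitelisted)."""
    diff = difflib.ndiff(raw_text.splitlines(), whitelisted_text.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    # Invented sample text, standing in for two tesseract outputs of one page.
    raw = "Total: $5 & 10%\nHello"
    whitelisted = "Total: S5 8 10\nHello"
    print(excluded_counts(raw))
    print(excluded_counts(whitelisted))
    for line in changed_lines(raw, whitelisted):
        print(line)
```

The counts show which files are worth looking at, and the changed lines show what tesseract substituted for each excluded character, which is where the whitelist could make the output worse rather than better.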