organisciak opened this issue 8 years ago
If we had OCR quality estimates, I could see limiting the input texts to those with high quality scores. (Although there are problems with that).
Topic models or word2vec models might be effective at assigning such scores to documents now.
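As a rough illustration (not something we have in place), one crude score along these lines would be a document's share of in-vocabulary tokens against a trusted word list, e.g. the vocabulary of a pretrained word2vec model. A minimal sketch, where the vocabulary source and any threshold applied to the score are assumptions:

```python
from typing import Container, Iterable


def ocr_quality_score(tokens: Iterable[str], vocab: Container[str]) -> float:
    """Fraction of tokens found in `vocab`; higher suggests cleaner OCR."""
    toks = [t.lower() for t in tokens]
    if not toks:
        return 0.0
    hits = sum(1 for t in toks if t in vocab)
    return hits / len(toks)


# Toy example; in practice `vocab` might be the key set of a gensim
# word2vec KeyedVectors model or any curated dictionary.
vocab = {"the", "quality", "word", "greyish", "olive"}
print(ocr_quality_score(["the", "qvality", "word", "t1e"], vocab))  # 0.5
```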
I think, though, that we could sink a lot of time into refinements for fairly low reward. There are clear reasons to keep rare English OCR errors from swamping out common Hebrew words (or whatever), but splitting by language already handles that. The low-frequency English words aren't going to produce very good charts anyway.
It's disheartening to look lower in the global frequency list and see just how much space is wasted by OCR errors versus legitimate-but-rare words. For every hundred tokens of junk, you get one "greyish-olive" or "tetraspilus". Should we explore OCR accuracy estimation methods, so that after the top two million words or so we can start raising our standards for what counts as a token? We'd be able to dig deeper down the list that way, but I'm not sure it's a useful endeavor.
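To make the rank-tiered idea concrete, here is a hypothetical sketch: everything in the top N of the frequency list is kept as-is, and only rarer tokens face a stricter test. The `RANK_CUTOFF` value, the `WORDLIKE` pattern, and the optional `vocab` check are all placeholders, not a proposed implementation:

```python
import re

RANK_CUTOFF = 2_000_000  # "top two million words or so"
WORDLIKE = re.compile(r"^[^\W\d_]+(-[^\W\d_]+)*$")  # letters, optional hyphens


def keep_token(token: str, rank: int, vocab=None) -> bool:
    """Return True if the token should stay in the frequency list."""
    if rank <= RANK_CUTOFF:
        return True  # no extra scrutiny for common tokens
    if not WORDLIKE.match(token):
        return False  # digits, stray punctuation, mixed junk
    # Optional stricter check for the long tail, e.g. dictionary or
    # embedding-vocabulary membership.
    return vocab is None or token.lower() in vocab


# Usage over a rank-ordered list of (token, count) pairs:
# filtered = [(t, c) for rank, (t, c) in enumerate(freq_list, 1)
#             if keep_token(t, rank)]
```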