Bookworm-project / Hathitrust-Bookworm

A full text Bookworm on Public Domain Hathitrust works

OCR quality measures #5

Open organisciak opened 8 years ago

organisciak commented 8 years ago

It's disheartening to look lower in the global frequency list and see just how much space is wasted on OCR errors versus legitimate-but-rare words. For every hundred tokens of junk, you get one "greyish-olive" or "tetraspilus". Should we explore OCR accuracy estimation methods, so that after the top two million words or so, we can start raising our standards for what counts as a token? We'd be able to dig deeper down the list that way, but I'm not sure whether it's a useful endeavor.
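A minimal sketch of what rank-conditional filtering could look like. The `RANK_THRESHOLD` value echoes the "top two million words" figure above, and the `JUNK_PATTERNS` heuristics are illustrative assumptions, not anything implemented in this repo:

```python
import re

# Hypothetical rank below which tokens pass unconditionally;
# the two-million figure is from the comment above.
RANK_THRESHOLD = 2_000_000

# Crude, assumed heuristics for OCR junk: long consonant runs,
# digit/letter mixes, and stray punctuation inside a token.
JUNK_PATTERNS = [
    re.compile(r"[bcdfghjklmnpqrstvwxz]{5,}", re.IGNORECASE),
    re.compile(r"\d[a-z]|[a-z]\d", re.IGNORECASE),
    re.compile(r"[^\w'-]"),
]

def keep_token(token: str, rank: int) -> bool:
    """Keep common tokens unconditionally; apply stricter
    plausibility checks only below the rank threshold."""
    if rank <= RANK_THRESHOLD:
        return True
    return not any(p.search(token) for p in JUNK_PATTERNS)
```

The point of the rank condition is that common words have already proven themselves by frequency; the stricter checks only fire in the long tail, where junk dominates.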

bmschmidt commented 8 years ago

If we had OCR quality estimates, I could see limiting the input texts to those with high quality scores. (Although there are problems with that).

Topic models or word2vec models might be effective at assigning such scores to documents now.
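One hedged sketch of the word2vec idea: score a document by the fraction of its tokens that a model trained on clean text has seen, on the theory that OCR junk falls outside the model's vocabulary. The model path and the cutoff are hypothetical; gensim's `KeyedVectors` is just one way to load such a model:

```python
from gensim.models import KeyedVectors

# Hypothetical path to word vectors trained on clean text.
kv = KeyedVectors.load("clean_corpus.kv")

def ocr_quality_score(tokens: list[str]) -> float:
    """Crude document-level quality estimate: the fraction of
    tokens found in the clean-text model's vocabulary."""
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t.lower() in kv)
    return known / len(tokens)

# e.g. only ingest documents above some assumed cutoff:
# if ocr_quality_score(doc_tokens) >= 0.9:
#     ingest(doc)
```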

I think, though, that we could sink a lot of time into many refinements for fairly low reward. There are clear reasons to keep rare English OCR errors from swamping out common Hebrew words (or whatever), but the language field already handles that. The low-frequency English words aren't going to produce very good charts anyway.