Closed. danvk closed this 9 years ago.
Presence of a well-delimited English word should guarantee that a line is not gibberish. Beyond that, maybe an HMM?
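A minimal sketch of the word check, assuming "well-delimited" means a whitespace-separated token that is purely alphabetic once surrounding punctuation is stripped. The vocabulary, the `min_len` cutoff, and the function name `has_english_word` are all hypothetical; a real run would load a full word list. Note that a `False` result only means "undecided", not "gibberish".

```python
import string

# Hypothetical tiny vocabulary for illustration; swap in a real word list.
ENGLISH_WORDS = {"the", "and", "street", "avenue", "between", "corner"}


def has_english_word(line, vocab=ENGLISH_WORDS, min_len=3):
    """Return True if the line contains at least one well-delimited known word.

    True means the line should be kept. False is inconclusive and the line
    would need a further check (e.g. the HMM idea) before being dropped.
    """
    for token in line.split():
        # Strip punctuation from the token edges only, so "Street," matches
        # but "S1N~7W" stays intact and fails the alphabetic test.
        word = token.strip(string.punctuation)
        if len(word) >= min_len and word.isalpha() and word.lower() in vocab:
            return True
    return False
```

For example, `has_english_word("Corner of 5th Avenue and 42nd Street")` is true, while short symbol-heavy strings such as `"S141~7S"` or `"e4"` come back false and fall through to whatever second-stage check is used.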
Another source of gibberish lines: overly tall lines, e.g. book723062b/0001/010011.bin.png is 1004x172 pixels (the height should be closer to 30px).
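A rough sketch of a height filter, assuming the line image's pixel height is already known. The 30px expected height comes from the comment above; the `max_ratio` threshold of 3 and the function name `is_overly_tall` are guesses for illustration and would need tuning against real data.

```python
def is_overly_tall(height_px, expected_px=30, max_ratio=3.0):
    """Flag a line image whose height is far above the expected line height.

    expected_px: typical height of a single text line (about 30px here).
    max_ratio: hypothetical cutoff; heights above expected_px * max_ratio
    are treated as suspect (likely several lines or a non-text region).
    """
    return height_px > expected_px * max_ratio
```

With these defaults the 1004x172 example is flagged (172 > 90), while a line near 30px is not.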
Users seem to be doing a fine job of removing these lines by hand.
Many of these come from attempts to transcribe hand-written text, e.g.:
→ "S141~7S" (720042b)
→ "D)" (720042b)
→ "e4" (720042b)
→ "S1N~7W" (720042b)
Removing these would be no big loss and would make the output look better.