danvk / oldnyc

Mapping photos of Old New York
Apache License 2.0

Remove gibberish lines #28

Closed danvk closed 9 years ago

danvk commented 9 years ago

Many of these come from attempts to transcribe hand-written text, e.g.:

010009 bin → "S141~7S" (720042b)
010002 bin → "D)" (720042b)
01000a bin → "e4" (720042b)
01000f bin → "S1N~7W" (720042b)

Removing these would be no big loss and would make the output look better.

danvk commented 9 years ago

Presence of a well-delimited English word should guarantee that a line is not gibberish. Beyond that, maybe an HMM?
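A minimal sketch of the word check, not the project's actual code. It assumes "well-delimited" means a token set off by whitespace that is purely alphabetic once surrounding punctuation is stripped. The function name `has_english_word`, the tiny `SAMPLE_WORDS` vocabulary, and the `min_length` cutoff are all hypothetical; a real run would load a full English word list. The HMM idea is not covered here.

```python
import string

# Hypothetical stand-in vocabulary; a real filter would load a full word list.
SAMPLE_WORDS = {
    "street", "avenue", "north", "south", "east", "west",
    "corner", "view", "from", "the", "and", "showing",
}


def has_english_word(line, vocabulary=SAMPLE_WORDS, min_length=3):
    """Return True if the line contains at least one well-delimited known word.

    A token counts only if, after stripping leading and trailing punctuation,
    it is purely alphabetic, at least min_length characters long (an assumed
    cutoff), and present in the vocabulary.
    """
    for token in line.split():
        word = token.strip(string.punctuation)
        if len(word) >= min_length and word.isalpha() and word.lower() in vocabulary:
            return True
    return False


# Lines for which this returns False are only candidates for removal;
# the check guarantees "not gibberish", never "is gibberish".
samples = ["S141~7S", "D)", "e4", "S1N~7W", "View from the corner."]
kept = [s for s in samples if has_english_word(s)]
```

With the sample vocabulary above, only the last line is kept. Requiring whole alphabetic tokens means a fragment like "the4" does not count as a word.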

danvk commented 9 years ago

Another source of gibberish lines: overly-tall lines, e.g. book723062b/0001/010011.bin.png is 1004x172 (height should be closer to 30px).
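A rough sketch of flagging such lines by height. The function name `is_overly_tall` and the `tolerance` factor are hypothetical; only the roughly 30px expected height and the 1004x172 example come from the comment above.

```python
def is_overly_tall(height_px, expected_height_px=30, tolerance=2.0):
    """Flag a line image whose height far exceeds a single text line.

    expected_height_px comes from the ~30px figure mentioned above;
    tolerance is an assumed multiplier, not a tuned value.
    """
    return height_px > expected_height_px * tolerance


# The example image above is 1004x172, so its height is 172px.
flagged = is_overly_tall(172)
```

A line flagged this way likely spans several rows of text, so its transcription is probably unreliable as a single line.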

danvk commented 9 years ago

Users seem to be doing a fine job of removing these lines by hand.