danvk / oldnyc

Mapping photos of Old New York
Apache License 2.0

Detect wrapped lines #47

Closed (danvk closed 9 years ago)

danvk commented 9 years ago

Line breaks are essential for legible OCR text. OldNYC currently mirrors the original line breaks from the typewritten text, e.g.:

E. 38th Street, east from Third Avenue. At the left
is the GQuaker House (No 201) a renovated (1937) former
tenement. The Third Avenue buildings en the right bear
No's 577 - 5 and 3; the latter being the Pet Shop.
January 7, 1939
Somach Photo Service
New York City Tunnel Authority
CREDIT LINE IMPERATIVE

This looks fine when the window is wide enough, but when it's narrow, additional line breaks have to be inserted, leading to a jagged right edge.

The solution is to "unwrap" the text, i.e. to merge each line that goes most of the way to the right edge with the line that follows it:

E. 38th Street, east from Third Avenue. At the left is the GQuaker House (No 201) a renovated (1937) former tenement. The Third Avenue buildings en the right bear No's 577 - 5 and 3; the latter being the Pet Shop.
January 7, 1939
Somach Photo Service
New York City Tunnel Authority
CREDIT LINE IMPERATIVE

This could be done by counting characters per line, or by using the original bounding boxes for each line in the image.
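
A rough sketch of the character-counting variant, in Python. The function name `unwrap`, the `threshold` value of 0.85 and the optional `widths` argument are all assumptions for illustration, not existing OldNYC code. One wrinkle: in the sample above, the last line of the paragraph ("...the Pet Shop.") is nearly as long as the first, so a length cutoff alone would merge it with the date. The sketch therefore adds a second, assumed rule that a line ending in sentence punctuation is never merged with the next one. That rule will miss wrapped lines that happen to end in a period, so treat it as a starting point rather than a finished heuristic.

```python
def unwrap(lines, widths=None, threshold=0.85):
    """Merge lines that look like they were wrapped at the right edge.

    lines:     OCR'd text lines, top to bottom.
    widths:    optional extent of each line, e.g. the right edge of its
               bounding box in pixels (hypothetical input). Defaults to
               the character count of each line.
    threshold: fraction of the widest line that counts as "reaching the
               right edge" (an assumed value, not a tuned one).
    """
    if not lines:
        return []
    if widths is None:
        widths = [len(line.rstrip()) for line in lines]
    full_width = threshold * max(widths)

    merged = [lines[0].strip()]
    # Look at each line together with the line that follows it.
    for prev, prev_width, line in zip(lines, widths, lines[1:]):
        reaches_edge = prev_width >= full_width
        # Assumed extra rule: a line that ends a sentence is not wrapped.
        ends_sentence = prev.rstrip().endswith((".", "!", "?"))
        if reaches_edge and not ends_sentence:
            merged[-1] += " " + line.strip()
        else:
            merged.append(line.strip())
    return merged


sample = [
    "E. 38th Street, east from Third Avenue. At the left",
    "is the GQuaker House (No 201) a renovated (1937) former",
    "tenement. The Third Avenue buildings en the right bear",
    "No's 577 - 5 and 3; the latter being the Pet Shop.",
    "January 7, 1939",
    "Somach Photo Service",
    "New York City Tunnel Authority",
    "CREDIT LINE IMPERATIVE",
]
print("\n".join(unwrap(sample)))
```

For the bounding-box variant, the same function could be called with `widths` set to each line's right edge from the OCR output, assuming those coordinates are available alongside the text.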