fgregg / chicago-historical-addresses

Digitizing crosswalks of historical Chicago addresses

assess whether the current OCR is good enough for an OCR-and-check workflow #2

fgregg opened this issue 2 years ago (status: Open)

fgregg commented 2 years ago

do we have OCR that is good enough that it's more efficient to check the output than to do fresh transcription?

tewhalen commented 2 years ago

Having poked around in the OCR output, here are some of the key issues/regularities I've noted:

All of this creates problems for efficient/effective OCR. The physical layout of the data on the page is hugely significant, and none of the OCR systems I've looked at does a great job of preserving it. I ended up breaking down the structure of the page as best I could using image analysis and applying OCR only to what I could identify as an "address mapping row" or a "street name", rather than OCRing the entire page and then trying to extract the structure from the result. I do retain the bounding boxes of everything, whether successfully processed or rejected, so each piece can be referenced back to the image if necessary.
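
A minimal sketch of that region-based idea (not the actual code from this project; it assumes OpenCV and pytesseract, and the size thresholds and page-segmentation mode are placeholder guesses):

```python
# Sketch: find candidate text blocks with OpenCV, OCR each block separately
# with pytesseract, and keep every bounding box so results can be traced
# back to the page image.
import cv2
import pytesseract

def ocr_regions(page_image_path):
    page = cv2.imread(page_image_path)
    gray = cv2.cvtColor(page, cv2.COLOR_BGR2GRAY)
    # Binarize, then dilate horizontally so the characters of one printed
    # row merge into a single blob per region.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 3))
    blobs = cv2.dilate(binary, kernel)
    contours, _ = cv2.findContours(blobs, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    results = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        if h < 10 or w < 40:  # skip specks; these thresholds are guesses
            continue
        crop = gray[y:y + h, x:x + w]
        # --psm 7 tells tesseract to treat the crop as a single text line.
        text = pytesseract.image_to_string(crop, config="--psm 7").strip()
        # Keep the bounding box even when OCR comes back empty, so a checker
        # can always look at the corresponding patch of the scan.
        results.append({"bbox": (x, y, w, h), "text": text})
    return results
```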

tewhalen commented 2 years ago

All that said, I haven't tried to estimate how many address mappings the entire document actually contains and compare that to the number I've been able to extract. And successful OCR of a mapping isn't the only measure of quality; ease of checking/correcting the transcription is probably just as important.

leylabauer commented 2 years ago

Adding in some additional issues I can see with the original PDF:

@tewhalen, what OCR are you using for this? I'm curious to see how it's working.

tewhalen commented 2 years ago

> @tewhalen, what OCR are you using for this? I'm curious to see how it's working.

The code is in github.com/tewhalen/1909. It's a huge mess, with lots of debugging and commented-out false starts, sorry. It uses tesseract for OCR. I toyed with training an ML-based OCR system to do a better job of reading the type, but I wasn't quite able to figure that out. The basic strategy I landed on (sketched after the list below) was to:

  1. attempt to auto-crop and auto-deskew the page images (with a provision to do it by hand when that fails)
  2. automatically split them into column images
  3. pass each column to tesseract OCR for segmentation and character recognition
  4. use heuristics to guess which segments of the column images probably contain addresses or street names
  5. try to keep track of the document structure and slot what came back into the right place (with some weak error-catching).
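
A rough sketch of how those steps might fit together (illustrative only; the helper names, the fixed two-column split, and the regex heuristics are assumptions, and the real implementation lives in github.com/tewhalen/1909):

```python
import re
import cv2
import numpy as np
import pytesseract

def deskew(gray):
    # Estimate skew from the minimum-area rectangle around the ink pixels.
    # (OpenCV's angle convention differs between versions; this follows 4.5+,
    # where the reported angle is in [0, 90).)
    coords = np.column_stack(np.where(gray < 128)).astype(np.float32)
    if coords.size == 0:
        return gray
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:
        angle -= 90
    h, w = gray.shape
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(gray, matrix, (w, h), borderValue=255)

def split_columns(gray, n_columns=2):
    # Naive split at fixed fractions of the width; a real version would
    # look for the vertical whitespace gutter instead.
    w = gray.shape[1]
    edges = [round(i * w / n_columns) for i in range(n_columns + 1)]
    return [gray[:, edges[i]:edges[i + 1]] for i in range(n_columns)]

ADDRESS_ROW = re.compile(r"^\d+\s+\d+")       # e.g. "123  4567": old then new number
STREET_NAME = re.compile(r"^[A-Z][A-Z .]+$")  # an all-caps heading line

def classify(line):
    # Heuristic guess at what a segment of a column contains.
    if ADDRESS_ROW.match(line):
        return "address_mapping"
    if STREET_NAME.match(line):
        return "street_name"
    return "unknown"

def process_page(path):
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    gray = deskew(gray)
    structured = []
    current_street = None
    for column in split_columns(gray):
        for line in pytesseract.image_to_string(column).splitlines():
            line = line.strip()
            if not line:
                continue
            kind = classify(line)
            if kind == "street_name":
                current_street = line
            elif kind == "address_mapping":
                structured.append({"street": current_street, "row": line})
    return structured
```
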
aoifefahey commented 2 years ago
> occasionally an old address number is just `"`, which I think means it's the same address as above.

This is almost certainly the case. I believe it usually occurs when a single old address is split into multiple new addresses (it may also happen when several old addresses are combined into a single new address, but I don't recall any examples of that).
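
A minimal sketch of how a cleanup pass might resolve those ditto marks by carrying the previous old address forward (the record layout here is an assumption, not the project's actual data model):

```python
def fill_dittos(rows):
    # Replace a ditto mark in the old-address column with the old address
    # from the previous row. If the very first row is a ditto there is
    # nothing to carry forward, so it is left as None for manual review.
    previous_old = None
    for row in rows:
        if row["old"].strip() in {'"', "''"}:
            row["old"] = previous_old
        else:
            previous_old = row["old"]
    return rows

rows = [
    {"old": "1012", "new": "2236"},
    {"old": '"', "new": "2238"},  # one old address split into two new ones
]
print(fill_dittos(rows))
# -> [{'old': '1012', 'new': '2236'}, {'old': '1012', 'new': '2238'}]
```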