fgregg / chicago-historical-addresses

Digitizing crosswalks of historical Chicago addresses

assess whether the current OCR is good enough for an OCR-and-check workflow #2

fgregg opened this issue 2 years ago (status: Open)

fgregg commented 2 years ago

do we have OCR that is good enough that it's more efficient to check the output than to do fresh transcription?

tewhalen commented 2 years ago

Having poked around in the OCR output, here are some of the key issues/regularities I've noted:

All of this creates problems for efficient/effective OCR. The physical layout of the data on the page is hugely significant, and none of the OCR systems I've looked at does a great job of preserving it. I ended up breaking down the structure of the page as best I could using image analysis and applying OCR only to what I could identify as an "address mapping row" or a "street name", rather than OCRing the entire page and then trying to extract the structure from the result. I do retain the bounding boxes of everything, whether successfully processed or rejected, so each piece can be referenced back to the image if necessary.
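
A minimal sketch of that region-based idea (not the actual code from this project; it assumes OpenCV and pytesseract, and the size thresholds and page-segmentation mode are placeholder guesses):

```python
# Sketch: find candidate text blocks with OpenCV, OCR each block separately
# with pytesseract, and keep every bounding box so results can be traced
# back to the page image.
import cv2
import pytesseract

def ocr_regions(page_image_path):
    page = cv2.imread(page_image_path)
    gray = cv2.cvtColor(page, cv2.COLOR_BGR2GRAY)
    # Binarize, then dilate horizontally so the characters of one printed
    # row merge into a single blob per region.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 3))
    blobs = cv2.dilate(binary, kernel)
    contours, _ = cv2.findContours(blobs, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    results = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        if h < 10 or w < 40:  # skip specks; these thresholds are guesses
            continue
        crop = gray[y:y + h, x:x + w]
        # --psm 7 tells tesseract to treat the crop as a single text line.
        text = pytesseract.image_to_string(crop, config="--psm 7").strip()
        # Keep the bounding box even when OCR comes back empty, so a checker
        # can always look at the corresponding patch of the scan.
        results.append({"bbox": (x, y, w, h), "text": text})
    return results
```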

tewhalen commented 2 years ago

All that said, I haven't tried to estimate how many address mappings the entire document actually contains and compare that to the number I've been able to extract. And successful OCR of a mapping isn't the only measure of quality; ease of checking/correcting the transcription is probably just as important.

leylabauer commented 2 years ago

Adding in some additional issues I can see with the original PDF:

@tewhalen, what OCR are you using for this? I'm curious to see how it's working.

tewhalen commented 2 years ago

> @tewhalen, what OCR are you using for this? I'm curious to see how it's working.

The code is in github.com/tewhalen/1909. It's a huge mess, with lots of debugging and commented-out false starts, sorry. It uses tesseract for OCR. I toyed with training an ML-based OCR system to do a better job of reading the type, but I wasn't quite able to figure that out. The basic strategy I landed on (sketched after the list below) was to:

  1. attempt to auto-crop and auto-deskew the page images (with a provision to do it by hand when that fails)
  2. automatically split them into column images
  3. pass each column to tesseract OCR for segmentation and character recognition
  4. use heuristics to guess which segments of the column images probably contain addresses or street names
  5. try to keep track of the document structure and slot what came back into the right place (with some weak error-catching).
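
A rough sketch of how those steps might fit together (illustrative only; the helper names, the fixed two-column split, and the regex heuristics are assumptions, and the real implementation lives in github.com/tewhalen/1909):

```python
import re
import cv2
import numpy as np
import pytesseract

def deskew(gray):
    # Estimate skew from the minimum-area rectangle around the ink pixels.
    # (OpenCV's angle convention differs between versions; this follows 4.5+,
    # where the reported angle is in [0, 90).)
    coords = np.column_stack(np.where(gray < 128)).astype(np.float32)
    if coords.size == 0:
        return gray
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:
        angle -= 90
    h, w = gray.shape
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(gray, matrix, (w, h), borderValue=255)

def split_columns(gray, n_columns=2):
    # Naive split at fixed fractions of the width; a real version would
    # look for the vertical whitespace gutter instead.
    w = gray.shape[1]
    edges = [round(i * w / n_columns) for i in range(n_columns + 1)]
    return [gray[:, edges[i]:edges[i + 1]] for i in range(n_columns)]

ADDRESS_ROW = re.compile(r"^\d+\s+\d+")       # e.g. "123  4567": old then new number
STREET_NAME = re.compile(r"^[A-Z][A-Z .]+$")  # an all-caps heading line

def classify(line):
    # Heuristic guess at what a segment of a column contains.
    if ADDRESS_ROW.match(line):
        return "address_mapping"
    if STREET_NAME.match(line):
        return "street_name"
    return "unknown"

def process_page(path):
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    gray = deskew(gray)
    structured = []
    current_street = None
    for column in split_columns(gray):
        for line in pytesseract.image_to_string(column).splitlines():
            line = line.strip()
            if not line:
                continue
            kind = classify(line)
            if kind == "street_name":
                current_street = line
            elif kind == "address_mapping":
                structured.append({"street": current_street, "row": line})
    return structured
```
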
aoifefahey commented 2 years ago
> occasionally an old address number is just `"`, which I think means it's the same address as above.

This is almost certainly the case. I believe it usually occurs when a single old address is split into multiple new addresses (it may also happen when several old addresses are combined into a single new address, but I don't recall any examples of that).
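
A minimal sketch of how a cleanup pass might resolve those ditto marks by carrying the previous old address forward (the record layout here is an assumption, not the project's actual data model):

```python
def fill_dittos(rows):
    # Replace a ditto mark in the old-address column with the old address
    # from the previous row. If the very first row is a ditto there is
    # nothing to carry forward, so it is left as None for manual review.
    previous_old = None
    for row in rows:
        if row["old"].strip() in {'"', "''"}:
            row["old"] = previous_old
        else:
            previous_old = row["old"]
    return rows

rows = [
    {"old": "1012", "new": "2236"},
    {"old": '"', "new": "2238"},  # one old address split into two new ones
]
print(fill_dittos(rows))
# -> [{'old': '1012', 'new': '2236'}, {'old': '1012', 'new': '2238'}]
```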